Compiling multiple functions together

ccleve commented 1 year ago

I have a fairly large pgrx extension that implements an index access method. I'd love to run it on AWS RDS or Aurora, but can't because they don't support untrusted extensions.

Theoretically, I could get rid of the unsafe code, publish it as a crate, and include it as a dependency in one or more PL/Rust functions. The difficulty is that I have a large number of pg_extern functions, and I expect that each would have to be added as a PL/Rust function that just calls a function in the crate. How would this work, though, if each PL/Rust function gets compiled independently? Is there some provision for sharing dependencies among multiple functions, and compiling them all together into a single large extension?

workingjubilee commented 1 year ago

eeeebbbbrrrr commented 1 year ago

Yeah. I have a big extension that implements the IAM API too. I can relate.

The tl;dr version is that this isn’t going to work as you’re not going to be able to eliminate 100% of the unsafe blocks. Just defining the amhandler function would need to be unsafe. So it’s kinda DOA.

What might be interesting is for plrust the extension to somehow expose an IAM wrapper that could be implemented safely with “LANGUAGE plrust” functions.

I haven’t put any thought into what that’d look like or how practical it would be, but it seems like a tractable idea. The IAM API is pretty simple.

How/where does your index store data?

Maybe this is an idea we can discuss more. I cannot predict what AWS might allow but that shouldn’t stop us from thinking about this more.

ccleve commented 1 year ago

I write pages directly into the indexrel using some C code that calls ReadBuffer / BufferGetPage. Yes, that would be another problem to solve. AWS shouldn't object to it, though, because it doesn't write to disk directly. I posted this a while ago, but never followed up: https://github.com/pgcentralfoundation/pgrx/issues/294

I note that pg_tle does allow for hooks: https://github.com/aws/pg_tle/blob/main/docs/04_hooks.md, which presumably means that AWS does not object to callbacks in general. An amhandler is just another callback.

I do wonder what Aurora has done under the hood, and how much of the underlying data access code they have swapped out for something else. That may limit how deep they'll let us go.

eeeebbbbrrrr commented 1 year ago

> AWS shouldn't object to it, though, because it doesn't write to disk directly

I can't speak for them but I think they'd object to anything that can't be declared through plain-text DDL. No way they're gonna allow "some C code" on an RDS instance.

I haven't spent any time with pg_tle, so I'm completely unfamiliar with what it does and how it works.

I had a thought trying to get to sleep last night. Are you familiar with (what used to be) multicorn (https://multicorn.org)? I wonder if we built something similar as a pgrx-based extension to allow implementing the IAM API using any "CREATE FUNCTION" function, regardless of language. That's kinda an evolution of my idea above to build something into plrust.

There'd be hurdles around sharing the IAM argument/return type definitions in a cross-function/cross-language way, but it seems doable. And we'd definitely need to get safe wrappers in pgrx around all of Postgres' internal Buffer/Page management stuff.
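To make the shape of that idea a bit more concrete, here is a minimal sketch in plain Rust. Everything in it is hypothetical: `SafeIndexAm`, `Tid`, `ToyIndex`, and `run_scan` are names invented for illustration, not pgrx or plrust APIs. The assumption is that the wrapper extension keeps the unsafe amhandler glue to itself and only ever talks to user code through a safe trait like this one.

```rust
use std::collections::BTreeMap;

/// Hypothetical tuple identifier (block, offset); a stand-in for
/// Postgres' ItemPointerData, not the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Tid {
    pub block: u32,
    pub offset: u16,
}

/// Hypothetical safe surface a wrapper extension might expose.
/// Each method loosely mirrors one IAM callback (aminsert, amgettuple).
pub trait SafeIndexAm {
    /// Record that `key` is stored in the heap tuple at `tid`.
    fn insert(&mut self, key: i64, tid: Tid);
    /// Return every tuple id matching an equality scan on `key`.
    fn scan_eq(&self, key: i64) -> Vec<Tid>;
}

/// Toy in-memory implementation, standing in for user-written functions.
#[derive(Default)]
pub struct ToyIndex {
    entries: BTreeMap<i64, Vec<Tid>>,
}

impl SafeIndexAm for ToyIndex {
    fn insert(&mut self, key: i64, tid: Tid) {
        self.entries.entry(key).or_default().push(tid);
    }

    fn scan_eq(&self, key: i64) -> Vec<Tid> {
        self.entries.get(&key).cloned().unwrap_or_default()
    }
}

/// What the wrapper side might do: it sees only the trait object,
/// so the implementation behind it needs no unsafe code.
pub fn run_scan(am: &dyn SafeIndexAm, key: i64) -> Vec<Tid> {
    am.scan_eq(key)
}
```

A real version would have to map these methods onto functions declared with CREATE FUNCTION and agree on argument/return types across languages, which is exactly the hurdle mentioned above.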

ccleve commented 1 year ago

> I can't speak for them but I think they'd object to anything that can't be declared through plain-text DDL. No way they're gonna allow "some C code" on an RDS instance.

No, I wouldn't think so. Pgrx would have to wrap it, as you suggest. The tricky thing is that we might have to add a higher-level function around the Buffer/Page stuff, because the pg_guard overhead of calling from Rust to C is significant. Here is one of the functions that I use:

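#include "postgres.h"
#include "miscadmin.h"        /* START_CRIT_SECTION / END_CRIT_SECTION */
#include "storage/bufmgr.h"   /* ReadBuffer, LockBuffer, MarkBufferDirty */
#include "storage/lmgr.h"     /* LockRelationForExtension */
#include "utils/rel.h"        /* Relation */

/*
 * Append a page to the end of the file and write the data to it.
 * `data` must point to BLCKSZ bytes. Returns the new block number.
 * Note: no WAL record is written here.
 */
uint32_t rdb_append_page(Relation rel, char *data) {
    /* serialize relation extension across backends */
    LockRelationForExtension(rel, ExclusiveLock);

    Buffer buf = ReadBuffer(rel, P_NEW); /* P_NEW allocates a new block */
    BlockNumber newblk = BufferGetBlockNumber(buf);

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    START_CRIT_SECTION();
    Page page = BufferGetPage(buf);
    memcpy(page, data, BLCKSZ); /* overwrite page with data */
    MarkBufferDirty(buf);
    END_CRIT_SECTION();

    /* unlock and unpin only after leaving the critical section */
    UnlockReleaseBuffer(buf);

    /* another method:
    // number of first block past end
    BlockNumber blocknum = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
    smgrwrite(rel->rd_smgr, MAIN_FORKNUM, blocknum, data, false);
    */

    UnlockRelationForExtension(rel, ExclusiveLock);
    return newblk;
}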
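
As a rough idea of what a higher-level safe wrapper around that could look like on the Rust side, here is a sketch that runs without Postgres. It is only an illustration under stated assumptions: `MockRelation`, `ExtensionLock`, and `append_page` are hypothetical names, the relation is a plain in-memory vector, and 8192 is used because it is the default BLCKSZ. The point is the shape: one coarse-grained call per page (so one guarded crossing instead of one per buffer primitive), with the lock released by a Drop guard.

```rust
/// Page size; 8192 is the Postgres default for BLCKSZ.
pub const BLCKSZ: usize = 8192;

/// Stand-in for an index relation. A real wrapper would hold a Postgres
/// `Relation` here and call into the buffer manager instead.
#[derive(Default)]
pub struct MockRelation {
    pages: Vec<Box<[u8; BLCKSZ]>>,
    extension_locked: bool,
}

impl MockRelation {
    pub fn page_count(&self) -> usize {
        self.pages.len()
    }
    pub fn is_extension_locked(&self) -> bool {
        self.extension_locked
    }
    pub fn page(&self, block: u32) -> &[u8; BLCKSZ] {
        &self.pages[block as usize]
    }
}

/// RAII guard: the (mock) extension lock is released when the guard is
/// dropped, including on early return or panic unwinding.
pub struct ExtensionLock<'a> {
    rel: &'a mut MockRelation,
}

impl<'a> ExtensionLock<'a> {
    pub fn acquire(rel: &'a mut MockRelation) -> Self {
        rel.extension_locked = true;
        ExtensionLock { rel }
    }

    /// Copy a full page to the end of the relation; returns its block number.
    /// Taking `&[u8; BLCKSZ]` makes a short buffer a compile-time error.
    pub fn append_page(&mut self, data: &[u8; BLCKSZ]) -> u32 {
        self.rel.pages.push(Box::new(*data));
        (self.rel.pages.len() - 1) as u32
    }
}

impl Drop for ExtensionLock<'_> {
    fn drop(&mut self) {
        self.rel.extension_locked = false;
    }
}

/// The single safe entry point a caller would see.
pub fn append_page(rel: &mut MockRelation, data: &[u8; BLCKSZ]) -> u32 {
    let mut lock = ExtensionLock::acquire(rel);
    lock.append_page(data)
}
```

Whether one crossing per page is coarse enough to hide the pg_guard cost is something that would need measuring; the sketch only shows where the boundary could sit.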

I'm happy to write a bunch of wrapper functions and do a PR, if you like.

> I haven't spent any time with pg_tle, so I'm completely unfamiliar with what it does and how it works.

I'm confused -- I thought that was what PL/Rust was based on? https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL_trusted_language_extension.html https://github.com/aws/pg_tle

I'm not familiar with multicorn. What does it do that is different?

ccleve commented 1 year ago

> I'm not familiar with multicorn. What does it do that is different?

Never mind. I get it. My only hesitation with this approach is the performance overhead. If we can do a zero-cost abstraction, fine, but if it slows down any function that needs to get called thousands of times per query then it will be a problem. amhandler->amgettuple would be such a function.
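
One way the per-call cost might be amortized, sketched in plain Rust under loud assumptions: `TupleSource`, `BatchedScan`, and `Counter` are hypothetical names, and "crossing" just means one call across whatever boundary turns out to be expensive. The wrapper answers amgettuple-style one-tuple-at-a-time requests from a local buffer and only crosses the boundary to refill it. The generic parameter is statically dispatched, so the wrapper itself adds no dynamic-dispatch cost.

```rust
use std::collections::VecDeque;

/// Hypothetical expensive boundary: each call stands for one crossing
/// into a user-supplied function.
pub trait TupleSource {
    /// Append up to `max` matching tuple ids to `out`; returns how many.
    fn fetch_batch(&mut self, out: &mut Vec<u64>, max: usize) -> usize;
}

/// Serves one tuple per call from a buffer, refilling it in batches.
pub struct BatchedScan<S: TupleSource> {
    source: S,
    buf: VecDeque<u64>,
    batch: usize,
    crossings: usize,
}

impl<S: TupleSource> BatchedScan<S> {
    pub fn new(source: S, batch: usize) -> Self {
        BatchedScan { source, buf: VecDeque::new(), batch, crossings: 0 }
    }

    /// amgettuple-style call: returns the next match, or None when done.
    pub fn next_tuple(&mut self) -> Option<u64> {
        if self.buf.is_empty() {
            let mut tmp = Vec::with_capacity(self.batch);
            self.crossings += 1; // one crossing per refill, not per tuple
            self.source.fetch_batch(&mut tmp, self.batch);
            self.buf.extend(tmp);
        }
        self.buf.pop_front()
    }

    /// Number of times the boundary was crossed so far.
    pub fn crossings(&self) -> usize {
        self.crossings
    }
}

/// Toy source that yields the ids 0..limit in order.
pub struct Counter {
    next: u64,
    limit: u64,
}

impl Counter {
    pub fn new(limit: u64) -> Self {
        Counter { next: 0, limit }
    }
}

impl TupleSource for Counter {
    fn fetch_batch(&mut self, out: &mut Vec<u64>, max: usize) -> usize {
        let mut n = 0;
        while n < max && self.next < self.limit {
            out.push(self.next);
            self.next += 1;
            n += 1;
        }
        n
    }
}
```

With 1000 matches and a batch size of 256, draining the scan crosses the boundary five times (four refills plus the final empty one) rather than once per tuple. Whether that is enough for something like amgettuple would still need to be measured.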

eeeebbbbrrrr commented 1 year ago

> I'm happy to write a bunch of wrapper functions and do a PR, if you like.

This is probably something large enough to warrant sketching out a bit first before writing a bunch of code that we might not end up merging. I haven't put a lot of thought into safe wrappers around buffers and pages and such, so I don't even have a clue as to what it might look like.

> I'm confused -- I thought that was what PL/Rust was based on?

pg_tle is an AWS thing for packaging pure SQL as an "extension". I suppose the technical reason it exists is that RDS users don't have access to write their extension.sql file to the $(pg_config --sharedir)/extension/ directory. And I guess it provides some SQL-level hooks for certain other things.

For plrust, RDS installs the actual plrust compiled extension -- it's unrelated to pg_tle. Tho I suppose you can use pg_tle to package and manage a SQL extension that uses LANGUAGE plrust functions.

We do have some contacts there and some of them hang out on our discord (I think you're a member, yeah?). So it might be beneficial to bring this idea up in one of the #plrust channels. They've given us a few PRs here for plrust.

> if it slows down any function that needs to get called thousands of times per query then it will be a problem

Sure. I mean, initially, that could conflict with "I'd love to run it on AWS RDS or Aurora" tho. Getting something working is probably step one. Then we can sort out bottlenecks like per-call overhead.

I feel like if someone were to invent a solid pgrx-based extension that provides an abstract (and safe) API for implementing an IAM through external functions (regardless of language), then the cloud providers would take notice. At that point they'd either adopt it as-is or start offering help to make it ready for their hosted environments.

tcdi / plrust

Compiling multiple functions together #374