tada / pljava

PL/Java is a free add-on module that brings Java™ Stored Procedures, Triggers, Functions, Aggregates, Operators, Types, etc., to the PostgreSQL™ backend.
http://tada.github.io/pljava/
Other
238 stars 77 forks source link

Current status of custom FDW? #430

Open beargiles opened 1 year ago

beargiles commented 1 year ago

What's the current status of creating a FDW? I know you were working on it when we chatted over a year ago and had made some progress but still had a way to go.

Is it something where subgoals would make sense? E.g., supporting a read-only FDW first, then adding support for 'insert' and 'update'?

Is there anything that someone could help with?

For people with no idea what I'm referring to - a "FDW" = "Foreign Data Wrapper". In my side-project I have user-defined types that can hold cryptographic material like digital certificates (aka SSL certs) and encrypted private keys. They need to be loaded somehow - I can create user-defined functions that read them from local files.

However I think a cleaner approach would be creating a custom FDW for them - you would specify the location of the file and it's top-level encryption key when you define the FDW, but could then treat it like any other database when you're looking for encryption keys. This is somewhat more secure than keeping the values within the database itself since a database dump won't include them and the top-level encryption key used with the FDW should have been provided by something outside of the database itself. (An attacker could still get the keys from a filesystem dump but the top-level encryption key won't be included in that dump.)

There's also a modern twist since many sites will now prefer to use something like Hashicorp Vault instead of deploying files to the server instances. In this case the FDW would make making a call to an external REST service, not reading a local file, but the resulting 'table(s)' should look the same.

Here's the required callbacks: https://www.postgresql.org/docs/current/fdw-callbacks.html

jcflack commented 1 year ago

What I remember from looking at it last was that the callbacks involve a bunch of native PG structs to be somehow made available for manipulation in Java.

That makes it something likely to be much more pleasant to implement over the work currently in progress in PR #399. I don't know if you've had a look at that yet.

beargiles commented 1 year ago

Okay. I have the attention span of a squirrel (thanks ADHD!) but I can probably hit a few milestones below:

Advanced read-only:

You can push queries onto the FDW. By default PostgreSQL will load everything and then perform any filtering itself. If you can push (supported) queries onto the FDW then it can perform that filtering itself.

I consider this more important than adding read-write since my primary objective is to support extraction of cryptographic credentials from standard external files and the immediately putting them into a corresponding UDT. For obvious reasons I don't want the FDW to return everything.

(In fact an unrestricted query should probably result in an error - convention be damned!)

For most developers the main concern will be efficiency. There's no point in burning cycles and network bandwidth to provide data that will be immediately dropped.

jcflack commented 1 year ago

This'll probably need a good design document that evolves for a while before code is written. At first glance, it appears to me:

beargiles commented 1 year ago

I should have something soon.

I don't know what - but as several people have pointed out to me recently "some progress is better than no progress" (when seen from the outside) so even a dummy FDW that does nothing but provide some canned values would be a big step forward since it's something other people could build on. If nothing else it gets the idea past the initial hurdle of simply having something to modify vs. creating something from scratch in a complex environment.

The "minimal useful functionality" is probably the ability to replace a stored procedure that returns a cursor. Static (or null) values for things like cost, estimated record size, etc.

However this could result in a game changer since a FDW is tied to the server, not session or query, so it will have a single long-running JVM. I don't know if there's still a separate JVM per session (or even per query?) but having a long-lived JVM means you can start doing things like internal caching or speculative execution.

As for the callbacks - Holy Ramen Noodles you're right about that - but if you look the dummy implementations or even the PostgreSQL FileFDW implementation you'll see that the FDW only needs to implement around 8 functions for read-only access, and it looks like that will translate to only a few java methods. Basically an implementation of 'scan' and a way to get some metadata. I can look at your existing code to see how you load the jar, call a specific class/method, and create the resulting Tuple.

Re speculative execution. Some of the existing FDW wrap REST calls to external servers. In many cases you don't need to provide any information in the request and it's okay if the data is a little outdated. Think servers that provide routine weather information, tv schedules, etc.

In that case you can dramatically improve performance by performing speculative execution and caching the results. E.g., the FDW might specify that the cache should be updated every 10 minutes -- so the FDW sets up a loop so it makes the REST call and caches the results every 10 minutes. That cache is used when the user queries the FDW tables - they'll get an immediate response instead of waiting for a REST call to complete.

You could certainly write a stored procedure that makes the REST call and updates a PostgreSQL table, and a second stored procedure that reads from that table. (You would do this instead of reading from the table directly for encapsulation, logging access, etc.) But I can't think fo an easy way to perform the periodic updates short of setting up a cron task.

jcflack commented 1 year ago

However this could result in a game changer since a FDW is tied to the server, not session or query, so it will have a single long-running JVM. I don't know if there's still a separate JVM per session (or even per query?)

PL/Java itself is a JVM per session (well, per session where PL/Java is used).

I was not picturing a FDW behaving inherently differently. In what process do you picture such a JVM running, and what would be responsible for starting/stopping that process and the JVM within it?

beargiles commented 1 year ago

This is one reason for the steps I mentioned - it provides a way to explore that question before committing to a full implementation.

Loading and unloading the FDW's JVM is easy - there are _PG_init(void) and _PG_fini(void) hooks. There are several options on finding the JVM binary:

So I consider that a solved problem. It needs to implemented but there are several clear paths forward.

However there are many other questions that can best be resolved by testing a minimal working implementation. Prior research is important, of course, but that assumes the people who wrote the documentation anticipated this use case.

Three obvious questions

The second question is because I can see scenarios where a per-session JVM might want to use a persistent JVM for some tasks. Caching wouldn't be a good idea - it would break on a clustered database.

A better example would be access to a remote system that requires high security. (Think Kerberos, OAuth2, mutual TLS authentication, etc.) You could implement Kerberos and mutual TLS in the per-session JVM - although a careful security review might be concerned about access to the required credentials. However OAuth2 requires a HTTP/S callback to get the initial token and to renew it and that introduces uncertainty into the per-session JVM. E.g., how long should it wait for a response?

In this case I think a hybrid approach where the FDW is responsible for obtaining and (pro-actively) renewing the credentials and the per-session JVM retrieves it (e.g., via a standard SQL query) would make much more sense.

That said... it's not a big leap from "the per-session JVM makes the network call using credentials" to "the per-session JVM directly passes the information to the FDW and it makes the network call. The credentials are never exposed."

You could handle this using Tuples - the per-session JVM calls "INSERT .... RETURNING ..." so it's already an option once you support a read/write FDW. But that exposes the contents to any auditing tools with access to the actual data flow and I could see the security folks insisting on a direct call if possible.

(Hmm.... vague memories of the Java 1.0 java.rpc classes.... That might be the way to provide direct access despite using separate JVMs.)

jcflack commented 1 year ago

Loading and unloading the FDW's JVM is easy - there are _PG_init(void) and _PG_fini(void) hooks. There are several options on finding the JVM binary:

Sure. In what OS process, though, do you envision this happening?

PL/Java does it in the backend process serving your session.

beargiles commented 1 year ago

After that essay... I should be clear that my current focus is much more limited. Think "FileFDW" but with two extensions:

The FDW implementation will be a skeleton with a single required option - the location of the jar file. (Ideally something already uploaded via sqlj, but initially it will be an external jar.

The jar will need to implement one or more interfaces - TBD.

The FDW skeleton can either use the standard METADATA-INF entry to locate the implementation, or it could scan the jar for all classes that implement it.

That's enough for my immediate needs, and I suspect enough for other people.

Everything else is "hmm, that could be useful since I've encountered a similar task at work" but definitely far beyond the scope of what I'm looking at at them moment. However these are important questions to ask in order to avoid implementations that fix the immediate problem but can't be extended.

beargiles commented 1 year ago

The more I think about it the more I'm convinced that there's some way to get the session at the appropriate time, e.g., when the scan is initialized. It's just too useful to not provide.

In the meanwhile, or if we want to stick with a single JVM, the _PG_init() function can make a fork-exec call. The parent process writes the PID to the usual location, with some mechanism that supports multiple instances. (Future task....). This is a good idea even if the JVM launcher performs its own fork-exec since it gives us a place to hang additional resources that can be controlled via IPC.

_PG_fini() retrieves the PID and calls kill() so the JVM can perform a clean shutdown.

jcflack commented 1 year ago

The more I think about it the more I'm convinced that there's some way to get the session at the appropriate time, e.g., when the scan is initialized.

I'm becoming a little concerned: there's enough vagueness in what you've written there that I am not sure what your mental model is of how PostgreSQL itself works.

In the meanwhile, or if we want to stick with a single JVM, the _PG_init() function can make a fork-exec call.

PL/Java starts up from _PG_init() in the backend process serving your session. So that's already a JVM per session (that uses PL/Java). Doing a fork there to create another process and another JVM in it would lead to two processes and two JVMs per session, which I don't think anyone is pining for, and I don't think would be at all necessary for the example uses you've indicated.

I don't think there's anything about a `FileFDW-with-some-Java-manipulations that can't be done happily within the same JVM that PL/Java is already going to start for your session, and I would urge focusing on a solution that will do that, as doing anything else will be a significant complexity jump.

The jar will need to implement one or more interfaces - TBD.

I would also suggest front-loading the effort to at least sketch out those interfaces. They're likely to need some iteration, and some difficulties may be identified early that way.

a single required option - the location of the jar file. (Ideally something already uploaded via sqlj, but initially it will be an external jar.

Because PL/Java already supplies facilities for loading jars, defining class paths, and associating those with PostgreSQL schemas, I would suggest instead that the options name a PostgreSQL schema (thereby, its class path) and the name of a class that implements the required interface.

That way, you can put your FDW implementation all in one jar, or have it and supporting libraries in separate jars, or include it in the jars of your other PL/Java functionality, and you don't have to reinvent any wheels that PL/Java already has.

jcflack commented 1 year ago

... also, while a schema and class name would be required in all cases, it will be important for additional options to be accepted. The Java interface needs to include a validation method to make sure any additional options and values supplied make sense to that FDW.