tada / pljava

PL/Java is a free add-on module that brings Java™ Stored Procedures, Triggers, Functions, Aggregates, Operators, Types, etc., to the PostgreSQL™ backend.
http://tada.github.io/pljava/

Work-in-Progress on PL/Java refactoring, API modernization #399

Open jcflack opened 2 years ago

jcflack commented 2 years ago

As a work-in-progress pull request, this is not expected to be imminently merged, but is here to document the objectives and progress of the ongoing work.

Why needed

A great advantage promised by a PL based on the JVM is the large ecosystem of languages other than Java that can be supported on the same infrastructure, whether through the Java Scripting (JSR 223) API, or through the polyglot facilities of GraalVM, or simply via separate compilation to the class file format and loading as jars.

However, PL/Java, with its origins in 2004 predating most of those developments, has architectural limitations that stand in the way.

JDBC

One of the limitations is the centrality of the JDBC API. To be sure, it is a standard in the Java world for access to a database, and for PL/Java to conform to ISO SQL/JRT, the JDBC API must be available. But it is not necessarily a preferred or natural database API for other JVM or GraalVM languages, and its design goal is to abstract away from the specifics of an underlying database, which ends up complicating or even preventing access to advanced PostgreSQL capabilities that could be prime drivers for running server-side code in the first place.

The problem is not that JDBC is an available API in PL/Java, but that it is the fundamental API in PL/Java, with its tentacles reaching right into the native C language portion of PL/Java's implementation. That has made alternative interface options impractical, and multiplied the maintenance burden of even simple tasks like adding support for new datatype mappings or fixing simple bugs. There are significant portions of JDBC 4 that remain unimplemented in PL/Java.

Experience building an implementation of ISO SQL/XML XMLQUERY showed that certain requirements of the spec were simply unsatisfiable atop JDBC, either because of inherent JDBC limitations or limits in PL/Java's implementation of it. An example of each kind:

Those cases also illustrate that mapping some PostgreSQL data types to those of another language can be complex. An arbitrary PostgreSQL INTERVAL is representable as neither a java.time.Period nor a java.time.Duration alone (though a pair of the two can be used, a type that PGJDBC-NG offers). One or the other can suffice if the type modifier is known and limits the fields present. A PostgreSQL NUMERIC value has not-a-number and signed infinity values that some candidate language-library type might not, and an internal precision that its text representation does not reveal, which might need to be preserved for a mathematically demanding task. The details of converting it to another language's similar type need to be knowable or controllable by an application.
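As a small, self-contained illustration of the INTERVAL point above (the `Interval` record and `fromPostgresFields` helper are invented here for the sketch, not PL/Java API), a pair of `java.time.Period` and `java.time.Duration` can carry what neither type can alone:

```java
import java.time.Duration;
import java.time.Period;

// Sketch only: a hypothetical value class pairing Period and Duration,
// the combination needed to represent an arbitrary PostgreSQL INTERVAL.
public final class IntervalExample {
    record Interval(Period monthsAndDays, Duration time) {}

    // PostgreSQL represents an interval internally as months, days,
    // and microseconds; no single java.time type holds all three.
    static Interval fromPostgresFields(int months, int days, long micros) {
        return new Interval(
            Period.ofMonths(months).plusDays(days),
            Duration.ofNanos(micros * 1000L));
    }

    public static void main(String[] args) {
        // '1 year 2 mons 3 days 04:05:06' arrives as months=14, days=3,
        // micros for the time-of-day part:
        long micros = (4L * 3600 + 5 * 60 + 6) * 1_000_000L;
        Interval iv = fromPostgresFields(14, 3, micros);
        System.out.println(iv.monthsAndDays()); // P14M3D
        System.out.println(iv.time());          // PT4H5M6S
    }
}
```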

It is a goal of this work to give PL/Java an API that does not obscure or abstract from PostgreSQL details, but makes them accessible in a natural Java idiom, and that such a "natural PostgreSQL" API should be adequate to allow building a JDBC layer in pure Java above it. (The work of building such a JDBC layer is not in the scope of this pull request.)

Parameter and return-value mapping

PL/Java uses a simple, Java-centric approach where a Java method is declared naturally, giving ordinary Java types for its parameters and return, and the mappings from these to the PostgreSQL parameter and return types are chosen by PL/Java and applied transparently (and much of that happens deep in PL/Java's C code).

While convenient, that approach isn't easily adapted to other JVM languages that may offer other selections of types. Even for Java, it stands in the way of doing certain things possible in PostgreSQL, like declaring VARIADIC "any" functions.

In a modernized API, it needs to be possible to declare a function whose parameter represents the PostgreSQL FunctionCallInfo, so that the parameters and their types can be examined and converted in Java. That will make it possible to write language handlers in Java, whether for other JVM languages or for the existing PL/Java calling conventions that at present are tangled in C.

Elements of new API

Identification of data types

A PostgreSQL-specific API must be able to refer unambiguously to any type known to the database, so it cannot rely on any fixed set of generic types such as JDBCType. To interoperate with a JDBC layer, though, the identifier for types should implement JDBC's SQLType interface.

The API should support retrieving enough metadata about the type for a JDBC layer implemented above it to be able to report complete ResultSetMetaData information.

The new class serving this purpose is RegType.

As RegType implements the java.sql.SQLType interface, an aliasing issue arises for a JDBC layer. Such a layer should accept JDBCType.VARCHAR as an alias for RegType.VARCHAR, for example. JDBC itself has no methods that return an SQLType instance, so the question of whether it should return the generic JDBC type or the true RegType does not arise. A PL/Java-specific API is needed for retrieving the type identifier in any case.

The details of which JDBC types are considered aliases of which RegTypes will naturally belong in a JDBC API layer. At the level of this underlying API, a RegType is what identifies a PostgreSQL type.

While RegType includes convenience final fields for a number of common types, those by no means limit the RegTypes available. There is a RegType that can be obtained for every type known to the database, whether built in, extension-supplied, or user-defined.

Other PostgreSQL catalog objects and key abstractions

RegType is one among the types of PostgreSQL catalog objects modeled in the org.postgresql.pljava.model package.

Along with a number of catalog object types, the package also contains:

Mapping PostgreSQL data types to what a PL supports

The Adapter class

A mapping between a PostgreSQL data type and a suitable PL data type is an instance of the Adapter class, and more specifically of the reference-returning Adapter.As<T,U> or one of the primitive-returning Adapter.AsInt<U>, Adapter.AsFloat<U>, and so on (one for each Java primitive type). The Java type produced is T for the As case, and implicit in the class name for the AsFoo cases.

The basic method for fetching a value from a TupleTableSlot is get(Attribute att, Adapter adp), and naturally is overloaded and generic so that get with an As<T,?> adapter returns a T, get with an AsInt<?> adapter returns an int, and so on. (But see this later comment below for a better API than this item-at-a-time stuff.) (The U type parameter of an adapter plays a role when adapters are combined by composition, as discussed below, and is otherwise usually uninteresting to client code, which may wildcard it, as seen above.)
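The overload behavior described above can be modeled in a toy (these are not PL/Java's real classes; the real `get` takes an `Attribute`, modeled here as a plain column index):

```java
import java.util.List;

// Toy model of how overloaded, generic get methods can return either a
// reference type or an unboxed primitive, depending on the adapter class.
public final class GetOverloadSketch {
    interface As<T, U> { T fetch(Object datum); }    // reference-returning
    interface AsInt<U> { int fetch(Object datum); }  // primitive-returning

    static final class Slot {
        private final List<?> values;
        Slot(List<?> values) { this.values = values; }
        <T> T get(int att, As<T, ?> adp) { return adp.fetch(values.get(att)); }
        int get(int att, AsInt<?> adp)   { return adp.fetch(values.get(att)); }
    }

    public static void main(String[] args) {
        Slot slot = new Slot(List.of("42", 42));
        As<String, Void> text = d -> (String) d;
        AsInt<Void> int4 = d -> (Integer) d;
        String s = slot.get(0, text);  // compile-time type String
        int i    = slot.get(1, int4);  // unboxed int, no wrapper object
        System.out.println(s + " " + i);
    }
}
```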

A manager class for adapters

Natural use of this idiom presumes there will be some adapter-manager API that allows client code to request an adapter for some PostgreSQL type by specifying a Java witness class Class<T> or some form of super type token, and returns the adapter with the expected compile-time parameterized type.

That manager hasn't been built yet, but the requirements are straightforward and no thorny bits are foreseen. (Within the org.postgresql.pljava.internal module itself, things are simpler; no manager is needed, and code refers directly to static final INSTANCE fields of existing adapters.)

Extensibility

PL/Java has historically supported user-defined types implemented in Java, a special class of data types whose Java representations must implement a certain JDBC interface and import and export values through a matching JDBC API. In contrast, PL/Java's first-class PostgreSQL data type support—the mappings it supplies between PostgreSQL and ordinary Java types that don't involve the specialized JDBC user-defined type APIs—has been hardcoded in C using Java Native Interface (JNI) calls, and not straightforward to extend. That's a pain point for several situations:

Adapters implementable in pure Java

With this PR, code external to PL/Java's implementation can supply adapters, built against the service-provider API exposed in org.postgresql.pljava.adt.spi.

Leaf adapters

A "leaf" adapter is one that directly knows the PostgreSQL datum format of its data type, and maps that to a suitable PL type. Only a leaf adapter gets access to PostgreSQL datums, which it should not leak to other code. Code that defines leaf adapters must be granted a permission in pljava.policy.

Composing adapters

A composing, or non-leaf, adapter is one meant to be composed over another adapter. An example would be an adapter that composes over an adapter returning type T (possibly null) to form an adapter returning Optional<T>. With a selection of common composing adapters (there aren't any in this pull request, yet), it isn't necessary to provide leaf adapters covering all the ways application code might want data to be presented. No special permission is needed to create a composing adapter.
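The composition idea can be sketched with plain `java.util.function` types (the names here are invented for illustration; real composing adapters are `Adapter` subclasses, not `Function`s):

```java
import java.util.Optional;
import java.util.function.Function;

// Sketch of the composing-adapter idea: a non-leaf step wraps a possibly
// null result from an underlying (leaf) conversion into an Optional.
public final class ComposeSketch {
    static <T> Function<Object, Optional<T>> overOptional(
            Function<Object, T> leaf) {
        return datum -> Optional.ofNullable(leaf.apply(datum));
    }

    public static void main(String[] args) {
        Function<Object, String> text = d -> (String) d;  // stand-in leaf
        Function<Object, Optional<String>> opt = overOptional(text);
        System.out.println(opt.apply("x"));   // Optional[x]
        System.out.println(opt.apply(null));  // Optional.empty
    }
}
```

The composing layer never sees a PostgreSQL datum, only the leaf's already-converted result, which is why no special permission is needed.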

Java's generic types are erased to raw types for runtime, but the Java compiler saves the parameter information for runtime access through Java reflection. As adapters are composed, the Adapter class tracks the type relationships so that, for example, an Adapter<Optional<T>,T> composed over an Adapter<String,Void> is known to produce Optional<String>.

It is that information that will allow an adapter manager to satisfy a request to map a given PostgreSQL type to some PL type, by finding and composing available adapters.
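The reflection mechanism this relies on can be shown in isolation (a generic "super type token"; the `Token` class here is illustrative, not part of the Adapter API):

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Sketch of the reflection trick: type arguments are erased at runtime,
// but the compiler records a subclass's generic superclass in the class
// file, where reflection can read it back.
public final class TypeTokenSketch {
    abstract static class Token<T> {
        final Type captured;
        Token() {
            ParameterizedType pt =
                (ParameterizedType) getClass().getGenericSuperclass();
            captured = pt.getActualTypeArguments()[0];
        }
    }

    public static void main(String[] args) {
        // The anonymous subclass fixes T = Optional<String>, and that
        // survives erasure as metadata on the class itself.
        Token<java.util.Optional<String>> t =
            new Token<java.util.Optional<String>>() {};
        System.out.println(t.captured); // java.util.Optional<java.lang.String>
    }
}
```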

Contract-based adapters

For a PostgreSQL data type that doesn't have one obvious best mapping to a PL type (perhaps because there are multiple choices with different advantages, or because there is no suitable type in the PL's base library, and any application will want the type mapped to something in a chosen third-party library), a contract-based adapter may be best. An Adapter.Contract is a functional interface with parameters that define the semantically-important components of the PostgreSQL type, and a generic return type, so an implementation can return any desired representation for the type.

A contract-based adapter is a leaf adapter class with a constructor that accepts a Contract, producing an adapter between the PostgreSQL type and whatever PL type the contract maps it to. The adapter encapsulates the internal details of how a PostgreSQL datum encodes the value, and the contract exposes the semantic details needed to faithfully map the type. Contracts for many existing PostgreSQL types are provided in the org.postgresql.pljava.adt package.
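The contract idiom can be sketched as follows (the `IntervalContract` and `adapt` names and signatures are invented for this sketch and do not match the actual `org.postgresql.pljava.adt` contracts):

```java
// Sketch of a contract-based mapping: the contract declares the semantic
// components of the PostgreSQL type, and the caller chooses the result type.
public final class ContractSketch {
    @FunctionalInterface
    interface IntervalContract<T> {
        T construct(int months, int days, long microseconds);
    }

    // Stand-in for a leaf adapter: it alone knows the datum layout, and
    // hands the decoded components to whatever contract it was given.
    static <T> T adapt(byte[] datum, IntervalContract<T> contract) {
        // (real decoding elided; fixed components used for illustration)
        return contract.construct(14, 3, 14_706_000_000L);
    }

    public static void main(String[] args) {
        // One contract maps the same value to a String; another could map
        // it to a Period/Duration pair, or a third-party library type.
        String s = adapt(new byte[0],
            (m, d, us) -> m + " mons " + d + " days " + us + " us");
        System.out.println(s);
    }
}
```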

ArrayAdapter

The one supplied ArrayAdapter is contract-based. While a Contract.Array has a single abstract method, and therefore could serve as a functional interface, in practice it is not directly implementable by a lambda; there must be a subclass or subinterface (possibly anonymous) whose type parameterization the Java compiler can record. (A lambda may then be used to instantiate that.) An instance of ArrayAdapter is constructed by supplying an adapter for the array's element type along with an array contract targeting some kind of collection of the mapped type. As with a composing adapter, the Adapter class substitutes the element adapter's target Java type through the type parameters of the array contract, to arrive at the actual parameterized type of the resulting array or collection.

PostgreSQL arrays can be multidimensional, and are regular (not "jagged"; all sub-arrays at a given dimension match in size). They can have null elements, which are tracked in a bitmap, offering a simple way to save some space for arrays that are sparse; there are no other, more specialized sparse-array provisions.

Array indices need not be 0- or 1-based; the base index as well as the index range can be given independently for each dimension. PostgreSQL creates 1-based arrays by default. This information is stored with the array value, not with the array type, so a column declared with an array type could conceivably have values of different cardinalities or even dimensionalities.
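The index arithmetic implied by regular, row-major, arbitrary-base arrays can be sketched directly (this helper is illustrative, not PL/Java code):

```java
// Sketch of subscript-to-flat-offset arithmetic for a regular (non-jagged)
// multidimensional array in row-major order, with a per-dimension base
// index, as PostgreSQL allows.
public final class ArrayIndexSketch {
    static int flatOffset(int[] dims, int[] bases, int[] subscript) {
        int offset = 0;
        for (int d = 0; d < dims.length; d++) {
            int i = subscript[d] - bases[d];  // shift by this dimension's base
            if (i < 0 || i >= dims[d])
                throw new IndexOutOfBoundsException("dimension " + d);
            offset = offset * dims[d] + i;    // row-major accumulation
        }
        return offset;
    }

    public static void main(String[] args) {
        // A 2×3 array with PostgreSQL's default 1-based subscripts:
        // element [2][3] is the last of the six elements, flat offset 5.
        int[] dims = {2, 3}, bases = {1, 1};
        System.out.println(flatOffset(dims, bases, new int[] {2, 3})); // 5
    }
}
```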

The adapter is contract-based because there are many ways application code could want a PostgreSQL array to be presented: as a List or single Java array (flattening multiple dimensions, if present, to one, and disregarding the base index), as a Java array-of-arrays, as a JDBC Array object (which does not officially contemplate more than one array dimension, but PostgreSQL's JDBC drivers have used it to represent multidimensioned arrays), as the matrix type offered by some scientific computation library, and so on.

For now, one predefined contract is supplied, AsFlatList, and a static method, nullsIncludedCopy, that can be used (via method reference) as one implementation of that contract.

Java array-of-arrays

While perhaps not an extremely efficient way to represent multidimensional arrays, the Java array-of-arrays approach is familiar, and benefits from a bit of dedicated support for it in Adapter. Therefore, if you have an Adapter a that renders a PostgreSQL type Foo as Java type Bar, you can use, for example, a.a2().build() to obtain an Adapter from the PostgreSQL array type Foo[] to the Java type Bar[][], requiring the PostgreSQL array to have two dimensions, allowing each value to have different sizes along those dimensions, but disregarding the PostgreSQL array's start indices (all Java arrays start at 0).

Because PostgreSQL stores the dimension information with each value and does not enforce it for a column as a whole, a column of array values could include values with differing numbers of dimensions, which an adapter constructed this way will reject. On the other hand, the sizes along each dimension are also allowed by PostgreSQL to vary from one value to the next, and this adapter accommodates that, as long as the number of dimensions doesn't change.

The existing contract-based ArrayAdapter is used behind the scenes, but build() takes care of generating the contract. Examples are provided.

Adapter maintainability

Providing pure-Java adapters that know the internal layouts of PostgreSQL data types, without relying on JNI calls and the PostgreSQL native support routines, entails a parallel-implementation maintenance responsibility roughly comparable to that of PostgreSQL client drivers that support binary send and receive. (The risk is slightly higher because the backend internal layouts are less committed than the send/receive representations. Because they are used for data on disk, though, historically they have not changed often or capriciously.)

The engineering judgment is that the resulting burden will be manageable, and that the benefits in clarity and maintainability of the pure-Java implementations, compared to the brittle legacy Java+C+JNI approach, will predominate. The process of developing clear contracts for PostgreSQL types has already led to the discovery of one bug (#390) that could be fixed in the legacy conversions.

For the adapters supplied in the org.postgresql.pljava.internal module, it is possible to use ModelConstants.java/ModelConstants.c to ensure that key constants (offsets, flags, etc.) stay synchronized with their counterparts in the PostgreSQL C code.

Adapter is a class in the API module, with the express intent that other adapters can be developed, and found by the adapter manager through a ServiceLoader API, without being internal to PL/Java. Those might not have the same opportunity for build-time checking against PostgreSQL header files, and will have to rely more heavily on regression tests for key data values, much as binary-supporting client drivers must. The same can be true even for PL/Java internal adapters for a few PostgreSQL data types whose C implementations are so strongly encapsulated (numeric comes to mind) that necessary layouts and constants do not appear in .h files.

Known open items

In no well-defined order ....

And then

HoussemNasri commented 1 year ago

Hi @jcflack,

I found the "PL/Java refactoring for support of other JVM and polyglot languages" project while browsing PostgreSQL GSoC projects for this year, and I think it sounds really interesting. I wanted to check and see if PL/Java is still under active development - the last commit was pushed last year - and if the GSoC project would still be relevant.

Thanks!

divyaankt commented 1 year ago

Hello @jcflack, I found this while going through Postgres' GSoC 2023 projects list. I consider myself adept with Java 11+ and a bit of Scala 2.12. Though I haven't really used Postgres in professional capacity or otherwise, I'm interested in this project as I'm primarily a Systems person at heart with an inclination towards Compilers, Runtimes, PL Theory etc. I am not familiar with JNI as such and need to brush up on C, but would first start with building the project and going through the codebase. Any suggestions from your end? Looking forward towards contributing something tangible!!

Warm Regards, Divyaank

jcflack commented 1 year ago

Hello @HoussemNasri and @divyaankt,

Thank you for your interest in the project! Yes, the GSoC project is still relevant. It didn't attract any applicants last year, and so I used the time to step a bit away from PL/Java and focus on some other things for the summer, which turned into the summer and fall (and part of winter). But I am more than happy to return to PL/Java.

AbhinavMir commented 1 year ago

What a wonderful piece of technical literature.

I'm wondering if a tool that compiles non-Java JVM code to Java (under the hood) and then uses PL/SQL could be a candidate solution, since I am guessing this doesn't disrupt our JDBC centrality.

I have a bunch of other questions, but I'm just trying to grok the current sentiment going forward so I can work towards it.

jcflack commented 10 months ago

The SlotTester.test method that was hastily provided earlier, as a way of testing this new API, was rather limiting. It could only accept a String query with no parameters, causing flashbacks to the bad old days of SQL thrown together with string concatenation.

In the current state of this branch, the new API is incomplete and read-only, and the old legacy JDBC implementation is still around, so the obvious interim solution is to bridge the two: allow a JDBC Statement or PreparedStatement to issue the query, then use the new SlotTester.unwrapAsPortal to present the JDBC ResultSet as a Portal object, and proceed to retrieve the results using the new API.

The basic method for fetching a value from a TupleTableSlot is get(Attribute att, Adapter adp), and naturally is overloaded and generic so that get with an As<T,?> adapter returns a T, get with an AsInt<?> adapter returns an int, and so on.

That early approach in the new API, for the purpose of early signs of life, was JDBC-like in requiring a method call for each item to retrieve. Beyond the tedium of developing code in that style, it also forecloses many opportunities for efficient implementation, requiring any needed checks on user input (does this attribute belong to this result? is this adapter for a suitable type? is the native memory region still valid? and so on) to be repeated for every single-item fetch. It may also be advantageous to arrange the order of fetches with some knowledge of how the tuple deforming is done; again, the implementation has no flexibility for that when user code picks at the tuple one item at a time.

The idea was always to supply a multiple-columns-at-once API, which is introduced here.

TargetList, and its subinterface Projection, are used for selecting the attributes of interest. Projection, as in the algebraic usage, does not allow more than one mention of any attribute. The original Projection is the full TupleDescriptor for a result, and another Projection derived from an existing one must have a subset of its attributes, possibly reordered.

TargetList, the superinterface, relaxes the nonrepetition condition; a TargetList is allowed to mention the same attribute more than once. That should sound less efficient than simply mentioning it once and letting the user Java code copy the fetched value around, but there may be cases where it is useful. One would be when the Java code wants different representations of one PG value, produced by different Adapters. Another can be when the Java representation is a one-use-only class like SQLXML.

After shaping a Projection or TargetList to suit just what the Java code wants to retrieve, the TargetList can be applied over a List of TupleTableSlot at a time, using adapters selected for the desired Java types, and a lambda with corresponding parameters, whose types are inferred from the adapters. Functional interfaces of some likely lengths are provided, and can be curried to fit a TargetList with any number and types of columns.

An example illustrates the usage.
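As a toy model (not the PR's actual TargetList/Projection API; columns are plain indices here and adapters are elided), the one-lambda-per-row shape looks like this:

```java
import java.util.List;

// Toy model of applying a single lambda across the projected columns of
// each row, with the parameter types fixed once for the whole result.
public final class TargetListSketch {
    @FunctionalInterface
    interface L2<A, B> { void apply(A a, B b); }

    // Stand-in for "apply this projection over a list of tuples":
    static <A, B> void applyOver(List<Object[]> rows, int colA, int colB,
                                 L2<A, B> f) {
        for (Object[] row : rows) {
            @SuppressWarnings("unchecked") A a = (A) row[colA];
            @SuppressWarnings("unchecked") B b = (B) row[colB];
            f.apply(a, b);
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = List.of(
            new Object[] {"one", 1}, new Object[] {"two", 2});
        // The types are checked once when the lambda is supplied, rather
        // than being re-validated at every single-item fetch.
        applyOver(rows, 0, 1,
            (String name, Integer n) -> System.out.println(name + "=" + n));
    }
}
```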

The work in this pull request to this point generally still calls the old heavyweight item-at-a-time get methods on TupleTableSlot under the hood, leaving the optimizations suggested above for future work, but it clears the way for those, and the old per-item get methods in their current form should eventually be deprecated.

jcflack commented 9 months ago

Dispatcher for multiple PLs implemented atop PL/Java

I had thought to continue ticking more of the other open-items boxes before doing the dispatcher, but for a change of scenery, here is the new dispatcher.

The first brand-new PL/Java-based procedural language is Glot64. It will probably never grow to rival Python or JavaScript in popularity, either because it can't do anything but write messages to standard output, or because you write your functions/procedures in base 64 :). So, here is a Glot64 function that writes Hello, world! on the server's standard output when called:

CREATE OR REPLACE FUNCTION javatest.hello()
 RETURNS void
 LANGUAGE glot64
AS 'SGVsbG8sIHdvcmxkIQo=';

The impatient may see Hello, world! immediately, using an inline code block:

DO LANGUAGE glot64 'SGVsbG8sIHdvcmxkIQo=';

The output won't be visible at all if the server's standard output is going to /dev/null or the like. But a test instance run in PL/Java's test harness, for example, will have its standard output going to the terminal.

In addition to the base-64-decoded source string, you will see other output from the glot64 language handler, which is really the point, for a demonstration example. The base-64 string is just for fun.
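For the curious, the function body from the examples above is ordinary base 64 and can be decoded with the JDK's own codec:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Decoding the Glot64 function body from the examples above reveals the
// message the handler writes to standard output.
public final class Glot64Decode {
    public static void main(String[] args) {
        byte[] decoded = Base64.getDecoder().decode("SGVsbG8sIHdvcmxkIQo=");
        // Prints "Hello, world!" followed by a newline (the newline is
        // part of the encoded payload).
        System.out.print(new String(decoded, StandardCharsets.UTF_8));
    }
}
```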

Glot64, like any PL/Java-based language, needs a language handler: namely, a class that implements the PLJavaBasedLanguage interface. Various methods on that interface are used for validating functions/procedures, compiling, specializing, and calling functions/procedures, and executing inline blocks (for a language that supports those).
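The shape of such a handler can be suggested with a drastically simplified stand-in (the `MiniLanguage` interface here is invented for illustration; the real PLJavaBasedLanguage interface has a richer set of methods for validating, compiling, specializing, and calling routines, and for inline blocks):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative toy only: a Glot64-like handler against a made-up,
// two-method language interface.
public final class HandlerSketch {
    interface MiniLanguage {
        void validate(String source);     // reject bad source at CREATE time
        Runnable compile(String source);  // prepare something callable
    }

    // The Glot64 idea: the routine's source text is base-64-encoded output.
    static final MiniLanguage GLOT64 = new MiniLanguage() {
        public void validate(String source) {
            Base64.getDecoder().decode(source); // throws if not valid base 64
        }
        public Runnable compile(String source) {
            String message = new String(
                Base64.getDecoder().decode(source), StandardCharsets.UTF_8);
            return () -> System.out.print(message);
        }
    };

    public static void main(String[] args) {
        GLOT64.validate("SGVsbG8sIHdvcmxkIQo=");
        GLOT64.compile("SGVsbG8sIHdvcmxkIQo=").run(); // Hello, world!
    }
}
```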

After installing a jar containing the class that implements the language, use the name of that class to declare a validator function, using the language pljavahandler:

CREATE OR REPLACE FUNCTION javatest.glot64_validator(oid)
 RETURNS void
 LANGUAGE pljavahandler
AS 'org.postgresql.pljava.example.polyglot.Glot64'; -- class name

followed by CREATE LANGUAGE using that new function as the validator, along with PL/Java's existing routine and inline dispatcher functions as the other two handlers:

CREATE LANGUAGE glot64
 HANDLER sqlj.pljavaDispatchRoutine
 INLINE  sqlj.pljavaDispatchInline
 VALIDATOR javatest.glot64_validator;

Bear in mind that the very first still-unticked "open items" box at the top of this pull request is still:

The to-PostgreSQL direction for Adapter, TupleTableSlot, and Datum.Accessor.

and that is why no PL/Java-based function or procedure can return any results yet. Returning results will work by storing the result value (or values) into the Call.result() TupleTableSlot, and the store direction doesn't work yet; hence Glot64 is limited to writing messages on standard output.

On the other hand, fetching from a TupleTableSlot is indeed working already, so a language handler can fetch values from the Call.arguments() TupleTableSlot using whatever Adapter is appropriate to each argument's type. The Glot64 language ignores passed arguments, but that's not a necessary limitation.

Also, of course, all the other unticked boxes in that open-items list are still unticked, so plenty of work remains. But the dispatcher is here, and the PLJavaBasedLanguage interface, enough to begin experimenting with the development of language handlers for languages of interest.