jOOQ / jOOQ

jOOQ is the best way to write SQL in Java
https://www.jooq.org

Support for Trino DB #11485

Closed lukaseder closed 1 year ago

lukaseder commented 3 years ago

Formerly known as Presto (https://github.com/jOOQ/jOOQ/issues/5414), now Trino: https://trino.io/blog/2020/12/27/announcing-trino.html

Database features / missing features

Some bugs found:

Missing features to work around:

Also, existing PrestoDB limitations:


See also:

bannmann commented 3 years ago

Hint: Depending on the SQL queries one uses, people may have success using jOOQ's PostgreSQL dialects to connect to Trino. We have been using this workaround for more than 4 years now without any trouble.

HasanAmmori commented 3 years ago

@bannmann I wonder how you connect the PostgreSQL dialect to Trino, since the two dialects are vastly different. I assume you could write a SQL query that would work on both engines, but I don't think that's jOOQ's core idea

bannmann commented 3 years ago

@HasanAmmori I don't know what exactly is affected by the choice of dialects, so I can only guess: maybe the reason is that the product in question does not use jOOQ's query builder or code generation. It merely uses jOOQ to execute user-defined SQL queries and to read result sets. I imagine that is a rather unusual scenario, so if that is the only reason it works, then my hint above is useless to most people :-/

lukaseder commented 3 years ago

A dialect consists of:

By now, it takes me about 2 person-weeks to fully support a new dialect initially, discounting all the effort that appears after going live, maintaining the dialect for years. For example, I've already spent almost 2 days getting the Vertica dialect up to date with more recent jOOQ developments, given that a customer had a feature request and the investment was justified: https://github.com/jOOQ/jOOQ/issues/11650

Other ORMs, which do not offer nearly as much SQL functionality as jOOQ, may find tweaking their internals for new dialects simpler. Their generated SQL is trivial SQL-92, sometimes not even supporting unions, which are hard enough to get right on their own with all the possible cases of required/optional/forbidden parentheses!
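To illustrate that last point with a rough, dialect-agnostic sketch (the table names here are made up): some dialects require parentheses around a union branch that carries its own ORDER BY or LIMIT, while others reject parenthesised branches altogether, so an SQL generator has to know which form to emit:

-- accepted in some dialects, a syntax error in others:
(select id from t1 order by id limit 1)
union all
(select id from t2 order by id limit 1)

-- the "plain" form, where only a trailing ORDER BY applying to the whole union is allowed:
select id from t1
union all
select id from t2
order by id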

So, the rule of thumb is this: You can get jOOQ to work with an existing dialect on a new database product to some extent. It will never really be as good as what you're used to from supported dialects, and you will constantly run into issues, in which case patching jOOQ or running plain SQL are the only options. We generally do not recommend using jOOQ with an unsupported dialect, but it may still work, and may still be better than any alternative.

Having said that, I'm very open to a sponsored integration project. The financial effort of paying me for a Trino integration is likely less than the effort of patching jOOQ all the time when it doesn't work with Trino, and it will pay off in the long run, after years of Trino support.

lukaseder commented 1 year ago

Looking into this for possible inclusion in jOOQ 3.19

First "fun" roadblock:

SQL Error [13]: Query failed (#20230308_161807_00031_btv7d): line 1:17: Catalog 'memory' does not support non-null column for column name 'i'

Not sure what I should think of this πŸ˜… No support for NOT NULL constraints?
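For context, a minimal statement reproducing the error might look like this (a sketch assuming a catalog named memory backed by the memory connector; the table name is made up):

-- rejected: the memory connector does not support NOT NULL columns
create table memory.default.t (i integer not null)

-- accepted: the same table without the constraint
create table memory.default.t (i integer)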

lukaseder commented 1 year ago

Also:

This connector does not support modifying table rows

Gee. Which connector should I use for integration testing, to avoid most quirks / get access to most features? Obviously, I don't want to test them all.

ThoSap commented 1 year ago

Also:

This connector does not support modifying table rows

Gee. Which connector should I use for integration testing, to avoid most quirks / get access to most features? Obviously, I don't want to test them all.

~~@lukaseder which connector does not support this, I did not find it in the docs: https://github.com/search?q=repo%3Atrinodb%2Fdocs.trino.io+%22This+connector+does+not+support+modifying+table+rows%22&type=code~~

Nevermind, found it: https://github.com/search?q=org%3Atrinodb%20%22This%20connector%20does%20not%20support%20modifying%20table%20rows%22&type=code

I can only chime in by saying that at our company we have never used the memory connector once; we use the following connectors (of the many supported by Trino):

Just the fact that it is possible, as a (crazy) example, to query Elasticsearch, Kafka, MongoDB, Redis and Oracle using standard SQL, even joining tables/documents from different DBs (non-RDBMS with RDBMS) in the same query, makes Trino soooo powerful!

Also many Fortune 100 companies now use Delta Lake in combination with Trino to store, fetch and analyze large-scale data. Recently we also started to insert data into Delta Lake using Trino (in addition to Spark streaming).

jOOQ supporting Trino starting with 3.19.0 would make a huge difference and would also be a major selling point compared to other ORMs.

richardfearn commented 1 year ago

See https://trino.io/docs/current/connector/memory.html - when using the Memory connector the tables are effectively append-only. You can create a table, you can insert rows into it, and read them back; but you can't modify existing rows or delete them.
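A rough sketch of what that means in practice (assuming a catalog named memory; object names are made up):

-- supported by the memory connector:
create table memory.default.book (id integer, title varchar)
insert into memory.default.book values (1, 'SQL')
select * from memory.default.book

-- not supported by the memory connector:
update memory.default.book set title = 'jOOQ' where id = 1
delete from memory.default.book where id = 1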

findepi commented 1 year ago

Modifying existing rows, deletes, and NOT NULL constraints are supported by the Iceberg and Delta Lake connectors. Adding fully fledged features to the memory connector was not a priority so far, but I understand how it would fit the jOOQ testing well.

lukaseder commented 1 year ago

Thanks for looking that up, @ThoSap, @richardfearn

Just the fact that it is possible, as a (crazy) example, to query Elasticsearch, Kafka, MongoDB, Redis and Oracle using standard SQL, even joining tables/documents from different DBs (non-RDBMS with RDBMS) in the same query, makes Trino soooo powerful!

No doubt, it is compelling for users. For me, I just need the path of least resistance to access the most features that are possibly available. I would have expected the memory (or file-based) connector to be complete, because it does not "suffer" from any third-party limitations, but obviously, that would mean that all transactional and other constraints would have to be implemented by Trino itself.

I guess Trino is more about "distributing compute" rather than implementing stuff itself? Makes sense for end users, of course.

jOOQ supporting Trino starting with 3.19.0 would make a huge difference and would also be a major selling point compared to other ORMs.

Yeah πŸ˜… I guess once you use one of those rarer RDBMS from the bottom of the list here https://db-engines.com/en/ranking, then support for that RDBMS does become a "selling point." I don't think it will make a key difference for the jOOQ community, but anyway, the number of upvotes has helped the issue bubble up in this list: https://github.com/jOOQ/jOOQ/issues?page=1&q=is%3Aissue+sort%3Areactions-%2B1-desc+is%3Aopen

A contender is ClickHouse, but there, I think the vendor cheated with the upvotes.

Anyway. I was planning on adding support for 2-3 new dialects in 3.19, and this has been requested a few times by existing paying customers as well, so that justifies the effort without having to find a separate budget.

@findepi Modifying existing rows, deletes, and NOT NULL constraints are supported by the Iceberg and Delta Lake connectors. Adding fully fledged features to the memory connector was not a priority so far, but I understand how it would fit the jOOQ testing well.

I don't even know what those things are πŸ˜… (Iceberg and Delta Lake). I don't really want to spend time setting them up in a test environment.

How do you test these things? I mean, you have to have a complete test suite for all of the SQL features somewhere. Do you just test them indirectly via the connector that makes them available?

Is there a support matrix somewhere that shows which SQL features are supported by which connector? (I'm aware that each connector lists things in prose, but a matrix would be cool)

lukaseder commented 1 year ago

Anyway, the lack of DELETE support is just a problem for the setup of integration tests, which rely on a reset.sql script that deletes data again before each run. It's probably possible to work around this limitation by simply resetting the entire schema instead of just the test delta, to get this started.
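A sketch of that workaround (assuming the memory connector; object names are made up): instead of a DELETE-based reset, the test setup drops and re-creates the objects:

-- instead of: delete from memory.default.book
drop table if exists memory.default.book
create table memory.default.book (id integer, title varchar)
insert into memory.default.book values (1, 'SQL'), (2, 'jOOQ')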

findepi commented 1 year ago

How do you test these things? I mean, you have to have a complete test suite for all of the SQL features somewhere. Do you just test them indirectly via the connector that makes them available?

The Trino engine cannot do anything without a connector. Besides testing with real connectors, we test with connectors that generate data on the fly (e.g. tpch) and with mocks/stubs for unit-level tests.

Is there a support matrix somewhere that shows which SQL features are supported by which connector? (I'm aware that each connector lists things in prose, but a matrix would be cool)

I am not aware of such a matrix. @bitsondatadev?

It's probably possible to work around this limitation by simply resetting the entire schema instead of just the test delta, to get this started.

That makes sense.

lukaseder commented 1 year ago

Random musing. Some connectors don't support DELETE or NOT NULL, but hey, MATCH_RECOGNIZE, that's a piece of cake to implement πŸ˜…
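For readers who haven't seen it: a minimal MATCH_RECOGNIZE query (sketched here against the tpch connector's orders table, since that ships with Trino) looks roughly like this, and all of the pattern matching machinery runs in the engine itself, independently of the connector:

select *
from tpch.tiny.orders
  match_recognize (
    partition by custkey
    order by orderdate
    measures a.totalprice as starting_price
    pattern (a b+)
    define b as totalprice > prev(totalprice)
  )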

GavinRay97 commented 1 year ago

As a side-tangent, supporting Trino would effectively give you Presto and Athena support as well, since they use (more or less) the same engines. When we developed jOOQ for Athena, we often used Trino as a secondary testing database because it was faster to test with.

Trino is the most up to date and featureful but I'd expect 95% of the Trino stuff to work with the others.

lukaseder commented 1 year ago

Thanks, @GavinRay97. I had a similar feeling.

bitsondatadev commented 1 year ago

I guess Trino is more about "distributing compute" rather than implementing stuff itself?

Mostly correct. Trino implements a distributed query engine layer and then has a connector framework for multiple databases. The database in question dictates whether Trino implements more or less "stuff" lol. So the Iceberg/Delta Lake things that Piotr mentioned are just metadata models on top of object storage. Trino implements the most stuff with these connectors because it's not pushing queries down but running the entire execution in its query layer.

So likely the most test coverage would come from these systems, which unfortunately require more test setup. It's a pretty linear cost-to-benefit function. I think a happy medium would be using something like the PostgreSQL connector, which uses a JDBC connection to connect to Postgres. Perhaps you already have some testing infrastructure in jOOQ that sets up Postgres?

Is there a support matrix somewhere that shows which SQL features are supported by which connector? (I'm aware that each connector lists things in prose, but a matrix would be cool)

I am not aware of such a matrix. @bitsondatadev?

Nothing today but this is on the roadmap.

bitsondatadev commented 1 year ago

Random musing. Some connectors don't support DELETE or NOT NULL, but hey, MATCH_RECOGNIZE, that's a piece of cake to implement πŸ˜…

LOL! MATCH_RECOGNIZE lives in the query engine, whereas DELETE differs depending on the connector. Most connectors do support it, though. But this made me chuckle! You'll see that Trino's use cases up until the last couple of years have been OLAP-focused, so that's why there are more fancy reading capabilities and less support for updates/deletes.

lukaseder commented 1 year ago

Perhaps you already have some testing infrastructure in jOOQ that sets up Postgres?

Sure, it's one of the best-tested integrations in jOOQ. But in this case, I'd have to think about how to nest a PostgreSQL instance in a docker container containing Trino, and run all that in testcontainers, or alternatively, bridge two containers and let them communicate. If there's a path of lesser resistance, I would prefer that, because all the other 31 currently supported RDBMS require constant attention as well...

It's not unlikely, however, that this can be done in 2 steps:

  1. Get 80% of the Trino dialect to work with the memory connector.
  2. Get the rest to work "later", whenever that is.

lukaseder commented 1 year ago

Btw, thanks for your help, @bitsondatadev, @findepi: I really appreciate it!

bitsondatadev commented 1 year ago

I like that plan! πŸ˜€ Once you do the bulk of the work, are contributions for testing encouraged? I'd be happy to help with building out a test container.

findepi commented 1 year ago

I'd have to think about how to nest a PostgreSQL instance in a docker container containing Trino, and run all that in testcontainers, or alternatively, bridge two containers and let them communicate

That should be possible with an org.testcontainers.containers.Network, if you're already using testcontainers.

lukaseder commented 1 year ago

I like that plan! πŸ˜€ Once you do the bulk of the work, are contributions for testing encouraged?

Probably not? See https://github.com/jOOQ/jOOQ/blob/main/CONTRIBUTING.md I'd rather reach out with specific questions.

I'd be happy to help with building out a test container.

From experience, that won't be too hard. But I'll be happy to reach out if I run into trouble. First steps are always manual, so I'm not at the TC stage yet.

bitsondatadev commented 1 year ago

I see. Thanks for the clarification. Please tag me if you think I can be instrumental in any way!

lukaseder commented 1 year ago

A first draft was pushed to 3.19.0-SNAPSHOT last week. Now, the yak shaving begins.

lukaseder commented 1 year ago

For those of you keen on using MULTISET with Trino, perhaps give these issues a bit of love:

I'll be reporting more in the near future. It seems like there are still some rough edges when working with non-trivial JSON queries... I'm going to list more bugs that I find in the issue description. It may well be that MULTISET won't be ready for jOOQ 3.19.

GavinRay97 commented 1 year ago

To get around these issues @lukaseder, we actually use the map_from_entries() function along with cast() as json and row number aggregations

It's the only way I've found to get around that "nested aggregations in JSON_OBJ" limitation/bug and some of the other limitations. We spent months on this

To start with, here's an equivalent query for SQLite:

SELECT (SELECT JSON_OBJECT('rows', JSON_GROUP_ARRAY(j))
        FROM (SELECT JSON_OBJECT('Name', "Name", 'ArtistId', "ArtistId", 'Albums',
                                 (SELECT JSON_OBJECT('rows', JSON_GROUP_ARRAY(j))
                                  FROM (SELECT JSON_OBJECT('Title', "Title", 'ArtistId', "ArtistId", 'AlbumId',
                                                           "AlbumId", 'Tracks',
                                                           (SELECT JSON_OBJECT('rows', JSON_GROUP_ARRAY(j))
                                                            FROM (SELECT JSON_OBJECT('Name', "Name", 'TrackId', "TrackId") AS j
                                                                  FROM (SELECT *
                                                                        FROM "Track" AS "Track_pcndnm"
                                                                        WHERE "Album_urz-ar"."AlbumId" = "Track_pcndnm"."AlbumId"
                                                                        ORDER BY "Track_pcndnm"."Name" DESC LIMIT 5) AS "Track_7afunl"))) AS j
                                        FROM (SELECT *
                                              FROM "Album" AS "Album_wu7pho"
                                              WHERE "Artist_smbap_"."ArtistId" = "Album_wu7pho"."ArtistId"
                                              ORDER BY "Album_wu7pho"."Title" DESC LIMIT 3) AS "Album_urz-ar"))) AS j
              FROM (SELECT *
                    FROM "Artist" AS "Artist_yb1x2h"
                    ORDER BY "Artist_yb1x2h"."Name" DESC LIMIT 5
                    OFFSET 75) AS "Artist_smbap_")) as data

And then the same thing in Athena/Presto/Trino:

select cast(map_from_entries(array[('rows', cast(array_agg(map_from_entries(array[
  ('Name', cast("artist_base_fields"."name" as json)),
  ('ArtistId', cast("artist_base_fields"."artistid" as json)),
  ('Albums', cast(coalesce(
    "Albums"."rows_and_aggregates",
    cast(map_from_entries(array[('rows', array[])]) as json)
  ) as json))
])) as json))]) as json) "rows_and_aggregates"
from (
  select
    "AwsDataCatalog"."chinookci"."artist".*,
    row_number() over (order by "AwsDataCatalog"."chinookci"."artist"."name" desc) "rn"
  from "AwsDataCatalog"."chinookci"."artist"
  order by "rn"
) "artist_base_fields"
  left outer join (
    select
      "Albums"."artistid",
      cast(map_from_entries(array[('rows', cast(array_agg(map_from_entries(array[
        ('Title', cast("Albums"."title" as json)),
        ('ArtistId', cast("Albums"."artistid" as json)),
        ('AlbumId', cast("Albums"."albumid" as json)),
        ('Tracks', cast(coalesce(
          "Tracks"."rows_and_aggregates",
          cast(map_from_entries(array[('rows', array[])]) as json)
        ) as json))
      ])) as json))]) as json) "rows_and_aggregates"
    from (
      select
        "AwsDataCatalog"."chinookci"."album".*,
        row_number() over (
          partition by "AwsDataCatalog"."chinookci"."album"."artistid"
          order by "AwsDataCatalog"."chinookci"."album"."artistid", "AwsDataCatalog"."chinookci"."album"."title" desc
        ) "rn"
      from "AwsDataCatalog"."chinookci"."album"
      order by "rn"
    ) "Albums"
      left outer join (
        select
          "Tracks"."albumid",
          cast(map_from_entries(array[('rows', cast(array_agg(map_from_entries(array[
            ('Name', cast("Tracks"."name" as json)),
            ('TrackId', cast("Tracks"."trackid" as json))
          ])) as json))]) as json) "rows_and_aggregates"
        from (
          select
            "AwsDataCatalog"."chinookci"."track".*,
            row_number() over (
              partition by "AwsDataCatalog"."chinookci"."track"."albumid"
              order by "AwsDataCatalog"."chinookci"."track"."albumid", "AwsDataCatalog"."chinookci"."track"."name" desc
            ) "rn"
          from "AwsDataCatalog"."chinookci"."track"
          order by "rn"
        ) "Tracks"
        where "rn" <= 5
        group by "Tracks"."albumid"
      ) "Tracks"
        on "Albums"."albumid" = "Tracks"."albumid"
    where "rn" <= 3
    group by "Albums"."artistid"
  ) "Albums"
    on "artist_base_fields"."artistid" = "Albums"."artistid"
where (
  "rn" <= 80
  and "rn" > 75
)

GavinRay97 commented 1 year ago

The above was first worked out by taking the answer provided by this amazing SO person:

And realizing that it breaks after some levels of nesting, there's a bug in the decorrelator:

And then a coworker explained that you can use joins with GROUP BY to emulate the correlated subquery


lukaseder commented 1 year ago

To get around these issues @lukaseder, we actually use the map_from_entries() function along with cast() as json and row number aggregations

Yeah, I've found that one as well, and documented it in the issue: https://github.com/trinodb/trino/issues/16525#issue-1621586184

We spent months on this

Well, there is always a way to expedite SQL dialect support in native jOOQ, for the next dialect... πŸ˜…

Anyway, I may have posted this a bit early (the part about MULTISET maybe not being ready for 3.19). I'll definitely spend more time with casting stuff around this week. Perhaps, the default in Trino should be NestedCollectionEmulation.ARRAY after all, given that this is the thing that works (though I ran into some ROW projection bugs as well, which I haven't investigated yet)

And realizing that it breaks after some levels of nesting, there's a bug in the decorrelator:

I'll probably run into this one as well soon, once the yak is shaved a bit more. Thanks for the heads up!

lukaseder commented 1 year ago

(though I ran into some ROW projection bugs as well, which I haven't investigated yet)

Ah no, I remember now. It's not a bug, just missing JDBC support: https://github.com/trinodb/trino/issues/16479

lukaseder commented 1 year ago

ABSENT ON NULL semantics can be achieved with a FILTER. Instead of:

select json_array(1, null, 2 absent on null)

We can render:

select cast(filter(
  array[
    cast(1 as json),
    cast(null as json),
    cast(2 as json)
  ],
  e -> e is not null
) as json)

lukaseder commented 1 year ago

Alright, so using CAST(MAP AS JSON) instead of JSON_OBJECT and CAST(ARRAY AS JSON) instead of JSON_ARRAY seems to work around most of the limitations I've found so far.
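A hand-written sketch of the two rewrites (the exact SQL that jOOQ renders may differ):

-- instead of a JSON_OBJECT call:
select cast(map_from_entries(array[
  ('a', cast(1 as json)),
  ('b', cast(2 as json))
]) as json)

-- instead of a JSON_ARRAY call:
select cast(array[cast(1 as json), cast(2 as json)] as json)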

lukaseder commented 1 year ago

The development status quo of 3.19.0-SNAPSHOT is already available online at https://www.jooq.org/translate

[screenshot]

Note that the redundant cast will be addressed later: https://github.com/jOOQ/jOOQ/issues/14798

lukaseder commented 1 year ago

Integrated for jOOQ 3.19.

The implementation is far from complete. The fact that the in-memory backend lacks a ton of functionality (yet it is the simplest to test against without excessive test setup) means that jOOQ probably doesn't cover some stuff users might be using, e.g. DELETE or UPDATE syntax.

This can still be addressed in future releases.

lukaseder commented 1 year ago

To anyone interested in this dialect, please note that early feedback is very much appreciated! You can build the jOOQ Open Source Edition off GitHub, or work with a 3.19.0-SNAPSHOT build if you're licensed: https://www.jooq.org/download/versions