jOOQ / jOOQ

jOOQ is the best way to write SQL in Java
https://www.jooq.org

Implement a GraphQL to SQL translator #10122

Open lukaseder opened 4 years ago

lukaseder commented 4 years ago

With the recent investments in SQL/JSON, it should be possible to implement a simple GraphQL to SQL translator that can produce SQL queries from GraphQL specifications.

As shown in this Stack Overflow question, given a GraphQL schema like this:

```graphql
type Foo {
  id: ID!
  name: String
  bars: [Bar]
}

type Bar {
  id: ID!
  name: String
}

type Query {
  foo(arg: String): [Foo]
}
```

a query for foo selecting id, name, and the nested bars could be translated to a SQL query like this:

```sql
SELECT
  JSON_ARRAYAGG(
    JSON_OBJECT(
      KEY "id" VALUE foo.id,
      KEY "name" VALUE foo.name,
      KEY "bars" VALUE (
        SELECT
          JSON_ARRAYAGG(
            JSON_OBJECT(
              KEY "id" VALUE bar.id,
              KEY "name" VALUE bar.name
            )
          )
        FROM bar
        WHERE bar.foo_id = foo.id
      )
    )
  )
FROM foo
```
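
For reference, the same query could plausibly be expressed through jOOQ's SQL/JSON API. This is a minimal sketch only, assuming a configured `DSLContext ctx` and hypothetical generated table references `FOO` and `BAR` (with a `BAR.FOO_ID` foreign key):

```java
// Sketch: FOO, BAR, and ctx are assumed to exist (generated code / configured DSLContext)
import static org.jooq.impl.DSL.*;

var json = ctx
    .select(jsonArrayAgg(jsonObject(
        key("id").value(FOO.ID),
        key("name").value(FOO.NAME),
        key("bars").value(field(
            select(jsonArrayAgg(jsonObject(
                key("id").value(BAR.ID),
                key("name").value(BAR.NAME))))
            .from(BAR)
            .where(BAR.FOO_ID.eq(FOO.ID)))))))
    .from(FOO)
    .fetchOne();
```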

This uses correlated subqueries. An approach using LEFT JOIN and GROUP BY is also possible, and perhaps other options exist, e.g. using the SQL standard <JSON array constructor by query>.

This task will be divided into subtasks as follows:


See also: https://stackoverflow.com/a/61428689/521799

mdemierre commented 3 years ago

This is essentially what projects like Hasura or PostGraphile do (for PostgreSQL specifically): they introspect the whole PostgreSQL schema and build the GraphQL API from it.

Maybe those can be useful inspirations for such a feature?

I also recently stumbled upon this, which seems like a PoC for something really powerful based on jOOQ: https://github.com/shaposhnyk/graphql-jooq

lukaseder commented 3 years ago

Thanks for your comment, @mdemierre. I've seen a few jOOQ-based POCs already. It's a very promising approach, especially because all the jOOQ goodies (the many SPIs and SQL transformation capabilities) can be leveraged.

JayAhn2 commented 2 years ago

Wow! This could be another game changer, like the MULTISET and JSON functions :)

GavinRay97 commented 2 years ago

I have a working implementation of this; unfortunately, I can't release it, as it's not open source.

But I wouldn't be opposed to rewriting it all a second time. There aren't a ton of folks who have worked with the GraphQL AST and codegen stuff.

There are two approaches you can take here:

  1. Do a "mechanical" translation of the GraphQL AST to SQL

The benefit to this is that it doesn't require generating a schema, which is a non-trivial task. But you don't have any information about the target types being resolved.

To do this, you need to define a mapping between GraphQL identifiers and SQL operations, and between GraphQL arguments and relational algebra arguments (a rough sketch of such a mapping follows at the end of this comment).

I wrote a guide on how to do this using Apache Calcite's RelBuilder class as the query builder (you could trivially swap in jOOQ's DSL.* methods instead) here:

https://github.com/GavinRay97/calcite-new-frontend-tutorial/blob/master/Calcite%20Frontend%20Guide.ipynb

  2. Generate a typed GraphQL schema from known SQL datasources, and reference those types during translation

This is a bit more involved, but I've done that here:

https://github.com/GavinRay97/GraphQLCalcite

Which again could be converted to jOOQ, though with a bit more effort
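
To illustrate approach 1, here is a rough, hedged sketch (my own illustration, not the implementation mentioned above) that mechanically maps a root GraphQL field to a table name and its scalar sub-selections to column names, using graphql-java's parser and jOOQ's plain SQL identifier API. Arguments, nested selections, and quoting are deliberately ignored:

```java
import graphql.language.Document;
import graphql.language.Field;
import graphql.language.OperationDefinition;
import graphql.parser.Parser;
import org.jooq.DSLContext;
import org.jooq.Query;

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

static Query translate(DSLContext ctx, String graphqlQuery) {
    Document document = new Parser().parseDocument(graphqlQuery);
    OperationDefinition operation = (OperationDefinition) document.getDefinitions().get(0);

    // The root field, e.g. "foo" in `{ foo { id name } }`, becomes the table
    Field root = (Field) operation.getSelectionSet().getSelections().get(0);

    // Its scalar sub-selections become the projected columns
    return ctx
        .select(root.getSelectionSet().getSelections().stream()
            .map(selection -> field(((Field) selection).getName()))
            .toList())
        .from(table(root.getName()));
}
```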

lukaseder commented 2 years ago

> But I wouldn't be opposed to rewriting it all a second time.

I don't think anyone would be opposed to this! 😉

> The benefit to this is that it doesn't require generating a schema, which is a non-trivial task. But you don't have any information about the target types being resolved.

jOOQ can work with any type of schema source via org.jooq.Meta, including parsed and interpreted DDL, XML, JDBC DatabaseMetaData, etc.
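
For example (a minimal sketch, assuming a reasonably recent jOOQ version), a Meta can be built from plain DDL and inspected without connecting to a database:

```java
import org.jooq.*;
import org.jooq.impl.DSL;

public class MetaExample {
    public static void main(String[] args) {
        DSLContext ctx = DSL.using(SQLDialect.DEFAULT);

        // Interpret DDL into an org.jooq.Meta
        Meta meta = ctx.meta("""
            create table foo (id bigint primary key, name varchar(100));
            create table bar (id bigint primary key, name varchar(100),
                              foo_id bigint references foo);
            """);

        // Walk the resulting schema, e.g. to drive GraphQL type generation
        for (Table<?> table : meta.getTables())
            for (Field<?> field : table.fields())
                System.out.println(table.getName() + "." + field.getName()
                    + ": " + field.getDataType().getTypeName());
    }
}
```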

petromir commented 1 year ago

I would love to see the option to convert GraphQL request data to SQL

lukassailer commented 11 months ago

I know it's not the first prototype like this, but I've also written a program that translates GraphQL queries to SQL using jOOQ. Maybe people here are interested? Feel free to check it out :) (My goal was not to generate the GraphQL schema, but instead to create a middle layer that links GraphQL to SQL.) https://github.com/lukassailer/byos

GavinRay97 commented 11 months ago

@lukassailer I saw your comment yesterday, but there is a lot to say/unpack here so it took me a while to compose a response.

I read your thesis -- how lucky of you to work with the renowned Torsten Grust! I'm a big fan of his publications and educational content around database internals. Quite a nice fellow, too =)

For some background context, I've spent the last ~5 years doing work with programmatic GQL schema generation and query interpretation/query compilation from GQL -> [Some IR] -> SQL.

I've spent most of this time at Hasura, where I wrote the translation layers for a number of DB's (MySQL/Maria, Oracle, Trino/Presto/Athena).

There was one point I wanted to raise around your thesis:

"GraphQL servers with GraphQL to SQL mapping, such as Hasura, generate the whole GraphQL schema, thus removing the ability to customize it almost entirely. These tools may reveal too much of the database to API users. Furthermore, their opinionated nature may clash with legacy projects that have specific demands, for example, in authentication/authorization."

I don't intend this to be a discussion about Hasura, but as far as the above is concerned, this isn't quite correct.

👇 Hasura rant (expanded)

_Just_ exposing your datasource schema over a GQL API is not a viable solution to build real-world applications. In Hasura, you're free to extend the GQL API by integrating your own REST or GQL services. In the case of REST API's, you must provide GQL schema type definitions for your endpoints (if it's an OpenAPI service, we can do that inference for you).

- https://hasura.io/docs/latest/remote-schemas/overview/
- https://hasura.io/docs/latest/actions/overview/
- https://hasura.io/docs/latest/actions/open-api/

There's also an SDK for defining custom datasources, where your service is responsible for resolving the Query IR into rows of data. (This is the area I've spent a good chunk of time on and currently help maintain):

- https://hasura.io/blog/hasura-graphql-data-connectors/

As far as revealing too much of the DB to users -- it's up to a user to decide what tables ought to be exposed, and what the authorization policies should be for CRUD operations.

Another thing in this area is the issue of decoupling your raw database schema from the public API surface (or things like versioning), which DAO's/DTO's commonly handle in codebases. In Hasura, this is generally solved by defining views over tables (views in Hasura are insertable/updatable and delegate to the underlying table).

For authentication, you can either plug in a webhook that returns a JSON object containing claims that can be used in authorization policies, or use a JWT that is signed once per login rather than invoked every query. This is pretty flexible since there's not much you can't do with a webhook.

Also, you bring up JoinMonster in your thesis. There is one pretty interesting thing about JoinMonster that I didn't see discussed, and it's that it uses a novel approach to querying for nested data/relationships.

👇 JoinMonster rant (expanded)

It uses an algorithm called "nest-hydration": https://github.com/CoursePark/NestHydrationJS

Essentially, you give it a schema/mapping definition for your relationships, and it recursively builds a tree from flat rows, which you can then return as your GQL response.

It's terribly inefficient, because it does a cartesian product of the relations. But it's the only way I know of to compile arbitrarily-nested GQL queries to SQL if a DB dialect doesn't support any form of hierarchical/tree-based structures (XML, JSON, multisets, etc.). One practical example of this is AWS Redshift: its version of Postgres has no support for any of the above.
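
To make the idea concrete, here is a toy illustration (my own sketch, not NestHydrationJS's actual API) of hydrating the flat rows of a joined artist/album result back into a nested structure:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

record FlatRow(long artistId, String artistName, long albumId, String albumTitle) {}

static List<Map<String, Object>> nest(List<FlatRow> rows) {
    Map<Long, Map<String, Object>> artists = new LinkedHashMap<>();

    for (FlatRow row : rows) {
        // One artist node per distinct artistId, created on first sight
        Map<String, Object> artist = artists.computeIfAbsent(row.artistId(), id -> {
            Map<String, Object> node = new LinkedHashMap<>();
            node.put("id", id);
            node.put("name", row.artistName());
            node.put("albums", new ArrayList<Map<String, Object>>());
            return node;
        });

        // Each flat row contributes one album to its artist's nested list
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> albums = (List<Map<String, Object>>) artist.get("albums");
        albums.add(Map.of("id", row.albumId(), "title", row.albumTitle()));
    }

    return new ArrayList<>(artists.values());
}
```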

On the topic of the actual SQL generated, there's one thing I'd share with you. It's about the debacle of compiling queries for DB dialects that don't fully support correlated subqueries or LATERAL/CROSS APPLY.

Practical examples of this being Presto/Trino/Athena.

👇 SQL rant (expanded)

Suppose we have some dummy rows like this:

```sql
with artist(artistid, name) as (values
  (1, 'Artist 1'),
  (2, 'Artist 2'),
  (3, 'Artist 3')
),
album(albumid, artistid, title) as (values
  (1, 1, 'Album 1-1'),
  (2, 1, 'Album 1-2'),
  (3, 1, 'Album 1-3'),
  ---
)
```

And our goal is to fetch the first 2 artists, and then for each artist, their first 2 albums with an offset of 1. We might write something like:

```sql
SELECT *
FROM (SELECT * FROM artist OFFSET 1 LIMIT 2) AS artist
JOIN LATERAL (
  SELECT * FROM album
  WHERE album.artistid = artist.artistid
  OFFSET 1 LIMIT 2
) AS album ON TRUE
```

Except in Presto/Trino/Athena, this doesn't work =(

- https://github.com/trinodb/trino/issues/14199

```diff
Query 20231110_211220_00015_4gmuk failed: line 27:19: Given correlated subquery is not supported
  FROM (SELECT * FROM artist ORDER BY artistid OFFSET 1 LIMIT 2) AS artist
- JOIN LATERAL (SELECT * FROM album WHERE album.artistid = artist.artistid ORDER BY albumid OFFSET 1 LIMIT 2) AS album ON TRUE
```

Okay so we can't use LATERAL. What about:

```sql
SELECT
  artist.artistid,
  artist.name,
  (SELECT array_agg(album.title)
   FROM (
     SELECT * FROM album
     WHERE album.artistid = artist.artistid
     ORDER BY album.albumid
     OFFSET 1 LIMIT 2
   ) AS album
  ) AS albums
FROM (
  SELECT * FROM artist ORDER BY artistid OFFSET 1 LIMIT 2
) AS artist
```

Still no dice:

![image](https://github.com/jOOQ/jOOQ/assets/26604994/6815f7a3-e2ba-40c1-9070-068f820759cb)

The trick for databases like this is to use a combination of `row_number()` window functions in lieu of `LIMIT` and `OFFSET`, and then to use a `JOIN` with a `GROUP BY` to emulate correlated subqueries. This works for even the most barebones of SQL dialects.

> (Credit for discovering this goes to my coworker, Vamshi)

```sql
SELECT cast(array_agg(albums_json) as json)
FROM (
  SELECT artist.artistid, artist.name
  FROM artist
  ORDER BY artist.artistid
  OFFSET 1 LIMIT 2
) a
LEFT JOIN (
  SELECT
    album.artistid,
    cast(array_agg(album.title) as json) as albums_json
  FROM (
    SELECT *, row_number() OVER (PARTITION BY artistid ORDER BY albumid) as rn
    FROM album
  ) album
  WHERE rn BETWEEN 2 AND 4
  GROUP BY album.artistid
) ab ON a.artistid = ab.artistid;
```

```
-------------------------------------------------------
 [["Album 2-2","Album 2-3"],["Album 3-2","Album 3-3"]]
(1 row)
```
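
Expressed in jOOQ, the core of that `row_number()` trick might look roughly like this. A hedged sketch only, assuming hypothetical generated tables ARTIST and ALBUM and a configured `DSLContext ctx`; the outer paging of artists and the aggregation into JSON are left out:

```java
// Sketch: ARTIST, ALBUM, and ctx are assumed (generated code / configured DSLContext)
import org.jooq.Table;

import static org.jooq.impl.DSL.*;

// Rank each album within its artist, so "rn BETWEEN 2 AND 3" emulates OFFSET 1 LIMIT 2
Table<?> ranked = select(
        ALBUM.ARTISTID,
        ALBUM.TITLE,
        rowNumber().over(partitionBy(ALBUM.ARTISTID).orderBy(ALBUM.ALBUMID)).as("rn"))
    .from(ALBUM)
    .asTable("ranked");

// Join instead of a correlated subquery or LATERAL, for dialects that support neither
var result = ctx
    .select(ARTIST.ARTISTID, ARTIST.NAME, ranked.field("title", String.class))
    .from(ARTIST)
    .leftJoin(ranked)
        .on(ranked.field("artistid", Long.class).eq(ARTIST.ARTISTID))
        .and(ranked.field("rn", Integer.class).between(2, 3))
    .orderBy(ARTIST.ARTISTID)
    .fetch();
```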

The final piece I wanted to touch on was the pros/cons of doing automatic schema generation versus manual schema definitions.

👇 Schema rant (expanded)

The gist of my viewpoint and anecdotes here is roughly:

- The scale of schema sizes in enterprise applications can be massive. It's not uncommon to have to integrate datasources with 5,000+ tables.
- Trying to transcribe these sorts of schemas by hand is quite an undertaking and prone to human error.
- When you do automatic schema generation/inference, you ensure that whatever parts of the underlying data you choose to expose are always in sync with the data's definition.
- In my experience, it's fairly uncommon to interact with APIs/datasources that don't have a machine-readable schema definition (INFORMATION_SCHEMA, OpenAPI, GraphQL, SOAP, XML + WSDL, gRPC, etc.).

In my experience, the best UX is offered by at least _providing_ the opportunity to derive a schema from a datasource, and then allowing the user to customize it as they see fit. A sort of "derive-and-customize" approach that allows extensibility and enhancement.

I have a working proof of concept for taking an org.jooq.Meta and deriving an executable GraphQL schema from it.

I haven't uploaded the code yet, but I'll do it tonight or tomorrow and post here.

I'm interested in eventually getting this upstreamed into jOOQ if possible, so it'd be good to get @lukaseder's thoughts on the implementation once I share it here.
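
To give a flavour of the general shape (this is a simplified sketch of my own, not the PoC itself), each table in an org.jooq.Meta could be mapped to a graphql-java object type roughly like so:

```java
import graphql.Scalars;
import graphql.schema.GraphQLFieldDefinition;
import graphql.schema.GraphQLObjectType;
import org.jooq.Field;
import org.jooq.Table;

static GraphQLObjectType toObjectType(Table<?> table) {
    GraphQLObjectType.Builder type = GraphQLObjectType.newObject().name(table.getName());

    for (Field<?> field : table.fields()) {
        // Naive scalar mapping; a real implementation would inspect the jOOQ DataType
        // and derive relationship fields from foreign keys
        type.field(GraphQLFieldDefinition.newFieldDefinition()
            .name(field.getName())
            .type(Number.class.isAssignableFrom(field.getType())
                ? Scalars.GraphQLInt
                : Scalars.GraphQLString)
            .build());
    }

    return type.build();
}
```

Looping over Meta.getTables() with something like this, plus data fetchers that delegate to jOOQ queries, is what would make the resulting schema executable.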


Conclusion

As an aside, are you looking for work? 😃 If you wanted to work on this sort of thing for a living, Hasura would be a solid place to do it!

Also, let me know if you ever want to chat more about this topic. I'm very passionate about it and am always interested in hearing other people's perspectives and experiences.

You can reach me at (lastname).(firstname)97 at (the google mail), or on Twitter at https://twitter.com/GavinRayDev

lukaseder commented 10 months ago

> I'm interested in eventually getting this upstreamed into jOOQ if possible, so it'd be good to get @lukaseder's thoughts on the implementation once I share it here.

Always interested in use cases like these and how to make them easier to implement on top of jOOQ (e.g. your probably related feature requests for better dynamic SQL construction APIs?). I'm still not sure about GraphQL, though. Making jOOQ GraphQL-aware seems like a bottomless pit, but then again, helping GraphQL APIs build on top of jOOQ is definitely a great thing.

alf commented 4 months ago

Just bumped into this and wanted to chime in that we're also building a solution that generates implementation code for a GraphQL schema using jOOQ queries, given a mapping configuration in the form of GraphQL directives.