lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
Other
746 stars 53 forks source link

Add SQL code formatter #45

Open Ngalstyan4 opened 1 year ago

Ngalstyan4 commented 1 year ago

This should format lantern.sql and regression test SQL files. Sqlfluff is one good option to use for this, but have not looked too much into it.

dqii commented 1 year ago

I tried adding sqlfluff to Dockerfile.dev with

pip install sqlfluff

and to CMakeLists.txt with

# Find all SQL files in your project
file(GLOB_RECURSE SQL_FILES ${CMAKE_SOURCE_DIR}/*.sql)

# Add custom target to lint SQL files
add_custom_target(
  lint-sql
  COMMAND sqlfluff lint ${SQL_FILES} --dialect postgres
  COMMENT "Linting SQL files with SQLFluff"
)

# Add custom target to format SQL files
add_custom_target(
  format-sql
  COMMAND sqlfluff fix --force ${SQL_FILES} --dialect postgres
  COMMENT "Formatting SQL files with SQLFluff"
)

I was able to run it with make format-sql, but there's a lot of issues from our custom operator, such as:

L:  45 | P:  21 |  PRS | Line 45, Position 21: Found unparsable section: '<->
                       | array[0,1,0] LIMIT 7'
dqii commented 1 year ago

I think pgFormatter will be a better fit, since there is also pl/pgsql code.

dqii commented 1 year ago

I'm stopping work on this. Some comments for visibility.

The overall reason is that there isn't sufficient support for auto-formatting SQL + psql + pl/pgsql.

sqlfluff is one of the more customizable SQL formatters, but it doesn't for example support custom operators (e.g., <->), psql commands like \gset and \echo, and variable substitution like LIMIT :LIMIT.

pgFormatter worked fine with custom operators, but I think it was inadequate. It didn't for example handle CTEs well, and I didn't see a good way to fix that. Also, it didn't support "magic comments". They do support placeholders, which I tried to use for magic comments, but this caused formatting issues. (e.g., next line always started with a space)

pglast I liked because it was based on libpg_query, which is a port of Postgres's parser. I wasn't the biggest fan of all their formatting choices, but I thought it'd be easier to hack together support for the psql commands on top of it + magic comments. My attempt is included in the closed PR. In the end, it didn't handle \gset ideally, it didn't handle :'v4444' (but did handle limit :limit). I could spend some more time getting those cases to work, but then there's the additional pain of, sometimes it doesn't format \gset in the way you might like.

At that point, if you're constantly using magic comments to force it to format in an aesthetic way, what is the point?

I think if there's a nice formatter with psql support in the future, it would be great to use that.

bobdevine commented 12 months ago

Hi, I'm taking a fresh look at this issue. Two solutions seem possible: 1) use a third-party formatter/beautifier to standardize the SQL statements. But while there are dozens to consider, most fail to handle the PostgreSQL-specific syntax. 2) Use the PostgreSQL lexer/parser. This solution guarantees full comparability. If the query-string's syntax is correct, a parse tree is created and then the tree can be printed in standard format.

I suggest that choice (2) be implemented using a library (https://github.com/pganalyze/libpg_query).

If this approach is acceptable, I can write a testing tool.

Ngalstyan4 commented 12 months ago

This sounds interesting, @bobdevine !

Quick question:

Use the PostgreSQL lexer/parser. This solution guarantees full comparability

Does this guarantee compatibility with PSQL magic commands (\d, \t+, etc) or is it only about SQL? Our formatter needs to handle PSQL magic commands since we use those in our tests.

It seems @dqii considered an approach similar to your proposal (2) via pglast (which is also based on libpg_query!) but that approach did not handle PSQL magic commands.

bobdevine commented 11 months ago

formatter.zip

Here's a zip-file containing 2 C files to read a test file and then format all lines that are SQL. Formatting uses the PostgreSQL parser to validate the SQL command. Then the parsed SQL is "deparsed" to a canonical string.