Open psteinroe opened 11 months ago
cc @nene because of https://github.com/nene/sql-parser-cst
Have you guys explored the idea of using property based testing to have greater confidence in the implementation accuracy? Great work by the way just read the post!
> Have you guys explored the idea of using property based testing to have greater confidence in the implementation accuracy?
I am currently thinking about leveraging an LLM for this. I have no practical experience with it, but in theory, I think something like this could work:
and maybe even have it write `get_node_properties` itself:
`get_node_properties` to fix the error. Repeat until the parser succeeds. I don't know if that really works, but it would take away a lot of repetitive work.
Well, honestly, that is a novel approach. I wouldn't be surprised if there are papers being written comparing PBT against the use of LLMs.
The good things about PBT are that the sampling phase gives good coverage across the whole problem domain and that, if an error is found, a shrinking phase looks for the minimal failing example.
Those things (I believe) cannot be controlled when using LLMs.
The safest approach would be PBT, IMHO, but I do believe the LLM approach is a very good idea (the problem is that you will be on your own, because you are moving into uncharted territory).
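For readers unfamiliar with the two phases mentioned above, here is a minimal hand-rolled sketch of sampling and shrinking. It is not taken from this repository: `Lcg`, `sample`, `shrink` and `check` are hypothetical names, the five-word vocabulary is made up, and in practice a crate such as `proptest` or `quickcheck` would do this work far better.

```rust
/// Tiny deterministic pseudo-random generator (linear congruential), used
/// only so that this sketch needs no external crates.
pub struct Lcg(pub u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Sampling phase: build a random token sequence from a small, made-up vocabulary.
pub fn sample(rng: &mut Lcg) -> Vec<&'static str> {
    const WORDS: [&str; 5] = ["select", "1", "from", "contact", "union"];
    let len = (rng.next_u64() % 8) as usize;
    (0..len)
        .map(|_| WORDS[(rng.next_u64() % 5) as usize])
        .collect()
}

/// Shrinking phase: keep dropping single elements for as long as the
/// property still fails, then return the smallest failing input found.
pub fn shrink<T: Clone>(mut failing: Vec<T>, holds: &dyn Fn(&[T]) -> bool) -> Vec<T> {
    loop {
        let smaller = (0..failing.len())
            .map(|i| {
                let mut candidate = failing.clone();
                candidate.remove(i);
                candidate
            })
            .find(|candidate| !holds(candidate.as_slice()));
        match smaller {
            Some(candidate) => failing = candidate,
            None => return failing,
        }
    }
}

/// Runs `cases` random samples and returns a shrunk counterexample, if any.
pub fn check(
    seed: u64,
    cases: usize,
    holds: &dyn Fn(&[&'static str]) -> bool,
) -> Option<Vec<&'static str>> {
    let mut rng = Lcg(seed);
    for _ in 0..cases {
        let input = sample(&mut rng);
        if !holds(input.as_slice()) {
            return Some(shrink(input, holds));
        }
    }
    None
}

#[test]
fn shrinks_to_the_single_offending_token() {
    // Pretend property: "the parser handles every input without `union`".
    let holds = |tokens: &[&'static str]| !tokens.contains(&"union");
    assert_eq!(check(42, 200, &holds), Some(vec!["union"]));
}
```

The shrinking step is what turns a long random failure into a one-token reproduction, which is the part that is hard to get from an LLM-driven loop.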
To whoever deleted the comment about combining queries (`union`, `intersect`, etc.): they should now be fixed in #98.
We recently finished the implementation of the core parser algorithm. Details can be found in this blog post.
While a lot of work has been automated with procedural macros, we still need to provide the keywords for a node in `get_node_properties` manually. For example, a `SelectStmt` node has the `select` keyword as a property, and if there is a `from_clause`, a `from` keyword. This is implemented as

Requirements
Set up a Rust development environment: `rustup-init`

Run your first test

To make it as easy as possible for contributors to get into it, we prepared a simple test suite to quickly assert the tokens for a node.

1. Open `crates/parser/src/codegen.rs`. You will find a bunch of tests there, e.g. `test_simple_select`. In each test, the `test_get_node_properties` helper is called. The parameters are a valid SQL input `str`, the kind of the node we want to test for, and a `Vec<TokenProperty>` with the properties we expect.
2. `cd` into `crates/parser`.
3. Run `RUST_LOG=DEBUG cargo test test_simple_select`. You will see both the entire graph and the node you are testing against in the logs. The test should pass, since we are expecting only the `Select` keyword.

Become familiar with the codebase
A lot of logic is generated using macros, which makes it harder to find the definitions of structs, enums and functions. Most important in this case are `SyntaxKind` and `TokenProperty`.

`SyntaxKind` is an enum that holds a variant for every node and token that Postgres supports, plus some custom ones that we need for parsing, such as `Eof` (end of file). To figure out exactly which variants exist, you can look into the generated source of `pg_query.rs` to find out how nodes and tokens are defined. This is also useful to figure out what properties each node has. Open the file `crates/parser/src/parse/libpg_query_node.rs` and go to the definition of `use pg_query::NodeEnum;`. The file contains all nodes and tokens. Search for `enum Token` to find the latter.

`TokenProperty` is a struct that defines a property of a node. It is defined in the codemod for `get_node_properties` and holds either a string `value` or a `kind` or both. You can create a `TokenProperty` from various types, e.g. `TokenProperty::from(SyntaxKind::SelectStmt)`.

Add your own test
Let's say you want to add a test asserting that the `SelectStmt` node contains the `from` property when a from clause is used. We choose a simple `select 1 from contact;` statement as our input. Since there are many nodes within this statement, we have to pass the kind of the node that we want to test: `SyntaxKind::SelectStmt`. We expect two properties: `Select` and `From`. The test now looks like:

Note that `SyntaxKind` contains all node and token types. In this case, the implementation has already been done and the test should pass. If your sample contains a case that is not yet implemented, go to `crates/codegen/src/get_node_properties` and add the missing implementation to `custom_handlers`. You should be able to figure out what properties are relevant from the node definition generated by pg_query.rs (see above).

Any help is highly appreciated. Please let us know if you have problems setting everything up. We are very happy to support and improve this guide to maximise efficiency for contributors.
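To make the walkthrough above more concrete, here is a rough, self-contained sketch of the idea behind `TokenProperty` and the `SelectStmt` rule (always `Select`, plus `From` when a from clause is present). Everything in it is a simplified stand-in and an assumption: the real `SyntaxKind` and `TokenProperty` are macro-generated and have many more variants and conversions, the real `SelectStmt` comes from pg_query.rs, and `select_stmt_properties` is a hypothetical name, not the repository's `get_node_properties` or its `test_get_node_properties` helper.

```rust
/// Stand-in for the generated `SyntaxKind` enum (only three variants here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    SelectStmt,
    Select,
    From,
}

/// Stand-in for `TokenProperty`: holds a string `value`, a `kind`, or both.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenProperty {
    pub value: Option<String>,
    pub kind: Option<SyntaxKind>,
}

impl From<SyntaxKind> for TokenProperty {
    fn from(kind: SyntaxKind) -> Self {
        TokenProperty { value: None, kind: Some(kind) }
    }
}

impl From<&str> for TokenProperty {
    fn from(value: &str) -> Self {
        TokenProperty { value: Some(value.to_string()), kind: None }
    }
}

/// Simplified stand-in for the `SelectStmt` node; only the field that
/// matters for this example is modelled.
pub struct SelectStmt {
    pub from_clause: Vec<String>,
}

/// Hypothetical handler: lists the keyword tokens that belong to the node.
pub fn select_stmt_properties(node: &SelectStmt) -> Vec<TokenProperty> {
    let mut props = vec![TokenProperty::from(SyntaxKind::Select)];
    if !node.from_clause.is_empty() {
        props.push(TokenProperty::from(SyntaxKind::From));
    }
    props
}

#[test]
fn select_with_from_clause_has_from_keyword() {
    // Models the node we would expect for `select 1 from contact;`.
    let node = SelectStmt { from_clause: vec!["contact".to_string()] };
    assert_eq!(
        select_stmt_properties(&node),
        vec![
            TokenProperty::from(SyntaxKind::Select),
            TokenProperty::from(SyntaxKind::From),
        ]
    );
}
```

In the actual codebase the node would come from parsing the SQL string rather than being constructed by hand, so treat this only as an illustration of which properties a handler is expected to return.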