Highlights

The lexer is now a separate module that emits all tokens, including whitespace, by merging the output of libpg_query's scan() method with a very simple custom lexer that extracts the whitespace tokens.
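The merge step can be sketched roughly like this (a minimal Rust sketch; `Token` and `merge_whitespace` are illustrative names, not the actual API): since scan() skips whitespace, the gaps between consecutive token spans are emitted as whitespace tokens.

```rust
// Hypothetical sketch: scan() reports byte spans for non-whitespace tokens,
// so we fill the gaps between those spans with whitespace tokens.
#[derive(Debug, PartialEq)]
enum Token {
    Scanned(String),    // token text as returned by scan()
    Whitespace(String), // text between scanned tokens
}

fn merge_whitespace(source: &str, spans: &[(usize, usize)]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    for &(start, end) in spans {
        if start > pos {
            tokens.push(Token::Whitespace(source[pos..start].to_string()));
        }
        tokens.push(Token::Scanned(source[start..end].to_string()));
        pos = end;
    }
    if pos < source.len() {
        tokens.push(Token::Whitespace(source[pos..].to_string()));
    }
    tokens
}
```

For `"select 1;"` with spans for `select`, `1`, and `;`, this yields the scanned tokens with a single whitespace token in between.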
Statement parsing was rewritten to be resilient. Instead of regular expressions, we now use a simple LL parser. The idea is to check whether a new statement is starting by comparing the first few tokens. Once a statement has started, we walk all tokens until a new statement starts, EOF is reached, or a ";" is found. Tokens within sub-statements (enclosed by "(...)") are not tested against these conditions.
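The token walk described above can be sketched as follows (illustrative Rust, not the actual implementation; for brevity this only splits on top-level ";" and end of input, while the real walk also checks whether the next few tokens start a new statement):

```rust
// Split a token stream into statement ranges. A ";" only terminates a
// statement at parenthesis depth 0, so tokens inside "(...)" are skipped.
fn statement_ranges(tokens: &[&str]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    let mut depth = 0usize;
    for (i, tok) in tokens.iter().enumerate() {
        match *tok {
            "(" => depth += 1,
            ")" => depth = depth.saturating_sub(1),
            ";" if depth == 0 => {
                ranges.push((start, i)); // statement ends before the ";"
                start = i + 1;
            }
            _ => {}
        }
    }
    if start < tokens.len() {
        ranges.push((start, tokens.len())); // trailing statement without ";"
    }
    ranges
}
```

Tracking the parenthesis depth is what makes sub-statements opaque to the statement boundary checks.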
While valid statements are parsed with libpg_query, we can easily implement a custom resilient parser, statement by statement, for invalid ones.
Invalid statements are parsed "flat", meaning that we just open the node, apply all tokens, and close the node.
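A flat parse is trivial to sketch (the event and node names below are hypothetical, not the real API):

```rust
// "Flat" parsing of an invalid statement: open one node, emit every token
// into it, close the node. No structure is recovered inside the statement.
#[derive(Debug, PartialEq)]
enum Event {
    Open(&'static str),
    Token(String),
    Close,
}

fn parse_flat(tokens: &[&str]) -> Vec<Event> {
    let mut events = vec![Event::Open("InvalidStatement")];
    for tok in tokens {
        events.push(Event::Token(tok.to_string()));
    }
    events.push(Event::Close);
    events
}
```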
The parser for valid statements is now very performant. We turn the AST into an untyped tree structure in which each node holds its list of properties. The parser then walks the tokens once, efficiently finds the next valid node, and opens and closes nodes accordingly.
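The untyped structure can be pictured as below (a hedged sketch under simplified assumptions: properties are plain token texts here, and `next_matching_node` stands in for the lookup the single token walk performs; all names are illustrative):

```rust
// Each AST node is flattened into a kind plus the ordered list of
// properties (token texts) it expects. During the single token walk, a
// node is opened when its next pending property matches the current token.
struct UntypedNode {
    kind: &'static str,
    properties: Vec<&'static str>,
}

fn next_matching_node<'a>(
    nodes: &'a [UntypedNode],
    current_token: &str,
) -> Option<&'a UntypedNode> {
    nodes
        .iter()
        .find(|n| n.properties.first().is_some_and(|p| *p == current_token))
}
```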
The parser for valid statements is also "stable", meaning that it will only ever produce a valid CST or panic. No manual comparison is required anymore.
What kind of change does this PR introduce?

Complete rewrite of the parser for increased resilience and performance.