Parser trips up on common snowflake query history

peteclark3 commented 3 years ago

Currently, the parser trips up on many common snowflake query history entries like select query_text from table(information_schema.query_history()); - also queries with the rm @SNOWFLAKE_... syntax... also queries with the keyword recluster ... in the latter case, the error being syntax error at or near "recluster", at index 35 ... I am systematically removing these from analysis prior to sending to the analyzer but just FYI that without doing this, the analyzer throws an exception

vrajat commented 3 years ago

Yes. This is a known issue and happy to discuss how to approach parsing Snowflake queries better. Right now, this package uses pglast for parsing. So it parses Postgres syntax. It also does pretty well with Redshift since the dialect is very similar. A set of parsers are required to support different databases. There are a few options that I have considered and funded through freelance gigs:

Use Apache Calcite and add support for custom syntax. This worked pretty well for a gig but it is in Java. Since this package is in Python, Python/Java bridges are a pain to use.
Use lark and build a parser. Note that one doesnt have to start from scratch.

Both of these require substantial effort and I am looking for opportunities to fund it.

peteclark3 commented 3 years ago

Ok. Is it fair to say that this package doesn't really support snowflake yet? I see it in the marketing materials - https://tokern.io/data-lineage/ .. but in practice when I try the examples, I never get any result in the graph because it trips up on so many queries. If there is an implementation that does use snowflake effectively, or at least filter out the syntax that pglast doesn't support, I'd be happy to use it... but right now I think I am going to have to try a different package because it's getting to a point where I have so many query syntax exclusions that it's becoming not feasible... (even then, queries are failing and not telling me why.. I'm just getting a bunch of these now):

tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/server.py", line 186, in post
tokern-data-lineage |     chosen_visitor = analyze_dml_query(self._catalog, parsed, source)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 54, in analyze_dml_query
tokern-data-lineage |     chosen_visitor = visit_dml_query(parsed, source)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 89, in visit_dml_query
tokern-data-lineage |     v(parsed.node)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 197, in __call__
tokern-data-lineage |     result = method(ancestors, node)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/dml_visitor.py", line 254, in visit_SelectStmt
tokern-data-lineage |     super().__call__(node.intoClause)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 190, in __call__
tokern-data-lineage |     ancestors, node = generator.send(None)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 221, in iterate
tokern-data-lineage |     raise ValueError('Bad argument, expected a ast.Node instance or a tuple')
tokern-data-lineage | ValueError: Bad argument, expected a ast.Node instance or a tuple

vrajat commented 3 years ago

Yes. The open source package as it currently stands has poor Snowflake coverage. This is because there is no good open source Snowflake SQL parser. All other OSS packages have a similar problem. I have private scripts to parse and I support snowflake as part of my freelance gigs. These scripts have too many hacks to open source. I am looking for funding to build an open source Snowflake parser.

tokern / data-lineage

Parser trips up on common snowflake query history #78