Open peteclark3 opened 3 years ago
Yes. This is a known issue and happy to discuss how to approach parsing Snowflake queries better. Right now, this package uses pglast for parsing. So it parses Postgres syntax. It also does pretty well with Redshift since the dialect is very similar. A set of parsers are required to support different databases. There are a few options that I have considered and funded through freelance gigs:
Both of these require substantial effort and I am looking for opportunities to fund it.
Ok. Is it fair to say that this package doesn't really support snowflake yet? I see it in the marketing materials - https://tokern.io/data-lineage/ .. but in practice when I try the examples, I never get any result in the graph because it trips up on so many queries. If there is an implementation that does use snowflake effectively, or at least filter out the syntax that pglast doesn't support, I'd be happy to use it... but right now I think I am going to have to try a different package because it's getting to a point where I have so many query syntax exclusions that it's becoming not feasible... (even then, queries are failing and not telling me why.. I'm just getting a bunch of these now):
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/server.py", line 186, in post
tokern-data-lineage | chosen_visitor = analyze_dml_query(self._catalog, parsed, source)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 54, in analyze_dml_query
tokern-data-lineage | chosen_visitor = visit_dml_query(parsed, source)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 89, in visit_dml_query
tokern-data-lineage | v(parsed.node)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 197, in __call__
tokern-data-lineage | result = method(ancestors, node)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/dml_visitor.py", line 254, in visit_SelectStmt
tokern-data-lineage | super().__call__(node.intoClause)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 190, in __call__
tokern-data-lineage | ancestors, node = generator.send(None)
tokern-data-lineage | File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 221, in iterate
tokern-data-lineage | raise ValueError('Bad argument, expected a ast.Node instance or a tuple')
tokern-data-lineage | ValueError: Bad argument, expected a ast.Node instance or a tuple
Yes. The open source package as it currently stands has poor Snowflake coverage. This is because there is no good open source Snowflake SQL parser. All other OSS packages have a similar problem. I have private scripts to parse and I support snowflake as part of my freelance gigs. These scripts have too many hacks to open source. I am looking for funding to build an open source Snowflake parser.
Currently, the parser trips up on many common snowflake query history entries like
select query_text from table(information_schema.query_history());
- also queries with therm @SNOWFLAKE_...
syntax... also queries with the keywordrecluster
... in the latter case, the error beingsyntax error at or near "recluster", at index 35
... I am systematically removing these from analysis prior to sending to the analyzer but just FYI that without doing this, the analyzer throws an exception