reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

Support Hive/SparkSQL Multi Table Insert Syntax #596

Open krahnikblis opened 2 months ago

krahnikblis commented 2 months ago

Describe the bug when processing SQL scripts where the grammar is backwards, LineageRunner().target_tables fails to parse/fix.

SQL Paste the SQL text here. For example:

FROM `dbname`.`tblname_src`
INSERT OVERWRITE TABLE `dbname`.`tblname_dest`
PARTITION (parcol1,parcol2)
SELECT businessgroup    ,
region  ,
source_of_opportunity   ,
pcg_rad ,
opportunity_id_d365 ,
opportunity_id  ,
opportunity_name    ,
opportunity_owner   ,
parcol1 ,
parcol2

To Reproduce Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner
with open("test.sql") as f:
    sql = f.read()
result = LineageRunner(sql, dialect="sparksql")
print(result.target_tables)
---------------------------------------------------------------------------
InvalidSyntaxException                    Traceback (most recent call last)
Cell In[73], line 3
      1 # sql_fmt = sqlfluff.fix(sqlText, dialect="sparksql", config_path=cfg["sqlfluff_cfg"])
      2 sqllin = LineageRunner(sql=sql_fmt, dialect="sparksql")
----> 3 sqllin.target_tables

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:24, in lazy_method.<locals>.wrapper(*args, **kwargs)
     22 self = args[0]
     23 if not self._evaluated:
---> 24     self._eval()
     25 return func(*args, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:187, in LineageRunner._eval(self)
    182         warnings.warn(
    183             f"Dialect={self._dialect}, TSQL_NO_SEMICOLON will be ignored unless dialect is tsql"
    184         )
    185     self._stmt = split(self._sql.strip())
--> 187 self._stmt_holders = [analyzer.analyze(stmt) for stmt in self._stmt]
    188 self._sql_holder = SQLLineageHolder.of(*self._stmt_holders)
    189 self._evaluated = True

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:187, in <listcomp>(.0)
    182         warnings.warn(
    183             f"Dialect={self._dialect}, TSQL_NO_SEMICOLON will be ignored unless dialect is tsql"
    184         )
    185     self._stmt = split(self._sql.strip())
--> 187 self._stmt_holders = [analyzer.analyze(stmt) for stmt in self._stmt]
    188 self._sql_holder = SQLLineageHolder.of(*self._stmt_holders)
    189 self._evaluated = True

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py:42, in SqlFluffLineageAnalyzer.analyze(self, sql)
     40     statement_segments = [self.tsql_split_cache[sql]]
     41 else:
---> 42     statement_segments = self._list_specific_statement_segment(sql)
     43 if len(statement_segments) == 0:
     44     raise UnsupportedStatementException(
     45         f"SQLLineage cannot parse SQL:" f"{sql}"
     46     )  # pragma: no cover

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py:73, in SqlFluffLineageAnalyzer._list_specific_statement_segment(self, sql)
     71 if violations:
     72     violation_msg = "\n".join(violations)
---> 73     raise InvalidSyntaxException(
     74         f"This SQL statement is unparsable, please check potential syntax error for SQL:\n"
     75         f"{sql}\n"
     76         f"{violation_msg}"
     77     )
     78 segments = []
     79 for top_segment in getattr(parsed.tree, "segments", []):

InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL:
<SQLtext printed>
Line 1, Position 1: Found unparsable section: 'FROM `dbname`.`tblname_src`...'

Expected behavior processing the SQL as normal, or perhaps (better) put the FROM segment after SELECT [where it belongs; i know i know i didn't write the SQL just trying to organized and make sense of other peoples' stuff]

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

Additional context it looks like this library sub-packages sqlfluff within its parser? i didn't know where to log the issue, but decided here since the errors above show \site-packages\sqllineage\ for all traces. separately, i do have sqlfluff 2.3.5 installed, i don't know if this means i have 2 versions or what...

reata commented 1 month ago

This feature is known as multiple insert clauses (also known as Multi Table Insert). It's not standard ANSI SQL syntax but a Hive extension. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Syntax.1

SparkSQL also supports this syntax according to its antlr g4 file, although it's not documented: https://github.com/apache/spark/blob/v3.5.1/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L444

We need sqlfluff to support parsing this before we can add lineage analysis support. Upstream issue raise: https://github.com/sqlfluff/sqlfluff/issues/5866