reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

Greenplum LATERAL Subquery doesn't work #613

Open kkozhakin opened 1 month ago

kkozhakin commented 1 month ago

Describe the bug LATERAL with VALUES doesn't work

SQL Paste the SQL text here. For example:

CREATE TEMPORARY TABLE result_table
    ON COMMIT DROP AS
SELECT
    a.f1,
    a.f2,
    b.f1 as f3,
    b.f2 as f4,
    b.f3 as f5
FROM tmp_base AS a
CROSS JOIN LATERAL (
VALUES
    ('s1', a.f3, a.f9),
    ('s2', a.f4, a.f10),
    ('s3', a.f5, a.f11),
    ('s4', a.f6, a.f12),
    ('s5', a.f7, a.f13),
    ('s6', a.f8, a.f14)
) AS b (f1, f2, f3);

INSERT INTO "schema_1"."table_1"
SELECT * FROM "result_table";

To Reproduce Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner
lr_sqlfluff = LineageRunner(
    sql, dialect='greenplum', silent_mode=True
)

for path in lr_sqlfluff.get_column_lineage(False):
    print(" <- ".join(str(col) for col in reversed(path)))
2024-03-21 09:36:17.221 pid=590 MainThread run_id=998b863b-7a77-44c4-b74f-72e621b1fe19   ERROR __main__: Lineage for task delivery_etl.delivery_gp_rep_supply_rep_supply_finance_pivoted: FAIL
Traceback (most recent call last):
  File "/place/sandbox-data/tasks/1/4/2232947541/mounted_arcadia/taxi/dmp/dwh/services/meta_etl/meta_etl/layer/yt/import_meta/dmp/collect_column_lineage_sql/impl.py", line 58, in get_lineage_by_sql
    paths = get_lineage_with_timeout(prepare_sql(row['combined_query']))
  File "/place/sandbox-data/tasks/1/4/2232947541/mounted_arcadia/taxi/dmp/dwh/libraries/column_lineage/column_lineage/collect_column_lineage_sql/impl.py", line 37, in new_f
    result = f(*args, **kwargs)
  File "/place/sandbox-data/tasks/1/4/2232947541/mounted_arcadia/taxi/dmp/dwh/libraries/column_lineage/column_lineage/collect_column_lineage_sql/impl.py", line 121, in get_lineage_with_timeout
    return _get_lineage(sql)
  File "/place/sandbox-data/tasks/1/4/2232947541/mounted_arcadia/taxi/dmp/dwh/libraries/column_lineage/column_lineage/collect_column_lineage_sql/impl.py", line 129, in _get_lineage
    for path in lr.get_column_lineage(False, False):
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/runner.py", line 26, in wrapper
    self._eval()
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/runner.py", line 199, in _eval
    stmt_holder = analyzer.analyze(stmt, session.metadata_provider)
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/analyzer.py", line 58, in analyze
    lineage_holder = extractor.extract(
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/extractors/create_insert.py", line 45, in extract
    holder |= self.delegate_to_select(segment, holder)
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/extractors/create_insert.py", line 138, in delegate_to_select
    return self.delegate_to(
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/extractors/base.py", line 206, in delegate_to
    return extractor_cls(self.dialect, self.metadata_provider).extract(
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/extractors/select.py", line 46, in extract
    for sq in self.list_subquery(segment):
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/extractors/base.py", line 84, in list_subquery
    result = cls._parse_subquery(list_subqueries(segment))
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/utils.py", line 183, in list_subqueries
    subquery += list_subqueries(from_expression_element)
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/utils.py", line 154, in list_subqueries
    as_segment, target = extract_as_and_target_segment(segment)
  File "/opt/dmp/toxenvs/py3/lib/python3.9/site-packages/sqllineage/core/parser/sqlfluff/utils.py", line 229, in extract_as_and_target_segment
    target = sublist[0] if is_subquery(sublist[0]) else sublist[0].segments[0]
IndexError: tuple index out of range

Expected behavior A clear and concise description of what you expected to happen, and the output in accordance with the To Reproduce section.

schema_1.table_1.f1 <- tmp_base.f1
schema_1.table_1.f2 <- tmp_base.f2
schema_1.table_1.f4 <- tmp_base.f3
schema_1.table_1.f4 <- tmp_base.f4
schema_1.table_1.f4 <- tmp_base.f5
schema_1.table_1.f4 <- tmp_base.f6
schema_1.table_1.f4 <- tmp_base.f7
schema_1.table_1.f4 <- tmp_base.f8
schema_1.table_1.f5 <- tmp_base.f9
schema_1.table_1.f5 <- tmp_base.f10
schema_1.table_1.f5 <- tmp_base.f11
schema_1.table_1.f5 <- tmp_base.f12
schema_1.table_1.f5 <- tmp_base.f13
schema_1.table_1.f5 <- tmp_base.f14

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

reata commented 1 month ago

We have issue handling lateral subqueries. It also throws the same exception for simple case like:

SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
reata commented 1 month ago

With #615 merged, the exception is gone. The current output from master branch is:

<default>.result_table.f1 <- <default>.tmp_base.f1
<default>.result_table.f2 <- <default>.tmp_base.f2
<default>.result_table.f3 <- <default>.b.f1
<default>.result_table.f4 <- <default>.b.f2
<default>.result_table.f5 <- <default>.b.f3
schema_1.table_1.* <- <default>.result_table.*

Parsing column lineage from lateral subquery is another story then. <default>.b.f1 <default>.b.f2 and <default>.b.f3 here are less desirable but requires further effort.

Among the three, <default>.b.f1 comes from constant values, we have #480 to track that.