reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.35k stars 241 forks source link

Subquery Partial Wildcard expansion breaks the column lineage path #612

Open kkozhakin opened 6 months ago

kkozhakin commented 6 months ago

Describe the bug CASE statement breaks the table sequence

SQL Paste the SQL text here. For example:

create temporary table result_table on commit drop as
with t1 as (
    select
        t.*
    from t0 as t
), t2 as (
    select
        t.*
    from t1 as t
),
 t3 as (
    select
        case
            when FALSE
            then f1
            else null
        end as f2,
        t.*
    from t2 as t
), t4 as (
    select
    *
    from t3
),

select
    *
from t4;

INSERT INTO "schema_1"."table_1"
SELECT * FROM "result_table";

To Reproduce Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner
lr_sqlfluff = LineageRunner(
    sql, dialect='greenplum', silent_mode=True
)

for path in lr_sqlfluff.get_column_lineage(False):
    print(" <- ".join(str(col) for col in reversed(path)))
schema_1.table_1.* <- <default>.result_table.* <- t4.* <- t3.*
t2.* <- t1.* <- <default>.t0.*
t3.f1 <- t2.f1
t3.f2 <- t2.f1

Expected behavior A clear and concise description of what you expected to happen, and the output in accordance with the To Reproduce section.

schema_1.table_1.* <- <default>.result_table.* <- t4.* <- t3.*<-t2.* <- t1.* <- <default>.t0.*
t3.f1 <- t2.f1
t3.f2 <- t2.f1

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

kkozhakin commented 6 months ago

As I understand, it thinks that only f1 comes from the t2 and skips other

reata commented 6 months ago

This hits a corner case for wildcard expansion from subquery.

Can you kindly confirm if the following output is acceptable:

schema_1.table_1.* <- <default>.result_table.* <- t4.* <- t3.* <- t2.* <- t1.* <- <default>.t0.*
t3.f2 <- t2.f1

The difference is that t3.f1 <- t2.f1 is not included.

PS: A simplified test case:

WITH t2 AS (SELECT *
            FROM t1),
     t3 AS (SELECT f1,
                   *
            FROM t2)
INSERT INTO result_table
SELECT *
FROM t3;
kkozhakin commented 6 months ago

Yes, of course