reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.33k stars 240 forks source link

SubQuery with Same Alias Visualized as Same Node #481

Open maoxingda opened 11 months ago

maoxingda commented 11 months ago

The results of column lineage using non-validating and Redshift dialects are inconsistent.

SQL

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select cte1.c1 from cte1 union all select cte2.c1 from cte2
    ) sq1
;

To Reproduce

Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner

with open('test.sql', 'r') as f:
    sql = f.read()

lr1 = LineageRunner(sql)
lr2 = LineageRunner(sql, dialect='redshift')

pprint(lr1.get_column_lineage())
pprint(lr2.get_column_lineage())
[(Column: public.t1.c1, Column: cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: public.t2.c1, Column: cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

[(Column: <default>.cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: <default>.cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

Expected behavior

[(Column: public.t1.c1, Column: cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: public.t2.c1, Column: cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

Additional context

Snipaste_2023-11-27_12-01-37

maoxingda commented 11 months ago

Giving aliases t3 and t4 to cte1 and cte2 respectively does not yield the correct results.

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select t3.c1 from cte1 as t3 union all select t4.c1 from cte2 as t4
    ) sq1
;
Snipaste_2023-11-26_10-04-58
maoxingda commented 11 months ago

When using the Redshift dialect and there is a 'join_clause' in the 'from_clause' of the 'set_expression,' the lineage is also incorrect.

insert into
    public.tgt_tbl1
(
    id
)
with
    cte1 as (
        select s1.id from public.src_tbl1 as s1
    ),
    cte2 as (
        select s2.id from public.src_tbl2 as s2
    )
select
    sq1.id
from
    (
        select c1.id || s3.id as id from cte1 as c1 join public.src_tbl3 as s3 on c1.id = s3.id
        union all
        select c2.id from cte2 as c2
    ) sq1
;
Snipaste_2023-11-27_10-29-49
maoxingda commented 11 months ago

Does the code on line 154 in the screenshot not take into account scenarios involving nested subqueries?

Snipaste_2023-11-27_14-12-09

maoxingda commented 11 months ago

When there are two subqueries with the same name in the set_expression, the column lineage is also incorrect.

insert into
    public.tgt_tbl1
(
    id
)
select
    sq.id
from
    (
        select
            id
        from
            public.src_tbl1
    ) sq

union all

select
    sq.id
from
    (
        select
            id
        from
            public.src_tbl2
    ) sq
;

Snipaste_2023-11-29_12-54-55

sqllineage.core.models.SubQuery.__str__

def __str__(self):
    # return self.alias
    return re.sub(r'\s+', ' ', self.query_raw).strip()

Snipaste_2023-11-29_12-55-14

reata commented 11 months ago

The table level lineage is incorrect, too

$ sqllineage -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ sqllineage -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
    <default>.cte1
    <default>.cte2
    public.t1
    public.t2
Target Tables:
    public.t3

We have some bugs here when handling set expression together with CTE that we mis-identify CTE as normal table. This should be fixed first.

reata commented 11 months ago

With #488 merged, now for the SQL

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select cte1.c1 from cte1 union all select cte2.c1 from cte2
    ) sq1
;

non-validating and redshift generate same result for both table lineage and column lineage:

$ python -m sqllineage.cli -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ python -m sqllineage.cli -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ python -m sqllineage.cli -f test.sql --dialect=non-validating -l column         
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1
$ python -m sqllineage.cli -f test.sql --dialect=redshift -l column
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1
maoxingda commented 11 months ago

Thank you very much, boss. ✅

reata commented 11 months ago

two subqueries with the same name in the set_expression is the only remaining buggy sql in this issue that we will take care of in #489 .