Open maoxingda opened 11 months ago
Giving aliases t3 and t4 to cte1 and cte2 respectively does not yield the correct results.
insert into
public.t3
(
c1
)
with
cte1 as (
select t1.c1 from public.t1 t1
),
cte2 as (
select t2.c1 from public.t2 t2
)
select
sq1.c1
from
(
select t3.c1 from cte1 as t3 union all select t4.c1 from cte2 as t4
) sq1
;
When using the Redshift dialect and there is a 'join_clause' in the 'from_clause' of the 'set_expression,' the lineage is also incorrect.
insert into
public.tgt_tbl1
(
id
)
with
cte1 as (
select s1.id from public.src_tbl1 as s1
),
cte2 as (
select s2.id from public.src_tbl2 as s2
)
select
sq1.id
from
(
select c1.id || s3.id as id from cte1 as c1 join public.src_tbl3 as s3 on c1.id = s3.id
union all
select c2.id from cte2 as c2
) sq1
;
Does the code on line 154 in the screenshot not take into account scenarios involving nested subqueries?
When there are two subqueries with the same name in the set_expression, the column lineage is also incorrect.
insert into
public.tgt_tbl1
(
id
)
select
sq.id
from
(
select
id
from
public.src_tbl1
) sq
union all
select
sq.id
from
(
select
id
from
public.src_tbl2
) sq
;
sqllineage.core.models.SubQuery.__str__
def __str__(self):
# return self.alias
return re.sub(r'\s+', ' ', self.query_raw).strip()
The table level lineage is incorrect, too
$ sqllineage -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
public.t1
public.t2
Target Tables:
public.t3
$ sqllineage -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
<default>.cte1
<default>.cte2
public.t1
public.t2
Target Tables:
public.t3
We have some bugs here when handling set expression together with CTE that we mis-identify CTE as normal table. This should be fixed first.
With #488 merged, now for the SQL
insert into
public.t3
(
c1
)
with
cte1 as (
select t1.c1 from public.t1 t1
),
cte2 as (
select t2.c1 from public.t2 t2
)
select
sq1.c1
from
(
select cte1.c1 from cte1 union all select cte2.c1 from cte2
) sq1
;
non-validating and redshift generate same result for both table lineage and column lineage:
$ python -m sqllineage.cli -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
public.t1
public.t2
Target Tables:
public.t3
$ python -m sqllineage.cli -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
public.t1
public.t2
Target Tables:
public.t3
$ python -m sqllineage.cli -f test.sql --dialect=non-validating -l column
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1
$ python -m sqllineage.cli -f test.sql --dialect=redshift -l column
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1
Thank you very much, boss. ✅
two subqueries with the same name in the set_expression is the only remaining buggy sql in this issue that we will take care of in #489 .
The results of column lineage using non-validating and Redshift dialects are inconsistent.
SQL
To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named
test.sql
Expected behavior
Python version (available via
python --version
)SQLLineage version (available via
sqllineage --version
):Additional context