reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

column lineage parse result error #530

Open delphisharp opened 5 months ago

delphisharp commented 5 months ago

Describe the bug For example:

v_sql="""
INSERT OVERWRITE TABLE target 
SELECT NVL(PROV_CODE,'999') aaa,COUNT(DISTINCT MSISDN) bbb FROM (
SELECT NVL(PROV_CODE,'000') PROV_CODE,A.MSISDN FROM (
SELECT MSISDN FROM (
    SELECT MSISDN, BUSI_ID
    FROM source1 A
    UNION ALL
    SELECT  concat(A.MSISDN,'20230826') MSISDN,A.MUSIC_BUSI_CODE
     FROM source2 A
     union all
     SELECT concat(A.MSISDN,'20230826') MSISDN, BUSI_code
    FROM source4 A
) C GROUP BY MSISDN
) A
LEFT JOIN source3 B
ON SUBSTRING(A.MSISDN,1,7)=B.MSISDN_NBR_PAR
) T GROUP BY PROV_CODE GROUPING SETS ((),PROV_CODE);
"""
from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql,dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.a.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn

but if sql is lower():

from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql.lower(),dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source2.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source4.msisdn

Expected behavior

  1. when sql is upper(), column lineage bbb is error
  2. column lineage aaa is error. should be is source3

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

reata commented 5 months ago

There're two issues we need to solve.

  1. when sql is upper(), column lineage bbb is error

Looks like we have some issue with upper case alias used together with UNION. A minimal example with same issue:

INSERT OVERWRITE TABLE TARGET
SELECT MSISDN, BUSI_ID
FROM SOURCE1
UNION ALL
SELECT  CONCAT(A.MSISDN,'20230826') MSISDN, A.MUSIC_BUSI_CODE
FROM SOURCE2 A

Change alias A to lower case a generate correct output.

  1. column lineage aaa is error. should be is source3

Right we we say we don't know if aaa is from subquery a or table source3. But actually we can be smarter, because a is subquery contains only one column named msisdn, which makes table source3 the only possibility. But this "smart logic" is not in our code yet.