reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.35k stars 241 forks source link

question about column-level lineage: column in group by clause #621

Closed hu-guanwei closed 5 months ago

hu-guanwei commented 6 months ago

Hi. About the exampe in doc,

INSERT INTO foo
SELECT a.col1,
       b.col1     AS col2,
       c.col3_sum AS col3,
       col4,
       d.*
FROM bar a
         JOIN baz b
              ON a.id = b.bar_id
         LEFT JOIN (SELECT bar_id, sum(col3) AS col3_sum
                    FROM qux
                    GROUP BY bar_id) c
                   ON a.id = sq.bar_id
         CROSS JOIN quux d;

INSERT INTO corge
SELECT a.col1,
       a.col2 + b.col2 AS col2
FROM foo a
         LEFT JOIN grault b
              ON a.col1 = b.col1;

The output is

<default>.corge.col1 <- <default>.foo.col1 <- <default>.bar.col1
<default>.corge.col2 <- <default>.foo.col2 <- <default>.baz.col1
<default>.corge.col2 <- <default>.grault.col2
<default>.foo.* <- <default>.quux.*
<default>.foo.col3 <- c.col3_sum <- <default>.qux.col3
<default>.foo.col4 <- col4

Q: The lineage of col3 is <default>.foo.col3 <- c.col3_sum <- <default>.qux.col3 shouldn't the output contain a line of <default>.foo.col3 <- c.col3_sum <- <default>.qux.bar_id? since the calculation of col3_sum also depends on bar_id

reata commented 5 months ago

As of now, column lineage is actually SELECT column lineage. JOIN/GROUP BY/WHERE is not considered yet.

We don't have a mature design for NON-SELECT column lineage for now, feel free to add your suggestion in #276.