reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

UNION * Fails to resolve subsequent column lineage in UNION clause #629

Open skada-coder opened 1 week ago

skada-coder commented 1 week ago

Describe the bug When parsing a UNION clause using * and the MetadataProvider class it fails to resolve subsequent source columns in the UNION clause.

Below is a simplified example.

SQL Paste the SQL text here. For example:

CREATE TABLE MAIN.FOOBAR AS (
    SELECT * FROM MAIN.BAR
    UNION
    SELECT * FROM MAIN.BAR2
    UNION
    SELECT * FROM MAIN.BAR3
)

Column Lineage fails to resolve columns from MAIN.BAR2 and MAIN.BAR3 but MAIN.BAR is fine

To Reproduce

from sqllineage.core.metadata.dummy import DummyMetaDataProvider
from sqllineage.runner import LineageRunner
metadata = {
    "main.bar": ["col1", "col2"],
    "main.bar2": ["col1", "col2"],
    "main.bar3": ["col1", "col2"]
}
provider = DummyMetaDataProvider(metadata)
sql1 = """
CREATE TABLE MAIN.FOOBAR AS (
    SELECT * FROM MAIN.BAR
    UNION
    SELECT * FROM MAIN.BAR2
    UNION
    SELECT * FROM MAIN.BAR3
)
"""
LineageRunner(sql1, metadata_provider=provider).print_column_lineage()

actual output is

main.foobar.col1 <- main.bar.col1
main.foobar.col2 <- main.bar.col2

I'd expect it to return this instead

main.foobar.col1 <- main.bar.col1
main.foobar.col2 <- main.bar.col2
main.foobar.col1 <- main.bar2.col1
main.foobar.col2 <- main.bar2.col2
main.foobar.col1 <- main.bar3.col1
main.foobar.col2 <- main.bar3.col2

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

Note that this issue does not occur in the non case where the columns are all explicit in the UNION clause so looks like it only impacts scenario.

We noticed the issue using the snowflake dialect for our own production use cases