reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

Column Level Lineage: SELECT * EXCEPT() not showing all columns (BigQuery) #576

Open phalcon22 opened 4 months ago

phalcon22 commented 4 months ago

Describe the bug

SQL

CREATE OR REPLACE TEMP TABLE `table_1` AS

SELECT
    * EXCEPT(B)
FROM `dataset.table_0`
;

To Reproduce Note here we refer to SQL provided in prior step as stored in a file named test.sql

import json
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
from sqllineage.runner import LineageRunner

with open("test.sql") as f:
    sp = f.read()

with open('metadata.json') as file:
    metadata = json.load(file)
provider = DummyMetaDataProvider(metadata)

lineage = LineageRunner(sp, dialect="bigquery", metadata_provider=provider, verbose=True)
lineage.print_column_lineage()

Metadata.json:

{
    "dataset.table_0": [
        "A",
        "B",
        "C",
        "D"
      ]
}

Result

<default>.table_1.* except(b) <- dataset.table_0.* except(b)

Expected behavior

<default>.table_1.A <- dataset.table_0.A
<default>.table_1.C <- dataset.table_0.C
<default>.table_1.D <- dataset.table_0.D

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

reata commented 4 months ago

Bug confirmed. * EXCEPT(B) is parsed as column name, which is clearly wrong. To fix the issue:

  1. we need to correctly recognize this is wildcard, so expand wildcard with metadata functionality works.
  2. handle except during wildcard expansion