reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.3k stars 235 forks source link

Metadata Masked When Table was in a previous UPDATE statement #577

Closed phalcon22 closed 7 months ago

phalcon22 commented 7 months ago

Describe the bug

SQL

UPDATE `dataset.table_1` SET A = '';

CREATE OR REPLACE TEMP TABLE `table_x` AS
SELECT DISTINCT
    B
FROM `dataset.table_1`
CROSS JOIN `dataset.table_2`
;

To Reproduce Note here we refer to SQL provided in prior step as stored in a file named test.sql

import json
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
from sqllineage.runner import LineageRunner

with open("test.sql") as f:
    sp = f.read()

with open('metadata.json') as file:
    metadata = json.load(file)
provider = DummyMetaDataProvider(metadata)

lineage = LineageRunner(sp, dialect="bigquery", metadata_provider=provider, verbose=True)
lineage.print_column_lineage()

Metadata.json:

{
  "dataset.table_1": [
    "A",
    "B"
  ],
  "dataset.table_2": [
    "C"
  ]
}

Result

<default>.table_x.b <- b

Expected behavior

<default>.table_x.b <- dataset.table_1.b

Python version (available via python --version)

SQLLineage version (available via sqllineage --version):

Notes

reata commented 7 months ago

The issue comes from when we handling session metadata. The intention for session metadata is that for temporary table/view created during the session, the schema info won't be available in metadata, as the table/view is not created yet.

But clearly we should limit the scope of session metadata to CREATE (and maybe INSERT). By now means should we register session metadata for UPDATE statement.

In this case, we register session metadata for dataset.table_1 through UPDATE, which doesn't include any schema information. And in the end this masks the schema info from MetaDataProvider.