reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.35k stars 240 forks source link

Bug when Getting column lineage on Teradata with intermediate tables #638

Open Ricardop1 opened 4 months ago

Ricardop1 commented 4 months ago

Describe the bug Relations are wrong when using intermediate tables on column level. When using a intermediate table and naming its columns, it is describing the columns as the Table/Subquery instead of the table name specified.

SQL

insert into db_test.table_target
select col_1_renamed, col_2_renamed from
(Select col1, max(col2) from db_test.table_src
where
  col1 = "AV" and
  col2 is not null
) AS Inter_table(col_1_renamed, col_2_renamed);

To Reproduce

from sqllineage.runner import LineageRunner, LineageLevel

query = """ insert into db_test.table_target
select col_1_renamed, col_2_renamed from
(Select col1, max(col2) from db_test.table_src
where
  col1 = "AV" and
  col2 is not null
) AS Inter_table(col_1_renamed, col_2_renamed);
"""
result = (
    LineageRunner(query, dialect="teradata")
)
lineage_stmt = result.to_cytoscape(LineageLevel.COLUMN)
print(lineage_stmt)
{'data': {'id': 'db_test.table_src.col1', 'parent': 'db_test.table_src', 'parent_candidates': [{'name': 'db_test.table_src', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col1', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_src.col2', 'parent': 'db_test.table_src', 'parent_candidates': [{'name': 'db_test.table_src', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).max(col2)', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col_1_renamed', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_target.col_1_renamed', 'parent': 'db_test.table_target', 'parent_candidates': [{'name': 'db_test.table_target', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': '(col_1_renamed, col_2_renamed).col_2_renamed', 'parent': '(col_1_renamed, col_2_renamed)', 'parent_candidates': [{'name': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_target.col_2_renamed', 'parent': 'db_test.table_target', 'parent_candidates': [{'name': 'db_test.table_target', 'type': 'Table'}], 'type': 'Column'}}
{'data': {'id': 'db_test.table_src', 'type': 'Table'}}
{'data': {'id': '(col_1_renamed, col_2_renamed)', 'type': 'SubQuery'}}
{'data': {'id': 'db_test.table_target', 'type': 'Table'}}
{'data': {'id': 'e0', 'source': 'db_test.table_src.col1', 'target': '(col_1_renamed, col_2_renamed).col1'}}
{'data': {'id': 'e1', 'source': 'db_test.table_src.col2', 'target': '(col_1_renamed, col_2_renamed).max(col2)'}}
{'data': {'id': 'e2', 'source': '(col_1_renamed, col_2_renamed).col_1_renamed', 'target': 'db_test.table_target.col_1_renamed'}}
{'data': {'id': 'e3', 'source': '(col_1_renamed, col_2_renamed).col_2_renamed', 'target': 'db_test.table_target.col_2_renamed'}}

Expected behavior The code should detect Inter_table as the Subquery and not (col_1_renamed, col_2_renamed) as a whole. The column lineage as source and target should be: db_test.table_src.col1 -> Inter_table.col_1_renamed db_test.table_src.col2 -> Inter_table.col_2_renamed

Inter_table.col_1_renamed -> db_test.table_target.col_1_renamed Inter_table.col_2_renamed -> db_test.table_target.col_2_renamed

**Python version

SQLLineage version (available via sqllineage --version):

Additional context Looks like there is a problem when detecting intermediate tables