reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.3k stars 235 forks source link

fix: misidentify column name as lateral alias (#539) #540

Closed maoxingda closed 8 months ago

maoxingda commented 8 months ago

fix #539

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (a93b894) 99.50% compared to head (df79cf0) 99.50%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #540 +/- ## ======================================= Coverage 99.50% 99.50% ======================================= Files 41 41 Lines 2208 2223 +15 ======================================= + Hits 2197 2212 +15 Misses 11 11 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

maoxingda commented 8 months ago
    insert into public.tgt_tbl1
    (
        id,
        id_original
    )
    select
        a || b || c || id as id,
        id as id_original        -- # noqa: E501 TODO: I need the metadata information for the table public.src_tbl1 to identify whether the column reference 'id' in this context is from the table public.src_tbl1 or from an alias reference, currently being used as an alias reference. Note: This decision may significantly deviate from the actual scenario.
    from
        public.src_tbl1

I need the help of a pro.😜😜😜

reata commented 8 months ago

See my comment on #539 or email. Maybe we should take a step back and rethink on the approach.

maoxingda commented 8 months ago
select
    a || b || c as c,
    c as d            -- Metadata for a subquery is needed in this context to confirm whether the reference to 'c' is from the subquery or an alias reference.
from
    (
        select
            1 as a,
            2 as b,
            3 as c
    )

The current parsing method has no knowledge of the subquery in this context. Because at this point, the subquery has not yet begun to be parsed.

maoxingda commented 8 months ago

https://github.com/maoxingda/sqllineage/blob/ba0f63097ce06d20bb31e564c25af5e0e8ebfb99/sqllineage/core/parser/sqlfluff/extractors/cte.py#L67

https://github.com/maoxingda/sqllineage/blob/ba0f63097ce06d20bb31e564c25af5e0e8ebfb99/sqllineage/core/parser/sqlfluff/extractors/select.py#L77

https://github.com/maoxingda/sqllineage/blob/ba0f63097ce06d20bb31e564c25af5e0e8ebfb99/sqllineage/core/parser/sqlfluff/extractors/update.py#L76

Move the parsing of the subquery ahead of the for loop, so that the metadata information of the subquery is available. This way, can we determine whether the column references in the SELECT clause come from the subquery or lateral alias references?

maoxingda commented 8 months ago

After #552 merge, I will refactor the configuration of LATERAL_COLUMN_ALIAS_REFERENCE

reata commented 8 months ago

Not sure if this is already complete. Please re-request review once you're done.

maoxingda commented 8 months ago

Not sure if this is already complete. Please re-request review once you're done.

already done.

reata commented 8 months ago

Rebase master and make LATERAL_COLUMN_ALIAS_REFERENCE bool type.

This PR indeed makes huge changes to current design. I need to take more time to review and refactor. The part I dislike most is that we need to do different things by telling if SQLLineageConfig.LATERAL_COLUMN_ALIAS_REFERENCE and bool(self.metadata_provider) is True everywhere. It will be better we can do this in a unified way and use this condition as few as possible.

The first option I was trying is to move self.extract_subquery(subqueries, holder) up in select.py regardless. But that doesn't work yet. I left a open question, not sure if that will work. If you can help me investigate, that will be helpful.

By the way, I checked all the test cases and they look good.

maoxingda commented 8 months ago

OK, let me try.

maoxingda commented 8 months ago

already refactor done, please re-review, thanks. βœ…

maoxingda commented 8 months ago

https://www.databricks.com/blog/introducing-support-lateral-column-alias

I don’t know much about databricks but it seems that it also supports lateral column alias (LCA) reference.

maoxingda commented 8 months ago

https://sqlkover.com/cool-stuff-in-snowflake-part-4-aliasing-all-the-things/

It seems that snowflake also supports it, but unfortunately I don’t have the environment to verify it.

maoxingda commented 8 months ago

https://forums.oracle.com/ords/apexds/post/how-to-reference-column-alias-names-in-other-columns-of-a-q-7737

It seems that snowflake also supports it, but unfortunately I don’t find the official document.

reata commented 8 months ago

Popularity is one of the considerations. The major reason I'd like to move this feature as configurable is that it won't function without metadata. And the assumption is that by default sqllineage only does static code analysis and metadata is not present.

maoxingda commented 8 months ago

Agree. This feature is added because it is very pleasant to use.😜😜😜

maoxingda commented 8 months ago

πŸ‘Œ

reata commented 8 months ago

Postpone merging to next week.

Right now to_source_columns does part of the column to table/subquery resolution work. It would be better that we can handle this universally in _build_digraph method of class SQLLineageHolder.

I'm investigating whether we can move all logic to end_of_query_cleanup and not modify to_source_columns.