reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.19k stars 215 forks source link

Allow unambiguous column reference for JOIN with USING clause #558

Closed kkozhakin closed 5 months ago

kkozhakin commented 5 months ago

Is your feature request related to a problem? Please describe. Why lineage is not allowed from more than one table or subquery?

Lineage does not work on queries like:

CREATE TEMPORARY TABLE result_table AS
SELECT f1
FROM ( SELECT f1 FROM t1)
LEFT JOIN ( SELECT f1 FROM t2) USING (f1);

Describe the solution you'd like

for src_col in src_cols:
    g.add_edge(src_col, tgt_col, type=EdgeType.LINEAGE)
g.remove_edge(unresolved_col, tgt_col)

instad of

if len(src_cols) > 1:
    raise InvalidSyntaxException(
        f"{unresolved_col.raw_name} is not allowed from more than one table or subquery"
    )

if len(src_cols) == 1:
    g.add_edge(src_cols[0], tgt_col, type=EdgeType.LINEAGE)
    g.remove_edge(unresolved_col, tgt_col)

(https://github.com/reata/sqllineage/blob/master/sqllineage/core/holders.py#L442)

This solution may not always work correctly, just the first thing that came to mind.

Describe alternatives you've considered add option as silent_mode

reata commented 5 months ago

InvalidSyntaxException is raised here under the rationale that it's indeed syntax error if we use more commonly seen join clause.

CREATE TEMPORARY TABLE result_table AS
SELECT f1
FROM ( SELECT f1 FROM t1) a
LEFT JOIN ( SELECT f1 FROM t2) b
ON a.f1 = b.f1;

In this case, the SQL is not executable that most SQL engine would complain f1 is ambiguous. So it's hard to tell whether it's wrong or right if we output two column lineage paths.

However, JOIN with USING is indeed a good counter example that's syntactically correct, where two column lineage paths is more reasonable.

Neither way is perfect. Any thoughts?

kkozhakin commented 5 months ago

I agree, that your example with "left join with on" is not syntactically correct and may not be supported.

I believe, that it would be better to support "left join with using" and "left join with on" in the same way, as I offered in the first message. So, the correct syntax is supported and this is definitely good. Broken query produces wrong lineage, but it should produce less impact on users as such queries do not work at all and should be fixed at some point of time.

reata commented 5 months ago

Sure. Let's make the change then.

Please add some comments so those who read the code later know len(src_cols) >1 can be a correct case for "JOIN with USING" and we're making the design choice with "JOIN with ON" in mind.

I'll expect your PR. Thanks for your contribution.

maoxingda commented 5 months ago
create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);

CREATE TEMPORARY TABLE result_table AS
SELECT f1, f2
FROM ( SELECT f1 FROM t1)
LEFT JOIN ( SELECT f1 FROM t2) USING (f1);

Is the above SQL syntax correct? If so, what should the column lineage look like. Otherwise, treat me as if I didn’t say it.

kkozhakin commented 5 months ago

This SQL is incorrect, but correct:

create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);

CREATE TEMPORARY TABLE result_table AS
SELECT f1, f2
FROM ( SELECT f1, f2 FROM t1)
LEFT JOIN ( SELECT f1, f2 FROM t2) USING (f1);

It is equal to:

create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);

CREATE TEMPORARY TABLE result_table AS
SELECT coalesce(tmp1.f1, tmp2.f1) as f1, coalesce(tmp1.f2, tmp2.f2) as f2
FROM ( SELECT f1, f2 FROM t1) as tmp1
LEFT JOIN ( SELECT f1, f2 FROM t2) as tmp2 ON tmp1.f1 = tmp2.f1;
maoxingda commented 5 months ago

Thanks for the answer.😜

kkozhakin commented 5 months ago

It's not work with

INSERT INTO tab3
SELECT f1
FROM ( SELECT f1 FROM tab1) a
LEFT JOIN tab2 USING (f1)
reata commented 5 months ago

It's not work with

INSERT INTO tab3
SELECT f1
FROM ( SELECT f1 FROM tab1) a
LEFT JOIN tab2 USING (f1)

To completely fix this, I guess we have to recognize JOIN with USING and JOIN with ON, which unfortunately doesn't fit current design easily.

The other option is to force querying metadata when it's available. But that's not an elegant solution since this actually can be done by purely looking at SQL as text.