Closed kkozhakin closed 5 months ago
InvalidSyntaxException is raised here under the rationale that it's indeed syntax error if we use more commonly seen join clause.
CREATE TEMPORARY TABLE result_table AS
SELECT f1
FROM ( SELECT f1 FROM t1) a
LEFT JOIN ( SELECT f1 FROM t2) b
ON a.f1 = b.f1;
In this case, the SQL is not executable that most SQL engine would complain f1 is ambiguous. So it's hard to tell whether it's wrong or right if we output two column lineage paths.
However, JOIN with USING is indeed a good counter example that's syntactically correct, where two column lineage paths is more reasonable.
Neither way is perfect. Any thoughts?
I agree, that your example with "left join with on" is not syntactically correct and may not be supported.
I believe, that it would be better to support "left join with using" and "left join with on" in the same way, as I offered in the first message. So, the correct syntax is supported and this is definitely good. Broken query produces wrong lineage, but it should produce less impact on users as such queries do not work at all and should be fixed at some point of time.
Sure. Let's make the change then.
Please add some comments so those who read the code later know len(src_cols) >1
can be a correct case for "JOIN with USING" and we're making the design choice with "JOIN with ON" in mind.
I'll expect your PR. Thanks for your contribution.
create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);
CREATE TEMPORARY TABLE result_table AS
SELECT f1, f2
FROM ( SELECT f1 FROM t1)
LEFT JOIN ( SELECT f1 FROM t2) USING (f1);
Is the above SQL syntax correct? If so, what should the column lineage look like. Otherwise, treat me as if I didn’t say it.
This SQL is incorrect, but correct:
create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);
CREATE TEMPORARY TABLE result_table AS
SELECT f1, f2
FROM ( SELECT f1, f2 FROM t1)
LEFT JOIN ( SELECT f1, f2 FROM t2) USING (f1);
It is equal to:
create table t1 ( f1 int, f2 int);
create table t2 ( f1 int, f2 int);
CREATE TEMPORARY TABLE result_table AS
SELECT coalesce(tmp1.f1, tmp2.f1) as f1, coalesce(tmp1.f2, tmp2.f2) as f2
FROM ( SELECT f1, f2 FROM t1) as tmp1
LEFT JOIN ( SELECT f1, f2 FROM t2) as tmp2 ON tmp1.f1 = tmp2.f1;
Thanks for the answer.😜
It's not work with
INSERT INTO tab3
SELECT f1
FROM ( SELECT f1 FROM tab1) a
LEFT JOIN tab2 USING (f1)
It's not work with
INSERT INTO tab3 SELECT f1 FROM ( SELECT f1 FROM tab1) a LEFT JOIN tab2 USING (f1)
To completely fix this, I guess we have to recognize JOIN with USING
and JOIN with ON
, which unfortunately doesn't fit current design easily.
The other option is to force querying metadata when it's available. But that's not an elegant solution since this actually can be done by purely looking at SQL as text.
Is your feature request related to a problem? Please describe. Why lineage is not allowed from more than one table or subquery?
Lineage does not work on queries like:
Describe the solution you'd like
instad of
(https://github.com/reata/sqllineage/blob/master/sqllineage/core/holders.py#L442)
This solution may not always work correctly, just the first thing that came to mind.
Describe alternatives you've considered add option as
silent_mode