Open gzukrigl opened 1 year ago
Thanks for reporting this. I would expect both of them to output result as sample2 as well. The difference comes from the fact that we never wrote logic to analyze EXISTS operator so what's in EXISTS are completely ignored.
Personally I never use this feature much in daily life as oftentimes I found it hard to understand when I see other people using it. But as a library, we certainly need to support this. PR is always welcome.
Just want to mention that we're working on a huge PR for 1.4.x release, to replace the current parser sqlparse
to sqlfluff
, which would change the code structure drastically and eventually deprecate sqlparse
in 1.5.x releases. To fix this issue, inevitably you'll work with sqlparse
. So if you decide it's not worth your time doing it with the soon-to-be legacy code, you're welcome to try later.
sqlfluff PR has been merged to master. Feel free to try with current code. Any contribution is appreciated.
Dear reata,
Obviously i’m smart enough to find the issue:
Select * from x where exists (select 1 from b)
Brings up two read tables x and b
Select * from x where exists(select 1 from b)
Does not.
I’m also smart enough to find a fix for it:
def get_subquery_parentheses(
token: Union[Identifier, Function, Where]
) -> List[SubQueryTuple]:
"""
Retrieve subquery list
the returned list is either empty when no subquery parsed or list of [parenthesis, alias] tuple
"""
subquery = []
asidx, as = token.token_next_by(m=(Keyword, "AS"))
sublist = list(token.get_sublists())
if as_ is not None and len(sublist) == 1:
# CTE: tbl AS (SELECT 1)
target = sublist[0]
else:
if isinstance(token, Function):
# CTE without AS: tbl (SELECT 1)
target = token.tokens[-1]
elif isinstance(token, Where):
# WHERE col1 IN (SELECT max(col1) FROM tab2)
target = token
else:
# normal subquery: (SELECT 1) tbl
target = token.token_first(skip_cm=True)
if isinstance(target, (Case, Where)):
# CASE WHEN (SELECT count() from tab1) > 0 THEN (SELECT count() FROM tab1) ELSE -1
for tk in target.get_sublists():
if isinstance(tk, Comparison):
if is_subquery(tk.left):
subquery.append(SubQueryTuple(tk.left, tk.left.get_real_name()))
if is_subquery(tk.right):
subquery.append(SubQueryTuple(tk.right, tk.right.get_real_name()))
elif isinstance(tk, Function) and tk.get_real_name().lower() == 'exists':
for tk in tk.tokens:
if isinstance(tk, Parenthesis):
subquery.append(SubQueryTuple(tk, token.get_real_name()))
elif is_subquery(tk):
subquery.append(SubQueryTuple(tk, token.get_real_name()))
elif is_subquery(target):
target = remove_parenthesis_between_union(target)
subquery = [
SubQueryTuple(get_innermost_parenthesis(target), token.get_real_name())
]
return subquery
Unfortunately I’m not smart enough to use git correctly.
Can you tell me how to contribute the new code?
Best regards,
Gerald
Von: reata @.> Antworten an: reata/sqllineage @.> Datum: Samstag, 11. März 2023 um 15:08 An: reata/sqllineage @.> Cc: gzukrigl @.>, Author @.***> Betreff: Re: [reata/sqllineage] incomplete source recognition in subselect (Issue #331)
sqlfluff PR has been merged to master. Feel free to try with current code. Any contribution is appreciated.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
If git is the only thing standing on your way, feel free to refer to GitHub docs:
On high level, you would want to fork my repo (make your own copy), create a branch (create a working space on your copy for this exists operator support feature), commit to your branch (make code change, looks like the code is ready), then submit a pull request (so I can review your change in my repo, seeing all the code diff).
First of all, thank you for your efforts. The library is really of great use for my work and the mechanism with the static lineage analysis works really very well. Unfortunatly I've found a situation, where the source columns cannot be resolved in a subquery. Sample 1: This is how our data warehouse is implemented a row level security
sqllineage
recognizes just this lineage graph:Sample 2:
Results of sample 2:
Expected behaviour: Both statements should lead to the same lineage results (where I think that the results von sample2 are more correct)
I'm not sure whether you'd expect this too. If so I can give it a try to solve this trouble.
Best Regards, Gerald