reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.28k stars 232 forks source link

incomplete source recognition in EXISTS Operator #331

Open gzukrigl opened 1 year ago

gzukrigl commented 1 year ago

First of all, thank you for your efforts. The library is really of great use for my work and the mechanism with the static lineage analysis works really very well. Unfortunatly I've found a situation, where the source columns cannot be resolved in a subquery. Sample 1: This is how our data warehouse is implemented a row level security

CREATE VIEW bal_dbo.vw_Persons AS
SELECT
    src.No No
    , src.firstName firstName
    , src.lastName lastName
    , src.sys_tenant sys_tenant
FROM dbo.vw_Persons src

where exists(
    select 1 from sysperm.permission p
    inner join sysperm.[right]   r on r.identcode  = p.rightid
    inner join sysperm.[user]    u on u.identcode  = p.userid AND u.UserName = CURRENT_USER
    inner join sysperm.tenant    t on t.identcode  = p.tenantid and t.sys_tenant = src.sys_tenant
    inner join sysperm.subject   s on s.identcode  = p.subjectid and s.[schema] = 'dbo' and s.name='vw_Persons'
)
;

sqllineage recognizes just this lineage graph:

Statement #1: CREATE VIEW bal_dbo.vw_Persons ASSELECT    src.No ...
    table read: [Table: dbo.vw_persons]
    table write: [Table: bal_dbo.vw_persons]
    table cte: []
    table drop: []
    table rename: []

Sample 2:

CREATE VIEW bal_dbo2.vw_Personen AS
SELECT
    src.No No
    , src.firstName firstName
    , src.lastName lastName
    , src.sys_tenant sys_tenant
FROM dbo2.vw_Persons src
left join (
    select 1 exists from sysperm.permission p
    inner join sysperm.[right]   r on r.identcode  = p.rightid
    inner join sysperm.[user]    u on u.identcode  = p.userid AND u.UserName = CURRENT_USER
    inner join sysperm.tenant    t on t.identcode  = p.tenantid and t.sys_tenant = src.sys_tenant
    inner join sysperm.subject   s on s.identcode  = p.subjectid and s.[schema] = 'dbo2' and s.name='vw_Persons'
) perm
where perm.exists is not null
;

Results of sample 2:

Statement #2: CREATE VIEW bal_dbo2.vw_Persons ASSELECT    src.No...
    table read: [Table: dbo2.vw_persons, Table: sysperm.[right], Table: sysperm.[user], Table: sysperm.permission, Table: sysperm.subject, Table: sysperm.tenant]
    table write: [Table: bal_dbo2.vw_persons]
    table cte: []
    table drop: []
    table rename: []

Expected behaviour: Both statements should lead to the same lineage results (where I think that the results von sample2 are more correct)

I'm not sure whether you'd expect this too. If so I can give it a try to solve this trouble.

Best Regards, Gerald

reata commented 1 year ago

Thanks for reporting this. I would expect both of them to output result as sample2 as well. The difference comes from the fact that we never wrote logic to analyze EXISTS operator so what's in EXISTS are completely ignored.

Personally I never use this feature much in daily life as oftentimes I found it hard to understand when I see other people using it. But as a library, we certainly need to support this. PR is always welcome.

Just want to mention that we're working on a huge PR for 1.4.x release, to replace the current parser sqlparse to sqlfluff, which would change the code structure drastically and eventually deprecate sqlparse in 1.5.x releases. To fix this issue, inevitably you'll work with sqlparse. So if you decide it's not worth your time doing it with the soon-to-be legacy code, you're welcome to try later.

reata commented 1 year ago

sqlfluff PR has been merged to master. Feel free to try with current code. Any contribution is appreciated.

gzukrigl commented 1 year ago

Dear reata,

Obviously i’m smart enough to find the issue:

Select * from x where exists (select 1 from b)

Brings up two read tables x and b

Select * from x where exists(select 1 from b)

Does not.

I’m also smart enough to find a fix for it:

def get_subquery_parentheses(

    token: Union[Identifier, Function, Where]

) -> List[SubQueryTuple]:

    """

    Retrieve subquery list

    the returned list is either empty when no subquery parsed or list of [parenthesis, alias] tuple

    """

    subquery = []

    asidx, as = token.token_next_by(m=(Keyword, "AS"))

    sublist = list(token.get_sublists())

    if as_ is not None and len(sublist) == 1:

        # CTE: tbl AS (SELECT 1)

        target = sublist[0]

    else:

        if isinstance(token, Function):

            # CTE without AS: tbl (SELECT 1)

            target = token.tokens[-1]

        elif isinstance(token, Where):

            # WHERE col1 IN (SELECT max(col1) FROM tab2)

            target = token

        else:

            # normal subquery: (SELECT 1) tbl

            target = token.token_first(skip_cm=True)

    if isinstance(target, (Case, Where)):

        # CASE WHEN (SELECT count() from tab1) > 0 THEN (SELECT count() FROM tab1) ELSE -1

        for tk in target.get_sublists():

            if isinstance(tk, Comparison):

                if is_subquery(tk.left):

                    subquery.append(SubQueryTuple(tk.left, tk.left.get_real_name()))

                if is_subquery(tk.right):

                    subquery.append(SubQueryTuple(tk.right, tk.right.get_real_name()))

            elif isinstance(tk, Function) and tk.get_real_name().lower() == 'exists':

                for tk in tk.tokens:

                    if isinstance(tk, Parenthesis):

                        subquery.append(SubQueryTuple(tk, token.get_real_name()))

            elif is_subquery(tk):

                subquery.append(SubQueryTuple(tk, token.get_real_name()))

    elif is_subquery(target):

        target = remove_parenthesis_between_union(target)

        subquery = [

            SubQueryTuple(get_innermost_parenthesis(target), token.get_real_name())

        ]

    return subquery

Unfortunately I’m not smart enough to use git correctly.

Can you tell me how to contribute the new code?

Best regards,

Gerald

Von: reata @.> Antworten an: reata/sqllineage @.> Datum: Samstag, 11. März 2023 um 15:08 An: reata/sqllineage @.> Cc: gzukrigl @.>, Author @.***> Betreff: Re: [reata/sqllineage] incomplete source recognition in subselect (Issue #331)

sqlfluff PR has been merged to master. Feel free to try with current code. Any contribution is appreciated.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

reata commented 1 year ago

If git is the only thing standing on your way, feel free to refer to GitHub docs:

On high level, you would want to fork my repo (make your own copy), create a branch (create a working space on your copy for this exists operator support feature), commit to your branch (make code change, looks like the code is ready), then submit a pull request (so I can review your change in my repo, seeing all the code diff).