reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.34k stars 242 forks source link

Enable support of `sqlfluff` context #548

Closed KulykDmytro closed 7 months ago

KulykDmytro commented 10 months ago

Is your feature request related to a problem? Please describe

As soon as dialect-based parsing use sqlfluff to parse SQL it will be great to support sqlfluff context (from .sqlfluff files) where we can use/set templater and other options

Describe the solution you'd like

sqlfluff.core.linter.Linter have optional config: FluffConfig parameter. Need to pass it or ensure that default is used

Additional context

dmytro.kulyk@MAC-570022 datalake % sqllineage -f transform/airflow/dags/queries/aws_transforms/aggregate/pokermatch_payment.partitions_range.sql
Traceback (most recent call last):
  File "/opt/homebrew/bin/sqllineage", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/sqllineage/cli.py", line 125, in main
    runner.print_table_lineage()
  File "/opt/homebrew/lib/python3.11/site-packages/sqllineage/runner.py", line 175, in print_table_lineage
    print(str(self))
          ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/sqllineage/runner.py", line 26, in wrapper
    self._eval()
  File "/opt/homebrew/lib/python3.11/site-packages/sqllineage/runner.py", line 195, in _eval
    stmt_holder = analyzer.analyze(stmt, session.metadata_provider)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/sqllineage/core/parser/sqlfluff/analyzer.py", line 48, in analyze
    raise UnsupportedStatementException(
sqllineage.exceptions.UnsupportedStatementException: SQLLineage cannot parse SQL:{%- set task_id = 'ph2_transaction.data_processing' %}
{%- set transaction_date = ti.xcom_pull(task_ids=task_id, key='new_partitions') %}
{% if transaction_date|length > 0 %}
    select
        row_num,
        min(as_of_date) as start_date,
        max(as_of_date) as end_date,
        array_agg(as_of_date) as dates
    from (
        select
...

running sqlfluff directly:

dmytro.kulyk@MAC-570022 datalake % sqlfluff parse transform/airflow/dags/queries/aws_transforms/aggregate/pokermatch_payment.partitions_range.sql --verbose | more
==== sqlfluff ====
sqlfluff:                2.3.5 python:                 3.11.6
implementation:        cpython verbosity:                   1
dialect:                athena templater:               jinja
rules:                                all
=== [ path: transform/airflow/dags/queries/aws_transforms/aggregate/pokermatch_payment.partitions_range.sql ] ===

[L:  1, P:  1]      |file:
[L:  1, P:  1]      |    [META] placeholder:                                       [Type: 'templated', Raw: "{%- set task_id = 'ph2_transaction.data_processing' %}"]
[L:  1, P: 55]      |    [META] placeholder:                                       [Type: 'literal', Raw: '\n']
[L:  2, P:  1]      |    [META] placeholder:                                       [Type: 'templated', Raw: "{%- set transaction_date = ti.xcom_pull(task_ids=task_id, key='new_partitions') %}"]
[L:  2, P: 83]      |    newline:                                                  '\n'
[L:  3, P:  1]      |    [META] placeholder:                                       [Type: 'block_start', Raw: '{% if transaction_date|length > 0 %}', Block: 'd42817']
[L:  3, P: 37]      |    [META] indent:                                            [Block: 'd42817']
[L:  3, P: 37]      |    newline:                                                  '\n'
[L:  4, P:  1]      |    whitespace:                                               '    '
[L:  4, P:  5]      |    statement:
[L:  4, P:  5]      |        select_statement:
[L:  4, P:  5]      |            select_clause:
[L:  4, P:  5]      |                keyword:                                      'select'
[L:  4, P: 11]      |                [META] indent:
[L:  4, P: 11]      |                newline:                                      '\n'
[L:  5, P:  1]      |                whitespace:                                   '        '
[L:  5, P:  9]      |                select_clause_element:
[L:  5, P:  9]      |                    column_reference:
[L:  5, P:  9]      |                        naked_identifier:                     'row_num'
...

.sqlfluff example

[sqlfluff]
templater = jinja
output_line_length = 180
exclude_rules = AL07, AM02, AM03, AM05, CV10, LT08, ST05, ST06, ST07
warnings = RF01
dialect = athena
large_file_skip_char_limit = 0
large_file_skip_byte_limit = 0
max_line_length = 120

[sqlfluff:rules:references.consistent]
single_table_references = consistent

[sqlfluff:rules:references.keywords]
ignore_words =
    account, domain, language, operator, rank, result, source, type, uid, position, format, state, offset, name,
    location, role, user, comment, method, date, quarter, hour, week, weekday, days

[sqlfluff:templater:jinja]
library_path = operation/deploy/lint
apply_dbt_builtins = false
reata commented 10 months ago

It makes sense to be support analyzing templated SQL so lineage can analyzed based on what's in orchestration system instead of merely on SQL logs.

Quick question: do you require any configuration in sqlfluff other than templating? Because supports all sqlfluff configuration looks like a overkill for me.

KulykDmytro commented 9 months ago

we use both sqlfluff and sqllineage (which is using sqlfluff) and looks like templating configuration is ignored

reata commented 9 months ago

Thanks for your confirmation. I'll see what we can do.