sqlfluff / sqlfluff

A modular SQL linter and auto-formatter with support for multiple dialects and templated code.
https://www.sqlfluff.com
MIT License
7.39k stars 661 forks source link

Rule AM04 not working correctly for BigQuery with Jinja #5496

Open daczarne opened 6 months ago

daczarne commented 6 months ago

Search before asking

What Happened

Rule check passes for queries that are single SELECT * when the query has Jinja params, but not when the actual params are placed. The result of the render stage is correct.

Expected Behaviour

For the test to fail

Observed Behaviour

Let's say we have the following query:

SELECT * FROM `{{ params.project_id }}.{{ params.dataset.dl }}.djini_app_data_campaign_benefits`

and the rule AM04 has been activated. When running sqlfluff lint we would expect for the result to be a fail, yet it passes.

image


If we now change the actual query file for:

SELECT * FROM `fulfillment-dwh-production.dl_dmart.djini_app_data_campaign_benefits`

the check fails as expected

image


Even more strange is that if the query contains a comment of the Jinja parametrised version of the query, it also passes:

SELECT *
-- FROM `{{ params.project_id }}.{{ params.dataset.dl }}.djini_app_data_campaign_benefits`
FROM `fulfillment-dwh-production.dl_dmart.djini_app_data_campaign_benefits`

image


I've looked at the results of the sqlfluff render command and in all instances it returns valid BigQuery SQL. Seems like the issues comes at the linting stage.

How to reproduce

My exact command is

sqlfluff lint --config dags/dmart/.sqlfluff  dags/dmart/curated_data/sql/_cln_djini_campaign_benefits.sql

Dialect

Bigquery

Version

image

Configuration

[sqlfluff]
dialect = bigquery
disable_noqa = True
encoding = utf-8
fix_even_unparsable = False
ignore_templated_areas = True
large_file_skip_byte_limit = 0
large_file_skip_char_limit = 0
max_line_length = 200
nocolor = False
output_line_length = 80
processes = 1
runaway_limit = 10
sql_file_exts = .sql
templater = jinja
verbose = 2
warn_unused_ignores = False

rules = AM04

[sqlfluff:templater:jinja:context]
params = {"project_id": "fulfillment-dwh-production", "dataset": {"dl": "dl_dmart"}}

Are you willing to work on and submit a PR to address the issue?

Code of Conduct

liudahchris commented 4 months ago

+1 also encountering this issue

liudahchris commented 4 months ago

@daczarne I spent a bunch of time yesterday looking into this (which included leaving various print statements in my local sqlfluff to debug 😅).

The short answer for resolving this issue -- you need to set ignore_templated_areas = False specifically for AM04 to run correctly. The hand-wavy explanation is that even though the SQL is rendering correctly, sqlfluff is detecting templating and thus skipping.

If you want to ignore templated areas sometimes (i.e. for other tests), the workaround I ended up going with was having a separate sqlfluff config specifically for AM04 with ignore_templated_areas = False and running lint twice.

For posterity, this is where the violation is getting dropped: https://github.com/sqlfluff/sqlfluff/blob/main/src/sqlfluff/core/linter/linter.py#L572-L573

daczarne commented 3 months ago

@liudahchris thanks for looking into this.

Do wanna raise two points that maybe can be improved in future versions of SQLFluff.

The first is that (in my opinion) the example in which the tempated code is commented out should pass regardless of whether ignore_templated_areas = False or ignore_templated_areas = True. If it's a comment (regardless of the content of the comment) it should be linted as a comment, not as code.

The second is that unfortunately, at least for my case, setting ignore_templated_areas = False is not a viable solution. The reason behind this is that if I do so, other rules start misbehaving. I start getting errors with RF01 in the following scenario.

I'm using some commonly shared function scripts (UDFs) across my project that get included in the query scripts. One of such is:

{% include '_function_custom_json_extract.sql' %}

where _function_custom_json_extract.sql is:

CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json_column STRING, key_column STRING)
RETURNS STRING
AS (

  CASE
    WHEN json_column = '{}' OR json_column IS NULL THEN NULL
    ELSE NULLIF(
      SPLIT(
        ARRAY(
          SELECT x
          FROM UNNEST(SPLIT(REPLACE(REPLACE(json_column, '"}', ''), '{"', ''), '", "')) AS x
          WHERE INSTR(x, key_column) > 0
        )[SAFE_OFFSET(0)]
        , '": "'
      )[SAFE_OFFSET(1)]
      , ''
    )
  END

);

The error states that:

L:   2 | P:   1 | RF01 | Reference 'key_column' refers to table/view not found in                                                                                                                          
                | the FROM clause or found in ancestor statement.
                | [references.from]

But in the above mentioned function key_column cannot have a reference as it is a parameter of the function.

In general, SQLFluff does not handle all the FROM statements for BigQuery correctly.