tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.09k stars 611 forks source link

support for jinja2/dbt #1713

Closed RobbertDM closed 1 year ago

RobbertDM commented 1 year ago

Hey there,

I'm using sqlglot to go from oracle to trino, but in dbt. There's some specific stuff that I replace in the AST for {{ dbt macros }}, but as soon as I render it to SQL, it becomes unparseable by sqlglot because it's SQL with jinja mixed in rather than pure vanilla SQL.

I've seen that there is a tokenizer test for tokenizing jinja: https://github.com/tobymao/sqlglot/blob/1b1d9f260c95d2e8815d8a7c039fb125e24b4134/tests/test_tokens.py#L70-L76 And that some tokens like {%: BLOCK_START have been defined.

While the tokenizer works fine, parsing jinja-sql does not. Using the test itself,

jinja = """
            SELECT
               {{ x }},
               {{- x -}},
               {# it's a comment #}
               {% for x in y -%}
               a {{+ b }}
               {% endfor %};
        """

tokenizer.tokenize(jinja) works as expected, sqlglot.parse_one(jinja) tells me

ParseError: Invalid expression / Unexpected token. Line 4, Col: 8.

  SELECT
     {{ x }},
     {{- x -}},
     {# it's a comment #}
     {% for x in y -%}
     a {{+ b }}
     {% endfor %};

How would you recommend to parse jinja-sql? Would it be possible to make a custom dialect? It would be of tremendous help even if it would build a normal AST with unparsed jinja-blocks inbetween.

georgesittas commented 1 year ago

Jinja-SQL in general can't be parsed; you have to render it first in order to be able to extract a meaningful AST representation. SQLGlot only tokenizes the jinja delimiters so that one can handle such code at the token level using application-specific logic. 

If your use case is trivial, e.g. you only expect jinja expressions like {{ ... }} as projections, then you can create a custom dialect like you said and represent these expressions in the AST. However, I don't think this approach will be of any help if you want to do more advanced stuff like control flow.

RobbertDM commented 1 year ago

Hi @GeorgeSittas , thanks for the quick response. My use case is indeed trivial, not doing anything complicated. Excuse my ignorance, but I'm currently failing to make such a custom dialect.

I've tried adding {{ and }} to the tokenizer's keywords, but that doesn't help, still got: Invalid expression / Unexpected token..

What would I need to change to make this work? Should I extend the Parser class as well?

georgesittas commented 1 year ago

Should I extend the Parser class as well?

That's correct! So, there are two ways to go about this at the token level:

You also have to represent these expressions, e.g. by introducing a JinjaExpression class, and finally add logic in order to parse and generate SQL for them. Check how we handle Placeholder expressions, I think the logic will be similar.

florian-ernst-alan commented 1 year ago

I'd need something around that as well. The tokenizer works fine, but I struggle to understand how I can parse this expression correctly.

I created a JinjaExpression with an expressions arg_type, but I can't find where or how I should say "Between {{ and }}, create a JinjaExpression expressions=the parsed content".

georgesittas commented 1 year ago

Take a look at how we handle "placeholders" for the ClickHouse dialect:

https://github.com/tobymao/sqlglot/blob/5d5795d5ac7812790e95befa54b5c2bc10757934/sqlglot/dialects/clickhouse.py#L131-L150

You'd also need to update the placeholder_sql method for your use case, but this is more or less what you want to do (+ parse two L_BRACE and R_BRACE instead of just one).

florian-ernst-alan commented 1 year ago

But then how do you say "parse anything until the next }}"? For placeholders, there's a strict syntax that make it easy to match all possible tokens. I'm not sure how to handle that with an arbitrary expression.

Thanks for the reference though! I'm going to take a deeper look.

tobymao commented 1 year ago

that's what makes this challenging, you can't really do that. it'd be easier to change the tokenizer to tokenize the whole jinja block as a string

georgesittas commented 1 year ago

For placeholders, there's a strict syntax that make it easy to match all possible tokens. I'm not sure how to handle that with an arbitrary expression.

That's exactly right; like I said: Jinja-SQL is generally unparseable, so the above should only work if you're expecting to work with simple forms like {{ some_identifier }} or {{ a + simple * expression }}.

florian-ernst-alan commented 1 year ago

Okay, I get it now - thanks for the clear answer! I'll try to find a way to make it work :)

florian-ernst-alan commented 1 year ago

Now that I think about it, my scope is actually far smaller: I'm trying to parse Metabase queries that can have parameters. Curly brackets can only have 1 identifier inside, square brackets necessarily have valid SQL inside (if you remove the square brackets, the SQL must work).

Placeholder looks like a good way to handle curly brackets. Not so sure about square ones.

georgesittas commented 1 year ago

Placeholder looks like a good way to handle curly brackets. Not so sure about square ones.

If it's just arbitrary (valid) SQL, you could use the top-level parser to handle them. So, something like:

florian-ernst-alan commented 1 year ago

I'm trying to apply some transformations to the SQL code, so I won't be able to simply consume the tokens - I actually need the square brackets to stay in the AST so I can transform it back to SQL later.

I'm trying a bunch of stuff right now - the {{ }} worked perfectly fine by leveraging placeholder.