Parser/compiler backend for a SQL tag string

jimbaker commented 1 year ago

https://github.com/tobymao/sqlglot supports parsing SQL, in various dialects, into a usable AST:

>>> x = sqlglot.parse_one('select x$0x from x$1x where x$2x=x$3x')
>>> x
(SELECT expressions:
  (COLUMN this:
    (IDENTIFIER this: x$0x, quoted: False)), from:
  (FROM this:
    (TABLE this:
      (IDENTIFIER this: x$1x, quoted: False))), where:
  (WHERE this:
    (EQ this:
      (COLUMN this:
        (IDENTIFIER this: x$2x, quoted: False)), expression:
      (COLUMN this:
        (IDENTIFIER this: x$3x, quoted: False)))))
>>> x.sql()
'SELECT x$0x FROM x$1x WHERE x$2x = x$3x'

So this should suffice for capturing placeholders (note $xNx, this allows us to keep track of the placeholder number N), then provide a correct interpretation as we are currently doing with marker types in https://github.com/jimbaker/tagstr/blob/main/src/tagstr/sql.py, but with a compiler. This means being able to write

sql'create table {table_name} (name, first_appeared)'

and the parser identifies that {table_name} is an identifier, after the placeholder substitution:

>>> sqlglot.parse_one('create table x$0x (name, first_appeared)')
(CREATE this:
  (SCHEMA this:
    (TABLE this:
      (IDENTIFIER this: x$0x, quoted: False)), expressions:
    (IDENTIFIER this: name, quoted: False),
    (IDENTIFIER this: first_appeared, quoted: False)), kind: table)

pauleveritt commented 1 year ago

What other ideas might you have for this, beyond the kinds of regular DSL points made in the tutorial and PEP?

jimbaker commented 1 year ago

What other ideas might you have for this, beyond the kinds of regular DSL points made in the tutorial and PEP?

In the tutorial, we have this specific match:

            case getvalue, raw, _, _:
                match value := getvalue():
                    case SQL() | Identifier():
                        parts.append(value)
                    case _:
                        parts.append(Param(raw, value))

which results in writing statements like so:

        cur.execute(*sql'create table {Identifier(table_name)} (name, first_appeared)')

because we cannot disambiguate what table_name is without an AST parser, given the complexity of SQL.

Having said, there are counterexamples where it's still necessary, such as the where clause on an expression - is it an identifier or a value? One could do something like

where "{identifier}" = {value}

(remembering the supposed distinction between double quotes and single quotes in SQL, but see https://sqlite.org/quirks.html; SQL Server does something different, of course).

In this case an explicit Identifier(expr) can be useful.

Compare this to HTML, where there's perhaps a more natural mapping, say a dict here means to have this interpolation in style={some_value}. But even there, likely gotchas to chose the correct interpolation.

jimbaker commented 1 year ago

Interestingly the popular ETL tool dbt uses Jinja to templatize SQL. It also manages one's SQL queries and corresponding models, and allows them to be composed together more easily. All managed with git, at least with an add-on.

One take: https://codingisforlosers.com/wtf-is-dbt/ (amusing name for a web site, still some good content on usage).

There also seems to be a fair amount of overlap between the dbt approach and what we have discussing for FDOM.

jimbaker commented 1 year ago

While I had heard of dbt before, I had not actually looked it until today, after scrolling through my X/Twitter feed and seeing both templated SQL and tool support for it: https://twitter.com/imrobertyi/status/1691822118411907480

pauleveritt commented 1 year ago

That line bring a software-engineering-style discipline to your data analysis work really, really hit the mark for me. What a great phrasing.

pauleveritt / fdom

Parser/compiler backend for a SQL tag string #5