tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.7k stars 701 forks source link

Presto to spark conversion flattens subqueries with invalid ordering #3488

Closed williaster closed 5 months ago

williaster commented 5 months ago

I have this example with two nested subqueries, where the second references the first, which works if my output dialect is presto, but once the output is spark/spark2, the subqueries are flattened and the resulting order is invalid (subqueries reference subqueries that are below them)

Fully reproducible code snippet Base configuration

import sqlglot

query = (
    sqlglot.select("*")
    .from_("subquery2")
    .with_("subquery1", as_="with tmp as (select * from table0) select * from tmp")
    .with_("subquery2", as_="with tmp2 as (select * from table1 where a in subquery1) select * from tmp2")
)

query.sql(dialect="presto") yields a valid query (subquery1 referenced inside tmp2 is valid)

WITH subquery1 AS (
  WITH tmp AS (
    SELECT
      *
    FROM table0
  )
  SELECT
    *
  FROM tmp
), subquery2 AS (
  WITH tmp2 AS (
    SELECT
      *
    FROM table1
    WHERE
      a IN subquery1
  )
  SELECT
    *
  FROM tmp2
)
SELECT
  *
FROM subquery2

query.sql(dialect="spark") yields an invalid query (subquery1 referenced inside tmp2 is invalid as subquery1 is not yet defined):

WITH tmp2 AS (
  SELECT
    *
  FROM table1
  WHERE
    a IN subquery1
), tmp AS (
  SELECT
    *
  FROM table0
), subquery1 AS (
  SELECT
    *
  FROM tmp
), subquery2 AS (
  SELECT
    *
  FROM tmp2
)
SELECT
  *
FROM subquery2
tobymao commented 5 months ago

@williaster y'all should make a best practices doc to not write nested CTEs anymore

williaster commented 5 months ago

noted, thank you for the fix regardless of best practice 🙏