ibis-project / ibis-bigquery

BigQuery backend for Ibis
Apache License 2.0
19 stars 18 forks source link

Incorrect generated (bigquery) sql for complex query #86

Open jcmincke opened 3 years ago

jcmincke commented 3 years ago
import pandas as pd
import ibis
import ibis_bigquery

ibis.options.interactive = False

conn = ibis_bigquery.connect(
    project_id=project,
    dataset_id='playground')

client = conn.client

pdf_store = pd.DataFrame({"pk_store": range(5),
                          "store": range(5),
                         })

pdf_sale = pd.DataFrame({'fk_store': range(5),
                          "fk_date": range(5),
                          "a": range(5),
                          "b": range(5),
                          "c": range(5)
                          })

pdf_date = pd.DataFrame({'pk_date': range(5),
                         "date": range(5),
                         })

store_table = 'playground.store'
sale_table = 'playground.sale'
date_table = 'playground.date'

job = client.load_table_from_dataframe(pdf_store, store_table)
job = client.load_table_from_dataframe(pdf_sale, sale_table)
job = client.load_table_from_dataframe(pdf_date, date_table)

store_t = conn.table("store", "playground")
sale_t = conn.table("sale", "playground")
date_t = conn.table("date", "playground")

# table loaded

sale_t = store_t.inner_join(sale_t, [sale_t.fk_store == store_t.pk_store]).materialize() \
    .drop(["fk_store"])

t = sale_t.inner_join(date_t, [date_t.pk_date == sale_t.fk_date]).materialize() \
    .drop(["fk_date", "pk_date"])

t1 = t.group_by(["a"]).aggregate(t.c.sum().name("c1"))
t1 = t1.mutate(b=7)

t2 = t.group_by(["b"]).aggregate(t.c.sum().name("c2"))
t2 = t2.mutate(a=9)

t1 = t.view()  # this can be replaced by t1 = t, or any other sub-expr, same error
t2 = t.view()
r = t1.inner_join(t2, ["a", "b"])[t1.a, t1.b]

print(ibis_bigquery.compile(r))

The generated sql is incorrect, look at the alias on line 15. The error seems to occur when a complex query is translated to sql that contains a WITH clause.

WITH t0 AS (
  SELECT *
  FROM `project.playground.store` t6
    INNER JOIN `project.playground.sale` t7
      ON t7.`fk_store` = t6.`pk_store`
),
t1 AS (
  SELECT `pk_store`, `store`, `fk_date`, `a`, `b`, `c`
  FROM t0
),
t2 AS (
  SELECT *
  FROM t1
    INNER JOIN `project.playground.date` t7
      ON t7.`pk_date` = t0.`fk_date`
),
t3 AS (
  SELECT `pk_store`, `store`, `a`, `b`, `c`, `date`
  FROM t2
)
SELECT t4.`a`, t4.`b`
FROM t3 t4
  INNER JOIN (
    SELECT `pk_store`, `store`, `a`, `b`, `c`, `date`
    FROM t2
  ) t5
    ON (t4.`a` = t5.`a`) AND
       (t4.`b` = t5.`b`) 
jcmincke commented 3 years ago

There is a workaround: convert a sub-expression to sql and then turn it back to a tableExpr:

e1 = ....
e2 = ....
e3 = f(e1, e2)   # sql generation fails

e1g = conn.sql(ibis_bigquery.compile(e1))
e2 = ...
e3 = f(e1g, e2)   # sql generation (might) succeed(s)