ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.36k stars 540 forks source link

bug: unexpected SQL generated for multiple join #5514

Closed judahrand closed 1 year ago

judahrand commented 1 year ago

What happened?

I run this code snippet:

lvl1 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl2_key', 'int64')])
lvl2 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl3_key', 'int64')])
lvl3 = ibis.table([('id', 'int64'), ('slug', 'string')])

expr = (
    lvl1
    .join(lvl2, lvl2.id == lvl1.lvl2_key)
    .join(lvl3, lvl3.id == lvl2.lvl3_key)
    .select(
        lvl1.id.name('lvl1_id'),
        lvl1.slug.name('lvl1_name'),
        lvl2.slug.name('lvl2_name'),
        lvl3.slug.name('lvl3_name'),
    )
)
ibis.show_sql(expr)

which generates the following SQL:

SELECT
  t3.id AS lvl1_id,
  t3.slug AS lvl1_name,
  t2.slug AS lvl2_name,
  t1.slug AS lvl3_name
FROM unbound_table_1 AS t3, unbound_table_2 AS t2, (
  SELECT
    t2.id AS id_x,
    t2.slug AS slug_x,
    t2.lvl2_key AS lvl2_key,
    t3.id AS id_y,
    t3.slug AS slug_y,
    t3.lvl3_key AS lvl3_key
  FROM unbound_table_1 AS t2
  JOIN unbound_table_2 AS t3
    ON t3.id = t2.lvl2_key
) AS t0
JOIN unbound_table_3 AS t1
  ON t1.id = t2.lvl3_key

when I would expect:

SELECT
  t0.id AS lvl1_id,
  t0.slug AS lvl1_name,
  t1.slug AS lvl2_name,
  t2.slug AS lvl3_name
FROM unbound_table_1 AS t0
JOIN unbound_table_2 AS t1
  ON t1.id = t0.lvl2_key
JOIN unbound_table_3 AS t2
  ON t2.id = t1.lvl3_key

The way the SQL is currently written causes problems when the expression is reused in subsequent joins.

What version of ibis are you using?

4.1.0

What backend(s) are you using, if any?

DuckDB, BigQuery

Relevant log output

No response

Code of Conduct

krzysztof-kwitt commented 1 year ago

Ibis doesn't always generate the most simplistic SQL, because we use SQL ORM under-hood, but you can use SQLGlot to optimize your SQL query. Your SQLEngine should also generate an identical query plan for both queries

import sqlglot

from sqlglot.optimizer import optimize
import ibis

lvl1 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl2_key', 'int64')])
lvl2 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl3_key', 'int64')])
lvl3 = ibis.table([('id', 'int64'), ('slug', 'string')])

expr = (
    lvl1
    .join(lvl2, lvl2.id == lvl1.lvl2_key)
    .join(lvl3, lvl3.id == lvl2.lvl3_key)
    .select(
        lvl1.id.name('lvl1_id'),
        lvl1.slug.name('lvl1_name'),
        lvl2.slug.name('lvl2_name'),
        lvl3.slug.name('lvl3_name'),
    )
)
sql = str(ibis.to_sql(expr))
print("SQL")
print(sql)
print()
sqlglot_schema = {
    lvl1.get_name(): {"id": "INT", "slug": "STRING", "lvl2_key": "INT"},
    lvl2.get_name(): {"id": "INT", "slug": "STRING", "lvl3_key": "INT"},
    lvl3.get_name(): {"id": "INT", "slug": "STRING"},
}
print("SQLGlot Schema")
from pprint import pprint
pprint(sqlglot_schema)
print()
print("Optimized SQL")
print(
    optimize(
        sqlglot.parse_one(sql),
        schema=sqlglot_schema
    ).sql(pretty=True)
)
SQL
SELECT
  t3.id AS lvl1_id,
  t3.slug AS lvl1_name,
  t2.slug AS lvl2_name,
  t1.slug AS lvl3_name
FROM unbound_table_0 AS t3, unbound_table_1 AS t2, (
  SELECT
    t2.id AS id_x,
    t2.slug AS slug_x,
    t2.lvl2_key AS lvl2_key,
    t3.id AS id_y,
    t3.slug AS slug_y,
    t3.lvl3_key AS lvl3_key
  FROM unbound_table_0 AS t2
  JOIN unbound_table_1 AS t3
    ON t3.id = t2.lvl2_key
) AS t0
JOIN unbound_table_2 AS t1
  ON t1.id = t2.lvl3_key

SQLGlot Schema
{'unbound_table_0': {'id': 'INT', 'lvl2_key': 'INT', 'slug': 'STRING'},
 'unbound_table_1': {'id': 'INT', 'lvl3_key': 'INT', 'slug': 'STRING'},
 'unbound_table_2': {'id': 'INT', 'slug': 'STRING'}}

Optimized SQL
SELECT
  "unbound_table_0"."id" AS "lvl1_id",
  "unbound_table_0"."slug" AS "lvl1_name",
  "unbound_table_1"."slug" AS "lvl2_name",
  "unbound_table_2"."slug" AS "lvl3_name"
FROM "unbound_table_0" AS "unbound_table_0"
CROSS JOIN "unbound_table_1" AS "unbound_table_1"
JOIN "unbound_table_2" AS "unbound_table_2"
  ON "unbound_table_2"."id" = "unbound_table_1"."lvl3_key"
CROSS JOIN "unbound_table_0" AS "unbound_table_0_2"
JOIN "unbound_table_1" AS "unbound_table_1_2"
  ON "unbound_table_1_2"."id" = "unbound_table_0_2"."lvl2_key"
gforsyth commented 1 year ago

Hey @judahrand -- we have some ongoing tasks to improve and optimize the generated SQL that Ibis produces, but the above SQL looks correct, if not optimal, so I'm going to close this out. I'll also add that with backends like BigQuery and DuckDB, they will be optimizing that SQL a LOT, so I don't imagine you'll see a performance difference between the Ibis generated SQL and the optimal SQL.

tobymao commented 1 year ago

schema is now optional when calling optimize as long as there is no ambiguity

judahrand commented 1 year ago

@krzysztof-kwitt and @gforsyth My original example didn't quite show the bug could you please have a look at this one? It doesn't seem to happen when using the DuckDB backend but does on BigQuery. Here is the example:

itembases = client.table('iitembase', 'dataset')
subcategories = client.table('subcategory', 'dataset')
categories = client.table('category', 'dataset')
parentcategories = client.table('parentcategory', 'dataset')

# itembases = ibis.table([('id', 'int64'), ('subcategory_id', 'int64')])
# subcategories = ibis.table([('id', 'int64'), ('slug', 'string'), ('category_id', 'int64')])
# categories = ibis.table([('id', 'int64'), ('slug', 'string'), ('parent_id', 'int64')])
# parentcategories = ibis.table([('id', 'int64'), ('slug', 'string')])

category_hierarchy = (
        subcategories
        .join(categories, subcategories.category_id == categories.id)
        .join(parentcategories, categories.parent_id == parentcategories.id)
        .select(
            subcategories.id,
            subcategories.slug.name('subcategory'),
            categories.slug.name('category'),
            parentcategories.slug.name('parent_category'),
        )
    )

expr = (
    itembases
    .join(category_hierarchy, itembases.subcategory_id == category_hierarchy.id)
    .select(
        itembases,
        category_hierarchy.subcategory,
        category_hierarchy.category,
        category_hierarchy.parent_category,
    )
)

ibis.show_sql(expr)

Results in this in BigQuery:

SELECT
  t0.`id`,
  t1.`subcategory`,
  t1.`category`,
  t1.`parent_category`
FROM `project_id.dataset.itembase` AS t0
INNER JOIN (
  SELECT
    `id`,
    `slug` AS `subcategory`,
    `slug` AS `category`,
    t3.`slug` AS `parent_category`
  FROM (
    SELECT
      t4.`id` AS `id_x`,
      t4.`slug` AS `slug_x`,
      t4.`name` AS `name_x`,
      t4.`category_id`,
      t5.`id` AS `id_y`,
      t5.`slug` AS `slug_y`,
      t5.`name` AS `name_y`,
      t5.`parent_id`
    FROM `project_id.dataset.subcategory` AS t4
    INNER JOIN `project_id.dataset.category` AS t5
      ON t4.`category_id` = t5.`id`
  ) AS t2
  INNER JOIN `project_id.dataset.parentcategory` AS t3
    ON `parent_id` = t3.`id`
) AS t1
  ON t0.`subcategory_id` = t1.`id`

but this in DuckDB:

SELECT
  t0.id,
  t1.subcategory,
  t1.category,
  t1.parent_category
FROM unbound_table_4 AS t0
JOIN (
  SELECT
    t5.id AS id,
    t5.slug AS subcategory,
    t4.slug AS category,
    t3.slug AS parent_category
  FROM unbound_table_5 AS t5, unbound_table_6 AS t4, (
    SELECT
      t4.id AS id_x,
      t4.slug AS slug_x,
      t4.category_id AS category_id,
      t5.id AS id_y,
      t5.slug AS slug_y,
      t5.parent_id AS parent_id
    FROM unbound_table_5 AS t4
    JOIN unbound_table_6 AS t5
      ON t4.category_id = t5.id
  ) AS t2
  JOIN unbound_table_7 AS t3
    ON t4.parent_id = t3.id
) AS t1
  ON t0.subcategory_id = t1.id

The SQL for BigQuery is incorrect. I'm not quite sure about the DuckDB SQL but it looks right.

judahrand commented 1 year ago

I've narrowed this down to a bug in ibis.backends.base.sql.Compiler which is not present in AlchemyCompiler

judahrand commented 1 year ago
subcategories = ibis.table([('id', 'int64'), ('slug', 'string'), ('category_id', 'int64')])
categories = ibis.table([('id', 'int64'), ('slug', 'string'), ('parent_id', 'int64')])
parentcategories = ibis.table([('id', 'int64'), ('slug', 'string')])

# %%
expr = (
        subcategories
        .join(categories, subcategories.category_id == categories.id)
        .join(parentcategories, categories.parent_id == parentcategories.id)
        .select(
            subcategories.id,
            subcategories.slug.name('subcategory'),
            categories.slug.name('category'),
            parentcategories.slug.name('parent_category'),
        )
    )

print(ibis.backends.base.sql.Compiler.to_sql(expr))

Resulting in the incorrect SQL:

SELECT
  `id`,
  `slug` AS `subcategory`,
  `slug` AS `category`,
  t1.`slug` AS `parent_category`
FROM (
  SELECT
    t2.`id` AS `id_x`,
    t2.`slug` AS `slug_x`,
    t2.`category_id`,
    t3.`id` AS `id_y`,
    t3.`slug` AS `slug_y`,
    t3.`parent_id`
  FROM unbound_table_0 AS t2
  INNER JOIN unbound_table_1 AS t3
    ON t2.`category_id` = t3.`id`
) AS t0
INNER JOIN unbound_table_2 AS t1
  ON `parent_id` = t1.`id`

You can see that the top level SELECT is wrong since slug AS subcategory, slug AS category will result in the same data twice. This is the worst kind of error as it will be incorrect silently!

judahrand commented 1 year ago

For the same expression the AlchemyCompiler maybe gets it right with:

SELECT
  t3.id,
  t3.slug AS subcategory,
  t2.slug AS category,
  t1.slug AS parent_category
FROM (
  SELECT
    t2.id AS id_x,
    t2.slug AS slug_x,
    t2.category_id AS category_id,
    t3.id AS id_y,
    t3.slug AS slug_y,
    t3.parent_id AS parent_id
  FROM unbound_table_0 AS t2
  JOIN unbound_table_1 AS t3
    ON t2.category_id = t3.id
) AS t0, unbound_table_0 AS t3, unbound_table_1 AS t2
JOIN unbound_table_2 AS t1
  ON t2.parent_id = t1.id

but to be honest that SQL also looks quite odd to me given that t0 just isn't used.

judahrand commented 1 year ago

Having tested the AlchemyCompiler SQL it is also incorrect!

judahrand commented 1 year ago

@gforsyth Is there something wrong with how I am expressing the joins? It feels to me like this is a bug with a combination of 'multiple joins' with 'overlapping columns'.

gforsyth commented 1 year ago

Hey @judahrand -- there may very well still be a bug in the compiler, but there are a few things to do here to clean up your query for Ibis.

Referring to columns in a previously-chained join by their parent table is not well defined (we're planning to remove the ability to do this at all).

expr = subcategories.join(categories, subcategories.category_id == categories.id).select(subcategories.category_id)

can cause problems. For this case, you'll want to use deferred (https://ibis-project.org/api/expressions/top_level/?h=deferred) to refer to the columns of previously chained operations, e.g.

from ibis import _

expr = subcategories.join(categories, subcategories.category_id == categories.id).select(_.category_id)

Also, when tables have colliding column names, the automatically get suffixed with (by default) _x and _y (for the left and right tables, respectively). So when you refer to subcategories.id in the joined tables, that doesn't make a whole lot of sense to ibis, since there's instead going to be a column called (maybe) id_x instead.

If I rename the columns in your example query to have no name collisions (just to avoid a bunch of intermediate renaming) and use the Deferred _ instead, I get this:


[ins] In [19]: subcategories = ibis.table(
          ...:     [("s_id", "int64"), ("s_slug", "string"), ("category_id", "int64")]
          ...: )
          ...: categories = ibis.table(
          ...:     [("c_id", "int64"), ("c_slug", "string"), ("parent_id", "int64")]
          ...: )
          ...: parentcategories = ibis.table([("p_id", "int64"), ("p_slug", "string")])

[ins] In [20]: expr = (
          ...:     subcategories.join(categories, subcategories.category_id == categories.c_id)
          ...:     .join(parentcategories, _.parent_id == parentcategories.p_id)
          ...:     .select(
          ...:         _.s_id,
          ...:         _.s_slug.name("subcategory"),
          ...:         _.c_slug.name("category"),
          ...:         _.p_slug.name("parent_category"),
          ...:     )
          ...: )

[ins] In [21]: ibis.show_sql(expr, dialect="duckdb")
SELECT
  t0.s_id,
  t0.s_slug AS subcategory,
  t1.c_slug AS category,
  t2.p_slug AS parent_category
FROM unbound_table_6 AS t0
JOIN unbound_table_7 AS t1
  ON t0.category_id = t1.c_id
JOIN unbound_table_8 AS t2
  ON t1.parent_id = t2.p_id
judahrand commented 1 year ago

Hey @judahrand -- there may very well still be a bug in the compiler, but there are a few things to do here to clean up your query for Ibis.

Referring to columns in a previously-chained join by their parent table is not well defined (we're planning to remove the ability to do this at all).

expr = subcategories.join(categories, subcategories.category_id == categories.id).select(subcategories.category_id)

can cause problems. For this case, you'll want to use deferred (https://ibis-project.org/api/expressions/top_level/?h=deferred) to refer to the columns of previously chained operations, e.g.

from ibis import _

expr = subcategories.join(categories, subcategories.category_id == categories.id).select(_.category_id)

Why is it not well defined? In SQL I could easily write the query referring to parent tables. It is only the fact that Ibis is inserting an unneeded subquery which renames columns which makes this undefined and nonsensical.

Also, when tables have colliding column names, the automatically get suffixed with (by default) _x and _y (for the left and right tables, respectively). So when you refer to subcategories.id in the joined tables, that doesn't make a whole lot of sense to ibis, since there's instead going to be a column called (maybe) id_x instead.

If I rename the columns in your example query to have no name collisions (just to avoid a bunch of intermediate renaming) and use the Deferred _ instead, I get this:

[ins] In [19]: subcategories = ibis.table(
          ...:     [("s_id", "int64"), ("s_slug", "string"), ("category_id", "int64")]
          ...: )
          ...: categories = ibis.table(
          ...:     [("c_id", "int64"), ("c_slug", "string"), ("parent_id", "int64")]
          ...: )
          ...: parentcategories = ibis.table([("p_id", "int64"), ("p_slug", "string")])

[ins] In [20]: expr = (
          ...:     subcategories.join(categories, subcategories.category_id == categories.c_id)
          ...:     .join(parentcategories, _.parent_id == parentcategories.p_id)
          ...:     .select(
          ...:         _.s_id,
          ...:         _.s_slug.name("subcategory"),
          ...:         _.c_slug.name("category"),
          ...:         _.p_slug.name("parent_category"),
          ...:     )
          ...: )

[ins] In [21]: ibis.show_sql(expr, dialect="duckdb")
SELECT
  t0.s_id,
  t0.s_slug AS subcategory,
  t1.c_slug AS category,
  t2.p_slug AS parent_category
FROM unbound_table_6 AS t0
JOIN unbound_table_7 AS t1
  ON t0.category_id = t1.c_id
JOIN unbound_table_8 AS t2
  ON t1.parent_id = t2.p_id

The automatic suffixing with _x and _y is unnecessary in this query though since the final select is well defined. And requiring intermediate select clauses is really unintuitive from a SQL programmer's perspective.

I think that the issue here is that the deduplication of column names shouldn't happen when it isn't needed. Are you saying it is not possible to infer when this is the case?

gforsyth commented 1 year ago

The expression is evaluated left-to-right, so when you hit the first name collision they'll be de-duplicated. It would be possible to avoid this (I think) by not materializing the columns in the join until they're needed, although we found that that behavior is highly unintuitive for users who aren't SQL programmers.

judahrand commented 1 year ago

I see. A classic case of: "You can't please everybody." 🤔

gforsyth commented 1 year ago

Yeah. We can try to mitigate things slightly by adding more caveats and notes to the Sql -> ibis tutorial.

judahrand commented 1 year ago

Honestly, I'm having real issues with the join column deduplications... It's an absolute nightmare for cases where you effectively want to append a column using LEFT JOIN.

A simple query which I feel should be able to be expressed as:

(
    base_table
    .left_join(property_table_1, 'base_table_id')
    .left_join(property_table_2, 'base_table_id')
    .left_join(property_table_3, 'base_table_id')
    .left_join(property_table_4, 'base_table_id')
    .select(
        base_table.base_table_id,
        property_table_2column_1,
        property_table_2.column_2,
        property_table_3.column_3,
        property_table_3.column_4,
        property_table_4.column_5,
        property_table_4.column_6,
    )
)

instead has to be expressed as

(
    base_table
    .left_join(property_table_1, 'base_table_id')
    .select(
        _.base_table_id_x.name('base_table_id'),
        property_table_1.column_1,
        property_table_1.column_2,
    )
    .left_join(property_table_2, 'base_table_id')
    .select(
        _.base_table_id_x.name('base_table_id'),
        _.column_1,
        _.column_2,
        property_table2.column_3,
        property_table_2.column_4,
    )
    .left_join(property_table_3, 'base_table_id')
    .select(
        _.base_table_id_x.name('base_table_id'),
        _.column_1,
        _.column_2,
        _.column_3,
        _.column_4,
       property_table_3.column_5,
    )
    .left_join(property_table_4, 'base_table_id')
    .select(
        _.base_table_id_x.name('base_table_id'),
        _.column_1,
        _.column_2,
        _.column_3,
        _.column_4,
        _.column_5,
        property_table_4.column_6,
    )
)

which feels a little absurd.

jcrist commented 1 year ago

I can see how that would be annoying. For your particular case, I think the following should work:

(
    base_table
    .left_join(property_table_2.select("base_table_id", "column_1", "column_2"), "base_table_id")
    .left_join(property_table_3.select("base_table_id", "column_3", "column_4"), 'base_table_id')
    .left_join(property_table_4.select("base_table_id", "column_5", "column_6"), 'base_table_id')
)

By pre-selecting only the columns you want before the join, there are no conflicts except the join columns, so no de-duplication happens.

judahrand commented 1 year ago

Pretty sure it still won't work because deduplication is only skipped for inner joins.

judahrand commented 1 year ago

Generally, my feeling is that chained joins are pretty unintuitively handled at the moment especially with any duplicate columns present. Sometimes duplicates are removed and sometimes they aren't even when one might expect them to be (like with the left join example I gave). There is are no useful error messages to let a user know that they're doing things wrong (ie. referencing source tables when chaining joins which works sometimes but other times results in quietly broken SQL; referencing a column which you didn't know was a duplicate without a suffix).

This is making it really difficult for me to find Ibis useful 😢 I really really want to love it as I'm a massive fan of the ideas of:

judahrand commented 1 year ago

Honestly, I'd really prefer Ibis to behave like SQL and not try to guess at what I want. If there are duplicate columns which can't be logically resolved then error out at query time and tell the user they need to resolve the duplicates. Making magic _x and _y columns appear is what Pandas does but it isn't very helpful in my experience.

jcrist commented 1 year ago

Pretty sure it still will because deduplication is only skipped for inner joins.

Ah, yeah, that's true. We can only avoid deduplication for inner joins since other joins would need a table qualifier to know which table you're referring to (they're only guaranteed to be the same for an inner join).

Here's another option that does work (assuming you want to keep joining on the base table's id column). This only renames duplicate columns in the right table, so the left table keeps the original name:

In [1]: import ibis                                                                                                                                                                          

In [2]: t1 = ibis.table({"id": "int"})                                                                                                                                                       

In [3]: t2 = ibis.table({"id": "int", "col1": "int", "col2": "int"})                                                                                                                         

In [4]: t3 = ibis.table({"id": "int", "col3": "int", "col4": "int"})                                                                                                                         

In [5]: t4 = ibis.table({"id": "int", "col5": "int", "col6": "int"})                                                                                                                         

In [6]: q = (                                                                                                                                                                                
   ...:     t1                                                                                                                                                                               
   ...:     .left_join(t2, "id", suffixes=("", "_2"))
   ...:     .left_join(t3, "id", suffixes=("", "_3"))
   ...:     .left_join(t4, "id", suffixes=("", "_4"))
   ...: ).select("id", "col1", "col2", "col3", "col4", "col5", "col6")

Sometimes duplicates are removed and sometimes they aren't even when one might expect them to be (like with the left join example I gave).

We can only avoid deduplicating columns for inner joins, since otherwise the columns may differ and the expression is ambiguous.

There is are no useful error messages to let a user know that they're doing things wrong (ie. referencing source tables when chaining joins which works sometimes but other times results in quietly broken SQL; referencing a column which you didn't know was a duplicate without a suffix).

We definitely want to error nicely and point the user towards correct patterns. Chained joins are a rough edge right now as you've found.

judahrand commented 1 year ago

We definitely want to error nicely and point the user towards correct patterns. Chained joins are a rough edge right now as you've found.

I think joins in general are the rough edge at the moment. The duplicate columns problem also results in unhelpful error messages.

Here's another option that does work (assuming you want to keep joining on the base table's id column). This only renames duplicate columns in the right table, so the left table keeps the original name:

You're right, if I want to keep trying with Ibis this is probably the best option for now.

jcrist commented 1 year ago

I've opened #5537 to track discussions for how we can improve our ergonomics here while also keeping with our dataframe-like computational model.

sergei3000 commented 1 year ago

Honestly, I'd really prefer Ibis to behave like SQL and not try to guess at what I want. If there are duplicate columns which can't be logically resolved then error out at query time and tell the user they need to resolve the duplicates. Making magic _x and _y columns appear is what Pandas does but it isn't very helpful in my experience.

+1 on this. I think making ibis behave more like SQL than like pandas would make this project useful to a much wider audience.

gforsyth commented 1 year ago

Thanks for the input, @sergei3000 ! I'm going to go ahead and close this out in favor of the discussions in #5537