Closed judahrand closed 1 year ago
Ibis doesn't always generate the most simplistic SQL, because we use SQL ORM under-hood, but you can use SQLGlot to optimize your SQL query. Your SQLEngine should also generate an identical query plan for both queries
import sqlglot
from sqlglot.optimizer import optimize
import ibis
lvl1 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl2_key', 'int64')])
lvl2 = ibis.table([('id', 'int64'), ('slug', 'string'), ('lvl3_key', 'int64')])
lvl3 = ibis.table([('id', 'int64'), ('slug', 'string')])
expr = (
lvl1
.join(lvl2, lvl2.id == lvl1.lvl2_key)
.join(lvl3, lvl3.id == lvl2.lvl3_key)
.select(
lvl1.id.name('lvl1_id'),
lvl1.slug.name('lvl1_name'),
lvl2.slug.name('lvl2_name'),
lvl3.slug.name('lvl3_name'),
)
)
sql = str(ibis.to_sql(expr))
print("SQL")
print(sql)
print()
sqlglot_schema = {
lvl1.get_name(): {"id": "INT", "slug": "STRING", "lvl2_key": "INT"},
lvl2.get_name(): {"id": "INT", "slug": "STRING", "lvl3_key": "INT"},
lvl3.get_name(): {"id": "INT", "slug": "STRING"},
}
print("SQLGlot Schema")
from pprint import pprint
pprint(sqlglot_schema)
print()
print("Optimized SQL")
print(
optimize(
sqlglot.parse_one(sql),
schema=sqlglot_schema
).sql(pretty=True)
)
SQL
SELECT
t3.id AS lvl1_id,
t3.slug AS lvl1_name,
t2.slug AS lvl2_name,
t1.slug AS lvl3_name
FROM unbound_table_0 AS t3, unbound_table_1 AS t2, (
SELECT
t2.id AS id_x,
t2.slug AS slug_x,
t2.lvl2_key AS lvl2_key,
t3.id AS id_y,
t3.slug AS slug_y,
t3.lvl3_key AS lvl3_key
FROM unbound_table_0 AS t2
JOIN unbound_table_1 AS t3
ON t3.id = t2.lvl2_key
) AS t0
JOIN unbound_table_2 AS t1
ON t1.id = t2.lvl3_key
SQLGlot Schema
{'unbound_table_0': {'id': 'INT', 'lvl2_key': 'INT', 'slug': 'STRING'},
'unbound_table_1': {'id': 'INT', 'lvl3_key': 'INT', 'slug': 'STRING'},
'unbound_table_2': {'id': 'INT', 'slug': 'STRING'}}
Optimized SQL
SELECT
"unbound_table_0"."id" AS "lvl1_id",
"unbound_table_0"."slug" AS "lvl1_name",
"unbound_table_1"."slug" AS "lvl2_name",
"unbound_table_2"."slug" AS "lvl3_name"
FROM "unbound_table_0" AS "unbound_table_0"
CROSS JOIN "unbound_table_1" AS "unbound_table_1"
JOIN "unbound_table_2" AS "unbound_table_2"
ON "unbound_table_2"."id" = "unbound_table_1"."lvl3_key"
CROSS JOIN "unbound_table_0" AS "unbound_table_0_2"
JOIN "unbound_table_1" AS "unbound_table_1_2"
ON "unbound_table_1_2"."id" = "unbound_table_0_2"."lvl2_key"
Hey @judahrand -- we have some ongoing tasks to improve and optimize the generated SQL that Ibis produces, but the above SQL looks correct, if not optimal, so I'm going to close this out. I'll also add that with backends like BigQuery and DuckDB, they will be optimizing that SQL a LOT, so I don't imagine you'll see a performance difference between the Ibis generated SQL and the optimal SQL.
schema is now optional when calling optimize as long as there is no ambiguity
@krzysztof-kwitt and @gforsyth My original example didn't quite show the bug could you please have a look at this one? It doesn't seem to happen when using the DuckDB backend but does on BigQuery. Here is the example:
itembases = client.table('iitembase', 'dataset')
subcategories = client.table('subcategory', 'dataset')
categories = client.table('category', 'dataset')
parentcategories = client.table('parentcategory', 'dataset')
# itembases = ibis.table([('id', 'int64'), ('subcategory_id', 'int64')])
# subcategories = ibis.table([('id', 'int64'), ('slug', 'string'), ('category_id', 'int64')])
# categories = ibis.table([('id', 'int64'), ('slug', 'string'), ('parent_id', 'int64')])
# parentcategories = ibis.table([('id', 'int64'), ('slug', 'string')])
category_hierarchy = (
subcategories
.join(categories, subcategories.category_id == categories.id)
.join(parentcategories, categories.parent_id == parentcategories.id)
.select(
subcategories.id,
subcategories.slug.name('subcategory'),
categories.slug.name('category'),
parentcategories.slug.name('parent_category'),
)
)
expr = (
itembases
.join(category_hierarchy, itembases.subcategory_id == category_hierarchy.id)
.select(
itembases,
category_hierarchy.subcategory,
category_hierarchy.category,
category_hierarchy.parent_category,
)
)
ibis.show_sql(expr)
Results in this in BigQuery:
SELECT
t0.`id`,
t1.`subcategory`,
t1.`category`,
t1.`parent_category`
FROM `project_id.dataset.itembase` AS t0
INNER JOIN (
SELECT
`id`,
`slug` AS `subcategory`,
`slug` AS `category`,
t3.`slug` AS `parent_category`
FROM (
SELECT
t4.`id` AS `id_x`,
t4.`slug` AS `slug_x`,
t4.`name` AS `name_x`,
t4.`category_id`,
t5.`id` AS `id_y`,
t5.`slug` AS `slug_y`,
t5.`name` AS `name_y`,
t5.`parent_id`
FROM `project_id.dataset.subcategory` AS t4
INNER JOIN `project_id.dataset.category` AS t5
ON t4.`category_id` = t5.`id`
) AS t2
INNER JOIN `project_id.dataset.parentcategory` AS t3
ON `parent_id` = t3.`id`
) AS t1
ON t0.`subcategory_id` = t1.`id`
but this in DuckDB:
SELECT
t0.id,
t1.subcategory,
t1.category,
t1.parent_category
FROM unbound_table_4 AS t0
JOIN (
SELECT
t5.id AS id,
t5.slug AS subcategory,
t4.slug AS category,
t3.slug AS parent_category
FROM unbound_table_5 AS t5, unbound_table_6 AS t4, (
SELECT
t4.id AS id_x,
t4.slug AS slug_x,
t4.category_id AS category_id,
t5.id AS id_y,
t5.slug AS slug_y,
t5.parent_id AS parent_id
FROM unbound_table_5 AS t4
JOIN unbound_table_6 AS t5
ON t4.category_id = t5.id
) AS t2
JOIN unbound_table_7 AS t3
ON t4.parent_id = t3.id
) AS t1
ON t0.subcategory_id = t1.id
The SQL for BigQuery is incorrect. I'm not quite sure about the DuckDB SQL but it looks right.
I've narrowed this down to a bug in ibis.backends.base.sql.Compiler
which is not present in AlchemyCompiler
subcategories = ibis.table([('id', 'int64'), ('slug', 'string'), ('category_id', 'int64')])
categories = ibis.table([('id', 'int64'), ('slug', 'string'), ('parent_id', 'int64')])
parentcategories = ibis.table([('id', 'int64'), ('slug', 'string')])
# %%
expr = (
subcategories
.join(categories, subcategories.category_id == categories.id)
.join(parentcategories, categories.parent_id == parentcategories.id)
.select(
subcategories.id,
subcategories.slug.name('subcategory'),
categories.slug.name('category'),
parentcategories.slug.name('parent_category'),
)
)
print(ibis.backends.base.sql.Compiler.to_sql(expr))
Resulting in the incorrect SQL:
SELECT
`id`,
`slug` AS `subcategory`,
`slug` AS `category`,
t1.`slug` AS `parent_category`
FROM (
SELECT
t2.`id` AS `id_x`,
t2.`slug` AS `slug_x`,
t2.`category_id`,
t3.`id` AS `id_y`,
t3.`slug` AS `slug_y`,
t3.`parent_id`
FROM unbound_table_0 AS t2
INNER JOIN unbound_table_1 AS t3
ON t2.`category_id` = t3.`id`
) AS t0
INNER JOIN unbound_table_2 AS t1
ON `parent_id` = t1.`id`
You can see that the top level SELECT
is wrong since slug AS subcategory, slug AS category
will result in the same data twice. This is the worst kind of error as it will be incorrect silently!
For the same expression the AlchemyCompiler
maybe gets it right with:
SELECT
t3.id,
t3.slug AS subcategory,
t2.slug AS category,
t1.slug AS parent_category
FROM (
SELECT
t2.id AS id_x,
t2.slug AS slug_x,
t2.category_id AS category_id,
t3.id AS id_y,
t3.slug AS slug_y,
t3.parent_id AS parent_id
FROM unbound_table_0 AS t2
JOIN unbound_table_1 AS t3
ON t2.category_id = t3.id
) AS t0, unbound_table_0 AS t3, unbound_table_1 AS t2
JOIN unbound_table_2 AS t1
ON t2.parent_id = t1.id
but to be honest that SQL also looks quite odd to me given that t0
just isn't used.
Having tested the AlchemyCompiler
SQL it is also incorrect!
@gforsyth Is there something wrong with how I am expressing the joins? It feels to me like this is a bug with a combination of 'multiple joins' with 'overlapping columns'.
Hey @judahrand -- there may very well still be a bug in the compiler, but there are a few things to do here to clean up your query for Ibis.
Referring to columns in a previously-chained join by their parent table is not well defined (we're planning to remove the ability to do this at all).
expr = subcategories.join(categories, subcategories.category_id == categories.id).select(subcategories.category_id)
can cause problems. For this case, you'll want to use deferred
(https://ibis-project.org/api/expressions/top_level/?h=deferred) to refer to the columns of previously chained operations, e.g.
from ibis import _
expr = subcategories.join(categories, subcategories.category_id == categories.id).select(_.category_id)
Also, when tables have colliding column names, the automatically get suffixed with (by default) _x
and _y
(for the left and right tables, respectively). So when you refer to subcategories.id
in the joined tables, that doesn't make a whole lot of sense to ibis
, since there's instead going to be a column called (maybe) id_x
instead.
If I rename the columns in your example query to have no name collisions (just to avoid a bunch of intermediate renaming) and use the Deferred _
instead, I get this:
[ins] In [19]: subcategories = ibis.table(
...: [("s_id", "int64"), ("s_slug", "string"), ("category_id", "int64")]
...: )
...: categories = ibis.table(
...: [("c_id", "int64"), ("c_slug", "string"), ("parent_id", "int64")]
...: )
...: parentcategories = ibis.table([("p_id", "int64"), ("p_slug", "string")])
[ins] In [20]: expr = (
...: subcategories.join(categories, subcategories.category_id == categories.c_id)
...: .join(parentcategories, _.parent_id == parentcategories.p_id)
...: .select(
...: _.s_id,
...: _.s_slug.name("subcategory"),
...: _.c_slug.name("category"),
...: _.p_slug.name("parent_category"),
...: )
...: )
[ins] In [21]: ibis.show_sql(expr, dialect="duckdb")
SELECT
t0.s_id,
t0.s_slug AS subcategory,
t1.c_slug AS category,
t2.p_slug AS parent_category
FROM unbound_table_6 AS t0
JOIN unbound_table_7 AS t1
ON t0.category_id = t1.c_id
JOIN unbound_table_8 AS t2
ON t1.parent_id = t2.p_id
Hey @judahrand -- there may very well still be a bug in the compiler, but there are a few things to do here to clean up your query for Ibis.
Referring to columns in a previously-chained join by their parent table is not well defined (we're planning to remove the ability to do this at all).
expr = subcategories.join(categories, subcategories.category_id == categories.id).select(subcategories.category_id)
can cause problems. For this case, you'll want to use
deferred
(https://ibis-project.org/api/expressions/top_level/?h=deferred) to refer to the columns of previously chained operations, e.g.from ibis import _ expr = subcategories.join(categories, subcategories.category_id == categories.id).select(_.category_id)
Why is it not well defined? In SQL I could easily write the query referring to parent tables. It is only the fact that Ibis is inserting an unneeded subquery which renames columns which makes this undefined and nonsensical.
Also, when tables have colliding column names, the automatically get suffixed with (by default)
_x
and_y
(for the left and right tables, respectively). So when you refer tosubcategories.id
in the joined tables, that doesn't make a whole lot of sense toibis
, since there's instead going to be a column called (maybe)id_x
instead.If I rename the columns in your example query to have no name collisions (just to avoid a bunch of intermediate renaming) and use the Deferred
_
instead, I get this:[ins] In [19]: subcategories = ibis.table( ...: [("s_id", "int64"), ("s_slug", "string"), ("category_id", "int64")] ...: ) ...: categories = ibis.table( ...: [("c_id", "int64"), ("c_slug", "string"), ("parent_id", "int64")] ...: ) ...: parentcategories = ibis.table([("p_id", "int64"), ("p_slug", "string")]) [ins] In [20]: expr = ( ...: subcategories.join(categories, subcategories.category_id == categories.c_id) ...: .join(parentcategories, _.parent_id == parentcategories.p_id) ...: .select( ...: _.s_id, ...: _.s_slug.name("subcategory"), ...: _.c_slug.name("category"), ...: _.p_slug.name("parent_category"), ...: ) ...: ) [ins] In [21]: ibis.show_sql(expr, dialect="duckdb") SELECT t0.s_id, t0.s_slug AS subcategory, t1.c_slug AS category, t2.p_slug AS parent_category FROM unbound_table_6 AS t0 JOIN unbound_table_7 AS t1 ON t0.category_id = t1.c_id JOIN unbound_table_8 AS t2 ON t1.parent_id = t2.p_id
The automatic suffixing with _x
and _y
is unnecessary in this query though since the final select is well defined. And requiring intermediate select
clauses is really unintuitive from a SQL programmer's perspective.
I think that the issue here is that the deduplication of column names shouldn't happen when it isn't needed. Are you saying it is not possible to infer when this is the case?
The expression is evaluated left-to-right, so when you hit the first name collision they'll be de-duplicated. It would be possible to avoid this (I think) by not materializing the columns in the join until they're needed, although we found that that behavior is highly unintuitive for users who aren't SQL programmers.
I see. A classic case of: "You can't please everybody." 🤔
Yeah. We can try to mitigate things slightly by adding more caveats and notes to the Sql -> ibis tutorial.
Honestly, I'm having real issues with the join column deduplications... It's an absolute nightmare for cases where you effectively want to append a column using LEFT JOIN
.
A simple query which I feel should be able to be expressed as:
(
base_table
.left_join(property_table_1, 'base_table_id')
.left_join(property_table_2, 'base_table_id')
.left_join(property_table_3, 'base_table_id')
.left_join(property_table_4, 'base_table_id')
.select(
base_table.base_table_id,
property_table_2column_1,
property_table_2.column_2,
property_table_3.column_3,
property_table_3.column_4,
property_table_4.column_5,
property_table_4.column_6,
)
)
instead has to be expressed as
(
base_table
.left_join(property_table_1, 'base_table_id')
.select(
_.base_table_id_x.name('base_table_id'),
property_table_1.column_1,
property_table_1.column_2,
)
.left_join(property_table_2, 'base_table_id')
.select(
_.base_table_id_x.name('base_table_id'),
_.column_1,
_.column_2,
property_table2.column_3,
property_table_2.column_4,
)
.left_join(property_table_3, 'base_table_id')
.select(
_.base_table_id_x.name('base_table_id'),
_.column_1,
_.column_2,
_.column_3,
_.column_4,
property_table_3.column_5,
)
.left_join(property_table_4, 'base_table_id')
.select(
_.base_table_id_x.name('base_table_id'),
_.column_1,
_.column_2,
_.column_3,
_.column_4,
_.column_5,
property_table_4.column_6,
)
)
which feels a little absurd.
I can see how that would be annoying. For your particular case, I think the following should work:
(
base_table
.left_join(property_table_2.select("base_table_id", "column_1", "column_2"), "base_table_id")
.left_join(property_table_3.select("base_table_id", "column_3", "column_4"), 'base_table_id')
.left_join(property_table_4.select("base_table_id", "column_5", "column_6"), 'base_table_id')
)
By pre-selecting only the columns you want before the join, there are no conflicts except the join columns, so no de-duplication happens.
Pretty sure it still won't work because deduplication is only skipped for inner joins.
Generally, my feeling is that chained joins are pretty unintuitively handled at the moment especially with any duplicate columns present. Sometimes duplicates are removed and sometimes they aren't even when one might expect them to be (like with the left join example I gave). There is are no useful error messages to let a user know that they're doing things wrong (ie. referencing source tables when chaining joins which works sometimes but other times results in quietly broken SQL; referencing a column which you didn't know was a duplicate without a suffix).
This is making it really difficult for me to find Ibis useful 😢 I really really want to love it as I'm a massive fan of the ideas of:
Honestly, I'd really prefer Ibis to behave like SQL and not try to guess at what I want. If there are duplicate columns which can't be logically resolved then error out at query time and tell the user they need to resolve the duplicates. Making magic _x
and _y
columns appear is what Pandas does but it isn't very helpful in my experience.
Pretty sure it still will because deduplication is only skipped for inner joins.
Ah, yeah, that's true. We can only avoid deduplication for inner joins since other joins would need a table qualifier to know which table you're referring to (they're only guaranteed to be the same for an inner join).
Here's another option that does work (assuming you want to keep joining on the base table's id
column). This only renames duplicate columns in the right table, so the left table keeps the original name:
In [1]: import ibis
In [2]: t1 = ibis.table({"id": "int"})
In [3]: t2 = ibis.table({"id": "int", "col1": "int", "col2": "int"})
In [4]: t3 = ibis.table({"id": "int", "col3": "int", "col4": "int"})
In [5]: t4 = ibis.table({"id": "int", "col5": "int", "col6": "int"})
In [6]: q = (
...: t1
...: .left_join(t2, "id", suffixes=("", "_2"))
...: .left_join(t3, "id", suffixes=("", "_3"))
...: .left_join(t4, "id", suffixes=("", "_4"))
...: ).select("id", "col1", "col2", "col3", "col4", "col5", "col6")
Sometimes duplicates are removed and sometimes they aren't even when one might expect them to be (like with the left join example I gave).
We can only avoid deduplicating columns for inner joins, since otherwise the columns may differ and the expression is ambiguous.
There is are no useful error messages to let a user know that they're doing things wrong (ie. referencing source tables when chaining joins which works sometimes but other times results in quietly broken SQL; referencing a column which you didn't know was a duplicate without a suffix).
We definitely want to error nicely and point the user towards correct patterns. Chained joins are a rough edge right now as you've found.
We definitely want to error nicely and point the user towards correct patterns. Chained joins are a rough edge right now as you've found.
I think joins in general are the rough edge at the moment. The duplicate columns problem also results in unhelpful error messages.
Here's another option that does work (assuming you want to keep joining on the base table's id column). This only renames duplicate columns in the right table, so the left table keeps the original name:
You're right, if I want to keep trying with Ibis this is probably the best option for now.
I've opened #5537 to track discussions for how we can improve our ergonomics here while also keeping with our dataframe-like computational model.
Honestly, I'd really prefer Ibis to behave like SQL and not try to guess at what I want. If there are duplicate columns which can't be logically resolved then error out at query time and tell the user they need to resolve the duplicates. Making magic _x and _y columns appear is what Pandas does but it isn't very helpful in my experience.
+1 on this. I think making ibis behave more like SQL than like pandas would make this project useful to a much wider audience.
Thanks for the input, @sergei3000 ! I'm going to go ahead and close this out in favor of the discussions in #5537
What happened?
I run this code snippet:
which generates the following SQL:
when I would expect:
The way the SQL is currently written causes problems when the expression is reused in subsequent joins.
What version of ibis are you using?
4.1.0
What backend(s) are you using, if any?
DuckDB, BigQuery
Relevant log output
No response
Code of Conduct