Open NickCrews opened 2 weeks ago
This isn't a bug, it's a particular choice: lower maintenance overhead (because we don't have to special case any particular join types) at the cost of lower type-specificity.
Why should we write additional code to make the type more precise? What is lost and what is gained by doing that?
I wouldn't say nullable is lower-specificity that non-nullable, I would say they are siblings of each other, and often incompatible. eg when doing a ibis.union(), we can't combine a non-nullable and nullable column together. Consider
nullable = ibis.table(schema={"x": "int32"})
non = ibis.table(schema={"x": "!int32"})
If I could use nullable
everywhere that I could use non
, then I would agree that the current behavior is simply "less precise". But since I cannot do this, I find the current behavior more of a problem.
I totally see how it is going to be more maintenance overhead though. It may not be worth it, especially when the workaround is so easy.
Maybe we just close this as not planned for now, and if someone wants to take the initiative and fix it, then we can see how much complexity their PR adds, and make a decision then?
Just so you see, here is where I ran into the issue:
def insert(
db: SQLBackend,
table_name: str,
obj: pd.DataFrame | ir.Table | list | dict,
*,
database: str | None = None,
overwrite: bool = False,
return_inserted: bool = False,
) -> None:
"""db.insert() with extra features.
- If new table is missing columns, add them with NULL values.
- Ensure correct column order to work around ibis issue #9249.
"""
if not isinstance(obj, ir.Table):
obj = ibis.memtable(obj)
existing_schema = db.table(table_name).schema()
obj = cast(obj, existing_schema)
# ensure correct order, per https://github.com/ibis-project/ibis/issues/9249
obj = obj.select(*existing_schema)
def do():
db.insert(table_name, obj, database=database, overwrite=overwrite)
if not return_inserted:
do()
return None
else:
# TODO: ideally this would use RETURNING sql syntax,
# but I don't want to reach into ibis guts to do that.
pk = _get_primary_key(db, table_name)
old = db.table(table_name).select(pk).view().cache()
do()
t = db.table(table_name)
# avoid anti join: https://github.com/ibis-project/ibis/issues/10416
return t.filter(t[pk].notin(old[pk]))
# tests:
import ibis
import pytest
from noatak import db
@pytest.fixture
def con():
return ibis.duckdb.connect()
@pytest.fixture
def existing_table(con):
con.raw_sql("CREATE TABLE test (id INT PRIMARY KEY, value INT)")
con.insert("test", {"id": [1, 2], "value": [11, 12]})
t = con.table("test")
assert t.schema() == ibis.Schema({"id": "!int32", "value": "int32"})
def test_insert_returns(existing_table, con):
to_insert = ibis.memtable({"value": [23, 24], "id": [3, 4]})
inserted = db.insert(con, "test", to_insert, return_inserted=True)
assert inserted.schema() == con.table("test").schema()
result = sorted(inserted.value.execute())
assert result == [23, 24]
# Do it again to make sure we aren't using some stale value
to_insert = ibis.memtable({"value": [25, 26], "id": [5, 6]})
inserted = db.insert(con, "test", to_insert, return_inserted=True)
assert inserted.schema() == con.table("test").schema()
result = sorted(inserted.value.execute())
assert result == [25, 26]
What happened?
I would expect this to give
!int32
, the exact same schema as t1. If I do the equivalent,t1.filter(t1.k.notin(t2.k))
, then I get!int32
as expected.I assume this is due to assumptions that after joins, any non-null columns can now be "full of holes" and thus nullable. This assumption is correct for the other other join types but not semi and anti joins.
What version of ibis are you using?
main
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct