ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.33k stars 599 forks source link

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

Open NickCrews opened 2 weeks ago

NickCrews commented 2 weeks ago

What happened?

import ibis

t1 = ibis.table({"k": "!int32"})
t2 = ibis.table({"k": "int32"})
j = t1.anti_join(t2, "k")
j.schema()
# ibis.Schema {
#   k  int32
# }

I would expect this to give !int32, the exact same schema as t1. If I do the equivalent, t1.filter(t1.k.notin(t2.k)), then I get !int32 as expected.

I assume this is due to assumptions that after joins, any non-null columns can now be "full of holes" and thus nullable. This assumption is correct for the other other join types but not semi and anti joins.

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

cpcloud commented 2 weeks ago

This isn't a bug, it's a particular choice: lower maintenance overhead (because we don't have to special case any particular join types) at the cost of lower type-specificity.

Why should we write additional code to make the type more precise? What is lost and what is gained by doing that?

NickCrews commented 2 weeks ago

I wouldn't say nullable is lower-specificity that non-nullable, I would say they are siblings of each other, and often incompatible. eg when doing a ibis.union(), we can't combine a non-nullable and nullable column together. Consider

nullable = ibis.table(schema={"x": "int32"})
non = ibis.table(schema={"x": "!int32"})

If I could use nullable everywhere that I could use non, then I would agree that the current behavior is simply "less precise". But since I cannot do this, I find the current behavior more of a problem.

I totally see how it is going to be more maintenance overhead though. It may not be worth it, especially when the workaround is so easy.

Maybe we just close this as not planned for now, and if someone wants to take the initiative and fix it, then we can see how much complexity their PR adds, and make a decision then?

Just so you see, here is where I ran into the issue:

def insert(
    db: SQLBackend,
    table_name: str,
    obj: pd.DataFrame | ir.Table | list | dict,
    *,
    database: str | None = None,
    overwrite: bool = False,
    return_inserted: bool = False,
) -> None:
    """db.insert() with extra features.

    - If new table is missing columns, add them with NULL values.
    - Ensure correct column order to work around ibis issue #9249.
    """
    if not isinstance(obj, ir.Table):
        obj = ibis.memtable(obj)
    existing_schema = db.table(table_name).schema()
    obj = cast(obj, existing_schema)
    # ensure correct order, per https://github.com/ibis-project/ibis/issues/9249
    obj = obj.select(*existing_schema)

    def do():
        db.insert(table_name, obj, database=database, overwrite=overwrite)

    if not return_inserted:
        do()
        return None
    else:
        # TODO: ideally this would use RETURNING sql syntax,
        # but I don't want to reach into ibis guts to do that.
        pk = _get_primary_key(db, table_name)
        old = db.table(table_name).select(pk).view().cache()
        do()
        t = db.table(table_name)
        # avoid anti join: https://github.com/ibis-project/ibis/issues/10416
        return t.filter(t[pk].notin(old[pk]))

# tests:

import ibis
import pytest

from noatak import db

@pytest.fixture
def con():
    return ibis.duckdb.connect()

@pytest.fixture
def existing_table(con):
    con.raw_sql("CREATE TABLE test (id INT PRIMARY KEY, value INT)")
    con.insert("test", {"id": [1, 2], "value": [11, 12]})
    t = con.table("test")
    assert t.schema() == ibis.Schema({"id": "!int32", "value": "int32"})

def test_insert_returns(existing_table, con):
    to_insert = ibis.memtable({"value": [23, 24], "id": [3, 4]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [23, 24]

    # Do it again to make sure we aren't using some stale value
    to_insert = ibis.memtable({"value": [25, 26], "id": [5, 6]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [25, 26]