ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.33k stars 599 forks source link

bug: Column and Table comparison yields boolean output #10136

Open tokoko opened 2 months ago

tokoko commented 2 months ago

What happened?

Comparing Table and Column objects escapes ibis operator overrides and instead yields a boolean output. This is not necessarily wrong, but can lead to confusing errors when user is trying to use another table in a scalar subquery, but forgets to call .as_scalar method on the table. Instead of throwing an error, such a comparison will always be interpreted as a boolean literal false.

import ibis

con = ibis.connect("duckdb://")

con.create_table("orders", orders_pa)
con.create_table("stores", stores_pa)

print(
    ibis.to_sql(
        orders.select(
            orders["fk_store_id"] == stores.select("store_id").limit(1) # .as_scalar() call missing
        )
    )
)

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

anjakefala commented 1 week ago

/take

anjakefala commented 1 week ago

Thank you for reporting this bug @tokoko, and for the start of a repro! I am going to investigate how the ibis over-ride escapes this.

@cpcloud What would be the expected behaviour? Would it be to error?

I put together a thorough repro:

import ibis
import pyarrow as pa

orders_data = pa.Table.from_pydict({
                "order_id": [1,2,3,4],
                "fk_store_id": [1,2,1,3],
                "amount": [100,200,150,300]
                })

stores_data = pa.Table.from_pydict({
    "store_id": [1,2,3],
    "name": ["Store A", "Store B", "Store C"]
    })

con = ibis.connect("duckdb://")
orders = con.create_table("orders", orders_data)
stores = con.create_table("stores", stores_data)

print("\nCase 1: Direct comparison")
expr = orders.select(
    orders["fk_store_id"],
    comparison_result=(orders["fk_store_id"] == stores.select("store_id").limit(1))
)
print("\nSQL generated:")
print(ibis.to_sql(expr))
print("\nResult:")
print(expr.execute())

print("\nCase 2: Correct usage with as_scalar()")
expr_correct = orders.select(
        orders["fk_store_id"],
        comparison_result=(orders["fk_store_id"] == stores.select("store_id").limit(1).as_scalar())
)
print("\nSQL generated:")
print(ibis.to_sql(expr_correct))
print("\nResult:")
print(expr_correct.execute())

# Case 3: What happens with different comparison types?
print("\nCase 3: Different comparison types")
# Column to literal
print("Column to literal:", type(orders["fk_store_id"] == 1))
# Column to Column
print("Column to Column:", type(orders["fk_store_id"] == orders["order_id"]))
# Column to Table
print("Column to Table:", type(orders["fk_store_id"] == stores.select("store_id").limit(1)))
# Column to Scalar
print("Column to Scalar:", type(orders["fk_store_id"] == stores.select("store_id").limit(1).as_scalar()))

This is the output:

Case 1: Direct comparison

SQL generated:
SELECT
  "t0"."fk_store_id",
  FALSE AS "comparison_result"
FROM "memory"."main"."orders" AS "t0"

Result:
   fk_store_id  comparison_result
0            1              False
1            2              False
2            1              False
3            3              False

Case 2: Correct usage with as_scalar()

SQL generated:
SELECT
  "t0"."fk_store_id",
  "t0"."fk_store_id" = (
    SELECT
      "t1"."store_id"
    FROM "memory"."main"."stores" AS "t1"
    LIMIT 1
  ) AS "comparison_result"
FROM "memory"."main"."orders" AS "t0"

Result:
   fk_store_id  comparison_result
0            1               True
1            2              False
2            1               True
3            3              False

Case 3: Different comparison types
Column to literal: <class 'ibis.expr.types.logical.BooleanColumn'>
Column to Column: <class 'ibis.expr.types.logical.BooleanColumn'>
Column to Table: <class 'bool'>
Column to Scalar: <class 'ibis.expr.types.logical.BooleanColumn'>

It seems only Columns compared with Tables is returning the boolean.

anjakefala commented 1 week ago

It seems that in _binop, op_class is running into a SignatureValidationError which results to falling back to the default Python ==.

-> def _binop(op_class: type[ops.Binary], left: ir.Value, right: ir.Value) -> ir.Value:                                                                                                    
 > /home/anja/git/ibis/ibis/expr/types/core.py(803)_binop()                                                                                                                                 
 -> try:                                                                                                                                                                                    
 > /home/anja/git/ibis/ibis/expr/types/core.py(804)_binop()                                                                                                                                 
 -> node = op_class(left, right)                                                                                                                                                            
ibis.common.annotations.SignatureValidationError: Equals(r0 := DatabaseTable: memory.main.orders                                                                                           
   order_id    int64                                                                                                                                                                        
   fk_store_id int64                                                                                                                                                                        
   amount      int64                                                                                                                                                                        

 fk_store_id: r0.fk_store_id, r0 := DatabaseTable: memory.main.stores                                                                                                                       
   store_id int64                                                                                                                                                                           
  name     string                                                                                                                                                                          

 r1 := Project[r0]                                                                                                                                                                          
   store_id: r0.store_id                                                                                                                                                                    

Limit[r1, n=1]) has failed due to the following errors:                                                                                                                                    
`right`: r0 := DatabaseTable: memory.main.stores                                                                                                                                         
 store_id int64                                                                                                                                                                           
  name     string                                                                                                                                                                          

 r1 := Project[r0]                                                                                                                                                                          
   store_id: r0.store_id                                                                                                                                                                    

 Limit[r1, n=1] is not coercible to a Value                                                                                                                                                 

 Expected signature: Equals(left: Value, right: Value)                                                                                                                                      
anjakefala commented 1 week ago

My interpreation is that Ibis tried to create an Equals operation. The Equals operation required both operands to be Value types. It can't automatically coerce a Table to a Value. So Python's == is short-circuiting the comparison, and we lose the informative error message. I'll try to understand where the short circuiting is happening.

Edit Probably because of:

> /home/anja/git/ibis/ibis/expr/types/core.py(805)_binop()                                                                                                                                      
-> except (ValidationError, NotImplementedError):                                                                                                                                               
> /home/anja/git/ibis/ibis/expr/types/core.py(806)_binop()                                                                                                                                      
-> return NotImplemented      

And by default when there's a NotImplemented, it defaults to Python's ==.

anjakefala commented 1 week ago

@cpcloud Should we be raising ValidationError instead of converting it into a NotImplemented?

cpcloud commented 6 days ago

I think for some cases NotImplemented might be necessary, but I would try raising the error in the __eq__ and __ne__ cases only and see if anything breaks.