ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug: IntegrityError when chaining calls to `outer_join` #10293

Open drin opened 3 weeks ago

drin commented 3 weeks ago

What happened?

When executing the following function chain (pseudocode):

# Pseudocode: "..." marks elided arguments and tables.
base_aggrs = [
    ibis.memtable(py_table).group_by(...).aggregate(...)
    for py_table in (py_table0, ..., py_table3)
]

query = (
    base_aggrs[0]
    .outer_join(base_aggrs[1], ...).select(...)
    .outer_join(base_aggrs[2], ...).select(...)
)

The error I get is:

*** ibis.common.exceptions.IntegrityError: Cannot add <ibis.expr.operations.logical.Equals object at 0x10dd21650> to projection, they belong to another relation

The first outer_join results in an ibis.Table (<class 'ibis.expr.types.relations.Table'>), and I would expect the chain to continually produce an ibis.Table.

Note that I am actually doing this in a loop as seen here (though the code I'm executing has been updated to use just ibis.table instead of a pandas connection).

What version of ibis are you using?

Python 3.12.7 (main, Oct 1 2024, 02:05:46) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin
ibis-framework[duckdb]==9.5.0

>>> import ibis
>>> ibis.__version__
'9.5.0'

What backend(s) are you using, if any?

DuckDB

Relevant log output (pdb excerpts)

I am able to successfully run one iteration of outer_join as follows:

Convenience function:

import ibis

def AggregateJoin(left_table, right_table):
    # Outer-join the two per-gene aggregates on gene_id.
    joined = left_table.outer_join(
        right_table, left_table.gene_id == right_table.gene_id
    )
    # Merge the join keys and add up the metrics from both sides.
    return joined.select(
        ibis.coalesce(left_table.gene_id, right_table.gene_id).name('gene_id'),
        (left_table.cell_count + right_table.cell_count).name('cell_count'),
        (left_table.expr_total + right_table.expr_total).name('expr_total'),
    )

Debugging (left_table):

>>> left_table
r0 := InMemoryTable
  data:
    PyArrowTableProxy:
      pyarrow.Table
      gene_id: string
      cell_id: string
      expression: float
      ----
      gene_id: [["ENSG00000004455","ENSG00000004059","ENSG00000003756","ENSG00000003436","ENSG00000003402"]]
      cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
      expression: [[1,397,1,6,67.35789]]

Aggregate[r0]
  groups:
    gene_id: r0.gene_id
  metrics:
    cell_count: CountStar(r0)
    expr_total: Sum(r0.expression)

Debugging (right_table):

>>> right_table
r0 := InMemoryTable
  data:
    PyArrowTableProxy:
      pyarrow.Table
      gene_id: string
      cell_id: string
      expression: float
      ----
      gene_id: [["ENSG00000003147","ENSG00000002746","ENSG00000002586","ENSG00000001460","ENSG00000000457"]]
      cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
      expression: [[489,3,1755,1,4.282037]]

Aggregate[r0]
  groups:
    gene_id: r0.gene_id
  metrics:
    cell_count: CountStar(r0)
    expr_total: Sum(r0.expression)

Debugging (result):

>>> query = AggregateJoin(t1, t2)
>>> query
r0 := InMemoryTable
  data:
    PyArrowTableProxy:
      pyarrow.Table
      gene_id: string
      cell_id: string
      expression: float
      ----
      gene_id: [["ENSG00000004455","ENSG00000004059","ENSG00000003756","ENSG00000003436","ENSG00000003402"]]
      cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
      expression: [[1,397,1,6,67.35789]]

r1 := InMemoryTable
  data:
    PyArrowTableProxy:
      pyarrow.Table
      gene_id: string
      cell_id: string
      expression: float
      ----
      gene_id: [["ENSG00000003147","ENSG00000002746","ENSG00000002586","ENSG00000001460","ENSG00000000457"]]
      cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
      expression: [[489,3,1755,1,4.282037]]

r2 := Aggregate[r0]
  groups:
    gene_id: r0.gene_id
  metrics:
    cell_count: CountStar(r0)
    expr_total: Sum(r0.expression)

r3 := Aggregate[r1]
  groups:
    gene_id: r1.gene_id
  metrics:
    cell_count: CountStar(r1)
    expr_total: Sum(r1.expression)

JoinChain[r2]
  JoinLink[outer, r3]
    r2.gene_id == r3.gene_id
  values:
    gene_id:    Coalesce([r2.gene_id, r3.gene_id])
    cell_count: r2.cell_count + r3.cell_count
    expr_total: r2.expr_total + r3.expr_total

Then, a second iteration throws the error, as follows.

Debugging (t3):

>>> t3
r0 := InMemoryTable
  data:
    PyArrowTableProxy:
      pyarrow.Table
      gene_id: string
      cell_id: string
      expression: float
      ----
      gene_id: [["ENSG00000285991","ENSG00000285920","ENSG00000285733","ENSG00000285721","ENSG00000285629"]]
      cell_id: [["SRR5765852","SRR5765852","SRR5765852","SRR5765852","SRR5765852"]]
      expression: [[1,2,1.0522599,1.3091263,1.347101]]

Aggregate[r0]
  groups:
    gene_id: r0.gene_id
  metrics:
    cell_count: CountStar(r0)
    expr_total: Sum(r0.expression)

The actual error:

>>> AggregateJoin(query, t3)
*** ibis.common.exceptions.IntegrityError: Cannot add <ibis.expr.operations.logical.Equals object at 0x109f00bd0> to projection, they belong to another relation


drin commented 3 weeks ago

I'm not sure if this is actually a bug or if something I used to do is no longer valid.

Note that this code worked prior to the large ibis rewrite. Since then I have updated to the latest version of ibis and dropped the use of ibis_conn = ibis.pandas.connect({}). Instead of getting a pyarrow table via ibis_conn.table(), I now use ibis.memtable(<pyarrow.Table>, name='some_name').

If any other context is needed on this, please let me know!

cpcloud commented 2 days ago

Can you please make the reproducer copypastable?

drin commented 2 days ago

sure, I can do that by end of day tomorrow.