ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

perf: recursion error when summing a large number of columns #9987

Open MarcoGorelli opened 2 months ago

MarcoGorelli commented 2 months ago

What happened?

I ran this:

In [1]: import polars as pl; import ibis; import numpy as np

In [2]: rng = np.random.default_rng(1)

In [3]: arr = rng.integers(-100, 100, size=(100, 100))

In [4]: df = pl.DataFrame(arr)

In [5]: t = ibis.memtable(df)

In [6]: results = %timeit -o df.select(pl.sum_horizontal('*'))
2.69 ms ± 72.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: results.best
Out[7]: 0.0025612837100015893

In [8]: from functools import reduce

In [9]: from operator import add

In [10]: results = %timeit -o t.select(SUM=reduce(add, map(t.__getitem__, t.columns))).to_polars()
60.3 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: results.best
Out[11]: 0.05732215999978507

With 255 columns (as in the financial data I was trying this out with) I get

Binder Error: Maximum recursion depth exceeded (Maximum: 128) while binding

as reported here https://ibis-project.zulipchat.com/#narrow/stream/405263-general/topic/.E2.9C.94.20sum.28axis.3D1.29

Running duckdb natively, I could just do

In [19]: result = %timeit -o duckdb.sql(f'select {"+".join(df.columns)} as sum from df')
1.8 ms ± 69.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [20]: result.best
Out[20]: 0.0016851913000000423

which is also fast (and doesn't crash on 255 columns)

Please introduce an expression in Ibis that can efficiently make use of Polars' sum_horizontal / duckdb's '+'.

What version of ibis are you using?

9.3.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

jcrist commented 2 months ago

Thanks for opening this. The recursion error is fixable (I have an idea how we might resolve it), and it's definitely worth fixing.

Please introduce an expression in Ibis which can efficiently make use of Polars' sum_horizontal

Are you asking for a new sum_horizontal method in ibis, or just for the above to be more efficient compilation-wise? If it's the former: we don't have a pattern for horizontal reductions right now, and I've never personally run into a use case for one. Can you provide a realistic motivating example where summing horizontally across a large number of columns comes up? Thanks!

MarcoGorelli commented 2 months ago

Or just for the above to be more efficient compilation-wise?

this is fine; anything that's as good as (or close to as good as) polars' sum_horizontal or a native duckdb '+' chain works, thanks 🙏

MarcoGorelli commented 2 months ago

Can you provide an example realistic motivating use case where summing horizontally across a large number of columns comes up? Thanks!

In my case it's in a dataset that has assets as rows and dates as columns (granted, one can unpivot and do the calculations vertically, as sketched below, but we can't guarantee that's how users would do it). If perf isn't worth prioritising for this operation then no bother, but it would at least be good to see the recursion error addressed 👍
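
For reference, a minimal sketch of that vertical alternative in polars; with_row_index and unpivot assume a recent polars (unpivot replaced melt in 1.0), and the "asset" label is illustrative:

import numpy as np
import polars as pl

rng = np.random.default_rng(1)
df = pl.DataFrame(rng.integers(-100, 100, size=(100, 100)))

# Tag each row (asset), move the date columns into rows, then sum
# vertically per asset: same result as sum_horizontal on the wide frame.
long = df.with_row_index("asset").unpivot(index="asset")
result = long.group_by("asset", maintain_order=True).agg(
    pl.col("value").sum().alias("SUM")
)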

cpcloud commented 2 months ago

The 128 limit is not an Ibis issue; it's a limit in DuckDB's expression binding, regardless of how Ibis parenthesizes expressions.

It's reproducible without Ibis. I'm not sure what you're testing.

Here's a copy-pastable, fully reproducible example:

import duckdb
import pandas as pd

df = pd.DataFrame({f"x{i}": [1] for i in range(128)})

sql = f"SELECT {'+'.join(df.columns)} AS sum FROM df"
duckdb.sql(sql)

Gives

Traceback (most recent call last):
  File "/home/cloud/src/github.com/ibis/test.py", line 9, in <module>
    duckdb.sql(sql)
  File "/nix/store/rfhaw95gr6cgac88vxbkyg86r77la5ax-python3-3.10.14-env/lib/python3.10/site-packages/duckdb/__init__.py", line 457, in sql
    return conn.sql(query, **kwargs)
duckdb.duckdb.BinderException: Binder Error: Maximum recursion depth exceeded (Maximum: 128) while binding "df.x0"
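
To bracket where the limit kicks in, here's a small probe in the same style; treating 128 summands as the first failing case is an assumption based on the "Maximum: 128" in the error, not something verified separately in the thread:

import duckdb
import pandas as pd

# Probe a few column counts; per the error above, the binder caps out at
# a depth of 128, so the 128-column query should be the first to fail.
for n in (100, 127, 128, 255):
    df = pd.DataFrame({f"x{i}": [1] for i in range(n)})
    sql = f"SELECT {'+'.join(df.columns)} AS sum FROM df"
    try:
        duckdb.sql(sql).fetchall()
        print(n, "binds")
    except duckdb.BinderException:
        print(n, "fails to bind")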

MarcoGorelli commented 2 months ago

interesting, thanks - sorry I thought I'd tried that, I must've mixed up objects 😳

are you bothered about the perf of this operation or should we close?

cpcloud commented 2 months ago

The comparison isn't quite apples to apples, because we make every polars input lazy.

Here's a script that compares lazy polars, Ibis + DuckDB, and DuckDB raw SQL:

from functools import reduce
from operator import add

import duckdb
import numpy as np
import polars as pl

import ibis

rng = np.random.default_rng(1)

arr = rng.integers(-100, 100, size=(100, 100))

df = pl.DataFrame(arr).lazy()

t = ibis.memtable(df)

pl_expr = df.select(pl.sum_horizontal("*"))

ibis_expr = t.select(SUM=reduce(add, map(t.__getitem__, t.columns)))

duckdb_expr = duckdb.sql(f"SELECT {'+'.join(df.columns)} AS sum FROM df")

Running this in IPython, there's still a notable performance difference between Ibis + DuckDB and DuckDB raw SQL:

In [2]: %timeit pl_expr.collect()
403 μs ± 2.99 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit ibis_expr.to_polars()
72.6 ms ± 679 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit duckdb_expr.pl()
5.88 ms ± 109 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Ibis can't do better than DuckDB raw SQL. The gap between Polars and DuckDB would have to be closed on the DuckDB side.

That said, definitely interested in closing the Ibis + DuckDB -> DuckDB raw SQL gap.
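
As a rough way to attribute that gap, the expression-to-SQL compilation step can be timed on its own via ibis.to_sql; a minimal sketch, assuming the ibis_expr from the script above is still in scope:

from timeit import timeit

import ibis

# Time only the ibis expression -> DuckDB SQL compilation, to separate
# compile overhead from anything DuckDB does at execution time.
n = 100
secs = timeit(lambda: ibis.to_sql(ibis_expr, dialect="duckdb"), number=n)
print(f"compile only: {1000 * secs / n:.2f} ms per call")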

MarcoGorelli commented 2 months ago

cool, thanks for explaining!

jcrist commented 2 months ago

Thanks @cpcloud for digging in a bit here (I'm responding by phone; no computer on vacation -> no way to validate the error :)). Agree there's not much we can do about a recursion error coming from duckdb itself!

jcrist commented 2 months ago

Update: this has turned up a few performance bugs in the ibis expr -> sql codepath that I've resolved (#10007, #10011). Here's what we see on main now (copied from https://github.com/ibis-project/ibis/pull/10011#issuecomment-2327800418):

from functools import reduce
from operator import add
from timeit import Timer

import pyarrow as pa

import ibis

con = ibis.duckdb.connect()

N = 100
data = pa.Table.from_pydict({f"x{i}": [1, 2, 3] for i in range(N)})
t = con.create_table("test", data)
expr = t.select(sum=reduce(add, map(t.__getitem__, t.columns)))
sql = f"SELECT {'+'.join(t.columns)} AS sum FROM test"

for name, func in [
    ("duckdb", lambda: con.raw_sql(sql).arrow()),
    ("ibis-duckdb", lambda: expr.to_pyarrow()),
]:
    n, secs = Timer(func).autorange()  # avoid shadowing the table t above
    print(f"{name}: {1000 * secs / n:.1f} ms")

$ python bench.py
duckdb: 2.6 ms
ibis-duckdb: 7.3 ms

There's still some low-hanging fruit for improving the performance of generating the sum via reduce (which basically forms a long linked list of IR nodes). Beyond that, I do think we'll add a specific operation for expressing rowwise aggregates, for usability's sake. For some databases there are more efficient ways of spelling this than just col1 + col2 + ..., but for many that's what we'd fall back on. Rowwise aggregates still seem likely to be a symptom of poor data modelling, but adding a user shorthand is easy and lets us generate better code for the backends where that matters.
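
In the meantime, a hypothetical user-side helper (not an ibis API) can sidestep the deep left fold by combining columns pairwise, keeping the expression tree roughly log2(n) deep instead of n deep; this targets the Python-side cost of the long IR chain, not DuckDB's binder limit:

from functools import reduce
from operator import add

# Hypothetical helper (not an ibis API): combine expressions pairwise so
# the tree is ~log2(n) deep rather than n deep, avoiding the long linked
# list of IR nodes described above. Assumes the table t from earlier.
def balanced_sum(exprs):
    exprs = list(exprs)
    while len(exprs) > 1:
        exprs = [reduce(add, exprs[i : i + 2]) for i in range(0, len(exprs), 2)]
    return exprs[0]

expr = t.select(SUM=balanced_sum(t[c] for c in t.columns))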

I'm going to leave this open to track some more perf work, and will open a new issue to propose a rowwise method afterward.