Open MarcoGorelli opened 2 months ago
Thanks for opening this. The recursion error is definitely fixable (I have an idea how we might resolve this). Definitely worth fixing.
Please introduce an expression in Ibis which can efficiently make use of Polars' sum_horizontal
Are you asking for a new sum_horizontal
method in ibis? Or just for the above to be more efficient compilation-wise? If it's the former - we don't have a pattern for horizontal reductions right now, and I've never personally run into a use case where this comes up. Can you provide an example realistic motivating use case where summing horizontally across a large number of columns comes up? Thanks!
Or just for the above to be more efficient compilation-wise?
this is fine, anything that's as good as (or close to as good as) sum_horizontal
/ duckdb's native '+'
string concatenation works, thanks 🙏
Can you provide an example realistic motivating use case where summing horizontally across a large number of columns comes up? Thanks!
In my case it's in a dataset which has assets as rows and dates as columns (granted, one can unpivot and do calculations vertically, but we can't guarantee that that that's how users would do it). If perf isn't worth prioritising for this operation then no bother, but at least the recursion error would be good to see addressed 👍
The 128 limit is not an Ibis issue, it's a limit in DuckDB's expression parsing regardless of how Ibis parenthesizes expressions.
It's reproducible without Ibis. I'm not sure what you're testing.
Here'e a copy-pastable fully reproducible example:
import duckdb
import pandas as pd
df = pd.DataFrame({f"x{i}": [1] for i in range(128)})
sql = f"SELECT {'+'.join(df.columns)} AS sum FROM df"
duckdb.sql(sql)
Gives
Traceback (most recent call last):
File "/home/cloud/src/github.com/ibis/test.py", line 9, in <module>
duckdb.sql(sql)
File "/nix/store/rfhaw95gr6cgac88vxbkyg86r77la5ax-python3-3.10.14-env/lib/python3.10/site-packages/duckdb/__init__.py", line 457, in sql
return conn.sql(query, **kwargs)
duckdb.duckdb.BinderException: Binder Error: Maximum recursion depth exceeded (Maximum: 128) while binding "df.x0"
interesting, thanks - sorry I thought I'd tried that, I must've mixed up objects 😳
are you bothered about the perf of this operation or should we close?
The comparison isn't quite apples to apples, because we make every polars input lazy.
Here's a script that compares lazy polars, Ibis + DuckDB, DuckDB raw SQL:
from functools import reduce
from operator import add
import duckdb
import numpy as np
import polars as pl
import ibis
rng = np.random.default_rng(1)
arr = rng.integers(-100, 100, size=(100, 100))
df = pl.DataFrame(arr).lazy()
t = ibis.memtable(df)
pl_expr = df.select(pl.sum_horizontal("*"))
ibis_expr = t.select(SUM=reduce(add, map(t.__getitem__, t.columns)))
duckdb_expr = duckdb.sql(f"SELECT {'+'.join(df.columns)} AS sum FROM df")
Running this in IPython, there's still a notable performance difference between Ibis + DuckDB and DuckDB raw SQL:
In [2]: %timeit pl_expr.collect()
403 μs ± 2.99 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [3]: %timeit ibis_expr.to_polars()
72.6 ms ± 679 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit duckdb_expr.pl()
5.88 ms ± 109 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ibis can't do better than DuckDB raw SQL. The gap between Polars and DuckDB would have to be closed on the DuckDB side.
That said, definitely interested in closing the Ibis + DuckDB -> DuckDB raw SQL gap.
cool, thanks for explaining!
Thanks @cpcloud for digging in a bit here (I'm responding by phone; no computer on vacation -> no way to validate the error :)). Agree there's not much we can do about a recursion error coming from duckdb itself!
Update: this has turned up a few performance bugs in the ibis expr -> sql
codepath that I've resolved (#10007, #10011). Here's what we see on main now (copied from https://github.com/ibis-project/ibis/pull/10011#issuecomment-2327800418):
from functools import reduce
from operator import add
from timeit import Timer
import pyarrow as pa
import ibis
con = ibis.duckdb.connect()
N = 100
data = pa.Table.from_pydict({f"x{i}": [1, 2, 3] for i in range(N)})
t = con.create_table("test", data)
expr = t.select(sum=reduce(add, map(t.__getitem__, t.columns)))
sql = f"SELECT {'+'.join(t.columns)} AS sum FROM test"
for name, func in [
("duckdb", lambda: con.raw_sql(sql).arrow()),
("ibis-duckdb", lambda: expr.to_pyarrow()),
]:
n, t = Timer(func).autorange()
print(f"{name}: {1000 * t / n:.1f} ms")
$ python bench.py
duckdb: 2.6 ms
ibis-duckdb: 7.3 ms
There's still some low hanging fruit for improving performance of generating the sum using reduce
(which basically forms a long linked list of IR nodes). Beyond that, I do think we'll add a specific operation for expressing rowwise aggregates for usability. For some databases there are more efficient ways of spelling this than just col1 + col2 + ...
, but for many that's what we'd fall back on. Rowwise aggregates still seem likely caused by poor data modelling, but adding a user shorthand is easy and does let us generate better code for backends where that would matter.
I'm going to leave this open to track some more perf work, and will open a new issue to propose a rowwise
method afterward.
What happened?
I ran this:
With 255 columns (as in the financial data I was trying this out with) I get
as reported here https://ibis-project.zulipchat.com/#narrow/stream/405263-general/topic/.E2.9C.94.20sum.28axis.3D1.29
Running duckdb natively I could just do
which is also fast (and doesn't crash on 255 columns)
Please introduce an expression in Ibis which can efficiently make use of Polars'
sum_horizontal
/ duckdb's'+'
What version of ibis are you using?
9.3.0
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct