Closed MarcoGorelli closed 2 months ago
Hi, thanks for opening this!
That is indeed a large slowdown - we'd expect a small constant-factor slowdown using ibis for most backends (negligible for realistic data sizes). The pandas backend is the exception here since it requires direct dataframe execution and lacks an internal query optimizer. Ibis currently doesn't do much internally to optimize user queries when run using the pandas
backend.
We generally don't recommend users use the pandas
backend itself - the duckdb
backend (our default backend) can operate natively on pandas dataframes (as well as polars dataframes or pyarrow tables), and is significantly faster and more memory efficient than even the native pandas code.
While we do maintain a pure-pandas backend, we haven't invested a ton of time into improving its performance since the duckdb
backend is generally a better local execution option in every metric. We'd happily accept PRs or diagnostic help for improving the efficiency here, but don't currently intend on doing much work on that front ourselves in the near future.
thanks for your reply
We generally don't recommend users use the pandas backend itself
just out of interest, why support it at all?
Ibis currently doesn't do much internally to optimize user queries when run using the pandas backend.
Just FYI, neither does Narwhals, yet it doesn't have any noticeable overhead when transpiling to pandas
just out of interest, why support it at all?
IMO mostly historical reasons at this point. It does work, so a use case of developing locally against the pandas
backend before deploying on some other system (e.g. bigquery
, ...) is valid and worth maintaining.
Given time, we (or ideally someone else :) ) would love to improve the backend to better optimize pandas queries. We now have the internals needed to support that, but haven't spent the time yet to finish the work. For a small team supporting 21 backends, the pandas
backend is currently low priority.
Just FYI, neither does Narwhals, yet it doesn't have any noticeable overhead when transpiling to pandas
That's great! There's likely some low-hanging fruit in ibis's pandas backend that we could fix to improve perf here.
Also, FWIW I think narwhals
is a nice option for library authors wanting to generically consume common dataframe-like inputs. If you'd be interested, I'd love to see ibis
as a possible narwhals
backend as well.
Makes sense, good to have priorities in order and not get distracted 👍
Thanks! Glad you get the spirit of the project 🙌 I did think about supporting Ibis directly, but got blocked by some operations not being supported https://github.com/ibis-project/ibis/issues/8080, so I decided to instead focus on dataframes. Hope I'm not getting too off-topic here 😄
This isn't incorrect behavior inside of Ibis, so I'm going to remove the bug
label.
Closing this as a wontfix. We aren't prioritizing the performance of the pandas backend.
Hey 👋 I've got a case where Ibis slows pandas down ~180x, are you interested in a reproducer or would that still be not planned?
What happened?
If I run pandas via Ibis, it's 4 times slower than running pandas directly. I think some overhead is expected, but not this much?
What version of ibis are you using?
9.0.0
What backend(s) are you using, if any?
pandas
Relevant log output
I've got a notebook here which reproduces this: https://www.kaggle.com/code/marcogorelli/narwhals-tpch-q1-s2?scriptVersionId=182594494
In short, it's:
vs
(note: it's the version I linked to, i.e. "version 26", that has the Ibis comparison - if you refresh the notebook and get a newer version, like version 27, then that one won't have Ibis)
I haven't published the comparison in public, don't worry
It just strikes me as not-intended behaviour, so I wanted to report it to help Ibis devs
Code of Conduct