ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.08k stars 586 forks source link

perf(pandas): "pandas via ibis" is 4 times slower than running pandas directly #9345

Closed MarcoGorelli closed 2 months ago

MarcoGorelli commented 3 months ago

What happened?

If I run pandas via Ibis, it's 4 times slower than running pandas directly. I think some overhead is expected, but not this much?

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

pandas

Relevant log output

I've got a notebook here which reproduces this: https://www.kaggle.com/code/marcogorelli/narwhals-tpch-q1-s2?scriptVersionId=182594494

In short, it's:

def q1_ibis(lineitem):
    var1 = datetime(1998, 9, 2)
    lineitem = lineitem.mutate(
        disc_price=lineitem["l_extendedprice"] * (1 - lineitem["l_discount"]),
        charge=(
            lineitem["l_extendedprice"]
            * (1.0 - lineitem["l_discount"])
            * (1.0 + lineitem["l_tax"])
        ),
    )
    q_final = (
        lineitem.filter(lineitem["l_shipdate"] <= var1)
        .group_by(["l_returnflag", "l_linestatus"])
        .aggregate(
            sum_qty=lineitem["l_quantity"].sum(),
            sum_base_price=lineitem["l_extendedprice"].sum(),
            sum_disc_price=(lineitem['disc_price'].sum()),
            sum_charge=(lineitem['charge'].sum()),
            avg_qty=lineitem["l_quantity"].mean(),
            avg_price=lineitem["l_extendedprice"].mean(),
            avg_disc=lineitem["l_discount"].mean(),
            count_order=lambda lineitem: lineitem.count(),
        )
        .order_by(["l_returnflag", "l_linestatus"])
    )
    return q_final.to_pandas()

con = ibis.pandas.connect()
q1_ibis(con.read_parquet(lineitem, engine='pyarrow', dtype_backend='pyarrow'))

vs

from datetime import date

def q1_pandas_native(lineitem):
    VAR1 = date(1998, 9, 2)

    sel = lineitem.l_shipdate <= VAR1
    lineitem_filtered = lineitem[sel]

    # This is lenient towards pandas as normally an optimizer should decide
    # that this could be computed before the groupby aggregation.
    # Other implementations don't enjoy this benefit.
    lineitem_filtered["disc_price"] = lineitem_filtered.l_extendedprice * (
        1 - lineitem_filtered.l_discount
    )
    lineitem_filtered["charge"] = (
        lineitem_filtered.l_extendedprice
        * (1 - lineitem_filtered.l_discount)
        * (1 + lineitem_filtered.l_tax)
    )
    gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"], as_index=False)

    total = gb.agg(
        sum_qty=pd.NamedAgg(column="l_quantity", aggfunc="sum"),
        sum_base_price=pd.NamedAgg(column="l_extendedprice", aggfunc="sum"),
        sum_disc_price=pd.NamedAgg(column="disc_price", aggfunc="sum"),
        sum_charge=pd.NamedAgg(column="charge", aggfunc="sum"),
        avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean"),
        avg_price=pd.NamedAgg(column="l_extendedprice", aggfunc="mean"),
        avg_disc=pd.NamedAgg(column="l_discount", aggfunc="mean"),
        count_order=pd.NamedAgg(column="l_orderkey", aggfunc="size"),
    )

    result_df = total.sort_values(["l_returnflag", "l_linestatus"])

    return result_df  # type: ignore[no-any-return]
q1_pandas_native(pd.read_parquet(lineitem, engine='pyarrow', dtype_backend='pyarrow'))

(note: it's the version I linked to, i.e. "version 26", that has the Ibis comparison - if you refresh the notebook and get a newer version, like version 27, then that one won't have Ibis)

I haven't published the comparison in public, don't worry

It just strikes me as not-intended behaviour, so I wanted to report it to help Ibis devs

Code of Conduct

jcrist commented 3 months ago

Hi, thanks for opening this!

That is indeed a large slowdown - we'd expect a small constant-factor slowdown using ibis for most backends (negligible for realistic data sizes). The pandas backend is the exception here since it requires direct dataframe execution and lacks an internal query optimizer. Ibis currently doesn't do much internally to optimize user queries when run using the pandas backend.

We generally don't recommend users use the pandas backend itself - the duckdb backend (our default backend) can operate natively on pandas dataframes (as well as polars dataframes or pyarrow tables), and is significantly faster and more memory efficient than even the native pandas code.

While we do maintain a pure-pandas backend, we haven't invested a ton of time into improving its performance since the duckdb backend is generally a better local execution option in every metric. We'd happily accept PRs or diagnostic help for improving the efficiency here, but don't currently intend on doing much work on that front ourselves in the near future.

MarcoGorelli commented 3 months ago

thanks for your reply

We generally don't recommend users use the pandas backend itself

just out of interest, why support it at all?

Ibis currently doesn't do much internally to optimize user queries when run using the pandas backend.

Just FYI, neither does Narwhals, yet it doesn't have any noticeable overhead when transpiling to pandas

jcrist commented 3 months ago

just out of interest, why support it at all?

IMO mostly historical reasons at this point. It does work, so a use case of developing locally against the pandas backend before deploying on some other system (e.g. bigquery, ...) is valid and worth maintaining.

Given time, we (or ideally someone else :) ) would love to improve the backend to better optimize pandas queries. We now have the internals needed to support that, but haven't spent the time yet to finish the work. For a small team supporting 21 backends, the pandas backend is currently low priority.

Just FYI, neither does Narwhals, yet it doesn't have any noticeable overhead when transpiling to pandas

That's great! There's likely some low-hanging fruit in ibis's pandas backend that we could fix to improve perf here.

Also, FWIW I think narwhals is a nice option for library authors wanting to generically consume common dataframe-like inputs. If you'd be interested, I'd love to see ibis as a possible narwhals backend as well.

MarcoGorelli commented 3 months ago

Makes sense, good to have priorities in order and not get distracted 👍

Thanks! Glad you get the spirit of the project 🙌 I did think about supporting Ibis directly, but got blocked by some operations not being supported https://github.com/ibis-project/ibis/issues/8080, so I decided to instead focus on dataframes. Hope I'm not getting too off-topic here 😄

cpcloud commented 3 months ago

This isn't incorrect behavior inside of Ibis, so I'm going to remove the bug label.

cpcloud commented 2 months ago

Closing this as a wontfix. We aren't prioritizing the performance of the pandas backend.

MarcoGorelli commented 2 months ago

Hey 👋 I've got a case where Ibis slows pandas down ~180x, are you interested in a reproducer or would that still be not planned?