ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations #9405

Closed: MarcoGorelli closed this issue 4 months ago

MarcoGorelli commented 5 months ago

What happened?

I reported this here, and was asked to open an issue

What version of ibis are you using?

9.1.0

What backend(s) are you using, if any?

duckdb

Relevant log output

The notebook where I noticed this is here

Here's a smaller snippet to reproduce:

import pandas as pd
import polars as pl
import numpy as np
import narwhals as nw
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)

def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

@nw.narwhalify
def add_lags_narwhals(df, cols, lags):
    return df.with_columns(
        nw.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df.with_row_index())
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars().sort('index').drop('index')

and then just run

add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])

and

add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
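For reference, every variant above computes the same thing: `shift(-lag)` moves a column up by `lag` rows and pads the tail with nulls, and each new column `f'{col}_{lag}'` is just the original column shifted. A pure-Python sketch of those semantics, for illustration only (`shift_neg` is a hypothetical helper name, not part of any of these libraries):

```python
def shift_neg(values, lag):
    """Shift a sequence up by `lag` positions, padding the tail with None
    (the semantics of shift(-lag) in pandas/Polars, with None standing in
    for NaN/null)."""
    return list(values[lag:]) + [None] * lag

a = [10, 20, 30, 40, 50]
shift_neg(a, 1)  # [20, 30, 40, 50, None]
shift_neg(a, 2)  # [30, 40, 50, None, None]
```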


MarcoGorelli commented 5 months ago

Aside from performance, there might be a bug in the duckdb backend?


In [16]: add_lags_ibis_pandas(df_pl, ['a'], [1])  # looks correct
Out[16]:
                a         b         c       a_1
0        0.345584  0.164416 -0.407465  0.821618
1        0.821618 -0.562795 -0.062351  0.330437
2        0.330437  1.523051  0.519244 -1.303157
3       -1.303157  0.573010 -1.722321  0.905356
4        0.905356  0.629068  1.522420  0.446375
...           ...       ...       ...       ...
9999995 -1.844940 -1.460037 -0.420211  1.426008
9999996  1.426008 -1.786794  1.411938 -1.770208
9999997 -1.770208 -1.234044  1.061642 -0.293813
9999998 -0.293813 -1.985160 -0.134314 -0.870312
9999999 -0.870312 -1.305430 -0.965417       NaN

[10000000 rows x 4 columns]

In [17]: add_lags_ibis_polars(df_pl, ['a'], [1])  # what happened here?
Out[17]:
shape: (10_000_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         ┆ a_1       │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.008919  ┆ 0.641422  ┆ 0.376592  ┆ 0.927814  │
│ 0.927814  ┆ -1.595148 ┆ -1.321503 ┆ -0.216802 │
│ -0.216802 ┆ 0.163541  ┆ -0.839665 ┆ 0.627737  │
│ 0.627737  ┆ 0.199048  ┆ -0.055465 ┆ 2.124896  │
│ 2.124896  ┆ -0.484083 ┆ -1.280158 ┆ -0.120967 │
│ …         ┆ …         ┆ …         ┆ …         │
│ -0.156736 ┆ -0.267561 ┆ -0.548102 ┆ -1.527322 │
│ -1.527322 ┆ -0.172108 ┆ -0.840629 ┆ 0.883683  │
│ 0.883683  ┆ -1.140378 ┆ -1.907993 ┆ -0.471029 │
│ -0.471029 ┆ -0.854808 ┆ -0.36428  ┆ 2.66472   │
│ 2.66472   ┆ 0.497843  ┆ -0.089645 ┆ 0.050035  │
└───────────┴───────────┴───────────┴───────────┘
MarcoGorelli commented 5 months ago

Regarding the bug: the results are correct up until N = 8192 (2**13); from N = 8193 onwards, it breaks

cpcloud commented 5 months ago

There's no guaranteed ordering of output results without an explicit call to order_by.

If the 8193rd element differs, that's highly indicative that 8192 is some kind of buffer size for a unit of processing inside duckdb.
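The missing ordering guarantee is a property of SQL engines generally, not just DuckDB: a window function needs an explicit ORDER BY to define which row is "next". A minimal sketch of the lag-with-explicit-order pattern, using SQLite's LEAD window function in place of DuckDB (table and column names here are illustrative; requires SQLite >= 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (idx INTEGER, a REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i * 0.5) for i in range(5)])

# Without ORDER BY inside OVER(), the engine may process rows in any order;
# ordering by an explicit row index makes "the next row" well defined.
rows = conn.execute(
    "SELECT a, LEAD(a, 1) OVER (ORDER BY idx) AS a_1 FROM t ORDER BY idx"
).fetchall()
# rows == [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0), (2.0, None)]
```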

MarcoGorelli commented 5 months ago

OK thanks, I've updated the example to sort by index afterwards, so that the input order is preserved at the end for the user. But I'll keep sorting out of the benchmark.

On Zulip you wrote

That last row doesn't look right. Can you open an issue about it with a fully reproducible example?

Is that still the case? Does this require investigation on your end, or is it just the kind of operation which one might expect to be slow?

cpcloud commented 5 months ago

It requires some investigation.

Can you show some concrete numbers for the 10,000x?

I see about 280x using this script:

script with `time`:

```python
import pandas as pd
import time
import polars as pl
import numpy as np
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)

def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars()

start = time.time()
add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"polars: {stop-start:.2f}s")

start = time.time()
add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"ibis: {stop-start:.2f}s")
```
cpcloud commented 5 months ago

Also, the polars performance varies by almost a factor of ~~10~~ 5 between runs; not sure how to factor that in here!

cpcloud commented 5 months ago

I suspect I am now hitting the OS page cache, and the first run wasn't :)

MarcoGorelli commented 5 months ago

Can you show some concrete numbers for the 10,000x?

They're in the notebook https://www.kaggle.com/code/marcogorelli/narwhals-could-you-just-use-ibis-instead, which you can fork and run if you like

The timings come from running it on Kaggle, running it locally may differ

There, I report on the minimum, maximum, and average times:

Polars native
min: 0.0007532539999980751
max: 0.044216634333338334
0.006996648095236782 +/- 0.005743185736385091
Polars via Narwhals
min: 0.0009312336666577418
max: 0.00101599099999324
0.0009707789999966041 +/- 9.109105595347205e-06
Polars dataframe, Ibis with duckdb backend
min: 13.30594426766667
max: 13.773238605666657
13.551140615047618 +/- 0.05954104203620545
pandas native
min: 1.535507855999981
max: 1.6537581420000151
1.5739719158095309 +/- 0.01732459334764224
pandas via Narwhals
min: 0.5989258073333303
max: 0.8326186953333187
0.6507151033333359 +/- 0.0285118691222816
pandas dataframe, Ibis pandas backend
min: 2.6033352826666487
max: 3.050078681666681
2.681404633904755 +/- 0.05706747288901051

The numbers in the table at the bottom are the minimums, as my understanding is that that's best practice when benchmarking: https://docs.python.org/3/library/timeit.html#module-timeit

Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
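The min-of-repeats approach quoted above can be expressed directly with the standard library; a minimal sketch, where the lambda is a stand-in for any of the benchmarked functions:

```python
import timeit

# repeat() returns one total elapsed time per run; per the timeit docs,
# min() is the least-noisy estimate of how fast the machine can go.
times = timeit.repeat(lambda: sum(range(1000)), number=100, repeat=5)
best = min(times)
print(f"best of {len(times)} runs: {best:.6f}s")
```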

cpcloud commented 5 months ago

In general when making these comparisons it would be really helpful if you could compare with the duckdb SQL API, so that we can rule out Ibis as the source of the problem here and funnel the report upstream.

cpcloud commented 5 months ago

Fascinating results I'm seeing:

  1. polars is definitely the fastest here on this operation
  2. duckdb with raw sql is taking up most of the time, with Ibis + duckdb about 2x slower than duckdb alone

I gave duckdb every advantage here:

  1. not timing the connection
  2. creating a table in the native format from polars instead of using their magic view functionality.

I'm going to repurpose this issue for the ibis/duckdb discrepancy and open up an issue on the duckdb tracker for the lag performance.

That said, I think this is a weird benchmark; it's not really representative of any particular workload. Data are hardly ever normally distributed, randomly generated numbers.

MarcoGorelli commented 5 months ago

Thanks for your response 🙏

not really representative of any particular workload

The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/163684

Data are hardly ever normally distributed randomly generated numbers

It's the efficiency of the shift operator that makes the difference here, rather than the distribution of the input values

As noted on Zulip, I posted this in order to answer the question which was posed to me by someone at Voltron: "why don't you just use Ibis?", after I'd presented Narwhals and how scikit-lego is using it at PyCon Italy. I tried to answer that question by taking a scikit-lego function, making some random data, and timing the relative overheads of Narwhals vs Ibis.

Personally, I think this answers the question. If you'd like to suggest a better or different benchmark (one that still addresses the question), I'd be more than happy to time that too 🤗

Or we can loop in the person who asked the question if they meant something different by "why don't you just use Ibis?" - the ball was thrown in my court, I'm just trying to respond 😄

cpcloud commented 5 months ago

The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting

Indeed, lagging itself is a common operation for sure!

Personally, I think this answers the question. If you like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too

I still think framing this as Ibis vs Narwhals is misleading.

The performance difference observed has basically nothing to do with Ibis; it's concentrated in DuckDB. So this is about Polars and DuckDB, not Ibis and Narwhals.

cpcloud commented 5 months ago

That said, it's up to you. I'd just ask that you link to this issue so folks get the full context.

MarcoGorelli commented 5 months ago

The purpose of Narwhals is to enable library maintainers to write functions such as:

  1. pandas/Polars/etc goes in
  2. something happens under the hood
  3. the user gets pandas/Polars/etc back

When I was asked "why not use Ibis", they presumably meant "why not use Ibis for step 2"? Would you agree?

If so, do you have a better suggestion for how to answer your colleague's question?

I'd just ask that you link to this issue so folks get the full context.

Sure 😇

cpcloud commented 5 months ago

Presumably if narwhals had duckdb support then this question wouldn't be that interesting to answer, because you'd see basically the same thing you'd see with Ibis.

cpcloud commented 5 months ago

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

MarcoGorelli commented 5 months ago

DuckDB is out-of-scope for Narwhals, happy to leave that to you!

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

that, and that it makes pandas noticeably slower, so it would have an impact on the existing pandas user base 😉

I won't publish this in public anyway (at least, not the duckdb part)

cpcloud commented 5 months ago

that, and that it makes pandas noticeably slower

Yep, and we've already addressed that elsewhere.

cpcloud commented 5 months ago

Created the DuckDB issue here.

MarcoGorelli commented 5 months ago

Thanks! Like you, I'm a fan of DuckDB, so if the end result is that things improve in DuckDB, that's a win for everyone

so this is about Polars and DuckDB

I'll just keep this result in my back pocket then, in case an "Ibis is faster than Polars"-style post gets published 😉

Thanks for engaging, have a nice day! 🌞

cpcloud commented 4 months ago

Closing this out as it's being tracked in https://github.com/duckdb/duckdb/issues/12600.