Closed: MarcoGorelli closed this 4 months ago
Aside from performance, there might be a bug in the duckdb backend?
In [16]: add_lags_ibis_pandas(df_pl, ['a'], [1]) # looks correct
Out[16]:
a b c a_1
0 0.345584 0.164416 -0.407465 0.821618
1 0.821618 -0.562795 -0.062351 0.330437
2 0.330437 1.523051 0.519244 -1.303157
3 -1.303157 0.573010 -1.722321 0.905356
4 0.905356 0.629068 1.522420 0.446375
... ... ... ... ...
9999995 -1.844940 -1.460037 -0.420211 1.426008
9999996 1.426008 -1.786794 1.411938 -1.770208
9999997 -1.770208 -1.234044 1.061642 -0.293813
9999998 -0.293813 -1.985160 -0.134314 -0.870312
9999999 -0.870312 -1.305430 -0.965417 NaN
[10000000 rows x 4 columns]
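The `add_lags_ibis_*` helpers themselves aren't reproduced in this thread; as a rough sketch of what they compute (name and signature inferred from the calls above, not the actual benchmark code), a plain-pandas equivalent could look like:

```python
import pandas as pd

def add_lags_pandas(df: pd.DataFrame, cols: list, lags: list) -> pd.DataFrame:
    # For each requested column/lag pair, add a shifted copy of the column.
    # Note the output above pairs each row with the *next* row's value
    # (a_1[i] == a[i+1], last row NaN), i.e. shift(-lag).
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_{lag}"] = out[col].shift(-lag)
    return out
```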
In [17]: add_lags_ibis_polars(df_pl, ['a'], [1]) # what happened here?
Out[17]:
shape: (10_000_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ a ┆ b ┆ c ┆ a_1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.008919 ┆ 0.641422 ┆ 0.376592 ┆ 0.927814 │
│ 0.927814 ┆ -1.595148 ┆ -1.321503 ┆ -0.216802 │
│ -0.216802 ┆ 0.163541 ┆ -0.839665 ┆ 0.627737 │
│ 0.627737 ┆ 0.199048 ┆ -0.055465 ┆ 2.124896 │
│ 2.124896 ┆ -0.484083 ┆ -1.280158 ┆ -0.120967 │
│ … ┆ … ┆ … ┆ … │
│ -0.156736 ┆ -0.267561 ┆ -0.548102 ┆ -1.527322 │
│ -1.527322 ┆ -0.172108 ┆ -0.840629 ┆ 0.883683 │
│ 0.883683 ┆ -1.140378 ┆ -1.907993 ┆ -0.471029 │
│ -0.471029 ┆ -0.854808 ┆ -0.36428 ┆ 2.66472 │
│ 2.66472 ┆ 0.497843 ┆ -0.089645 ┆ 0.050035 │
└───────────┴───────────┴───────────┴───────────┘
Regarding the bug: the results are correct up until N = 8192 (2**13). For N = 8193 onwards, it breaks.
There's no guaranteed ordering of output results without an explicit call to order_by.
If the 8193rd element differs, that's highly indicative that 8192 is some kind of buffer size for a unit of processing inside duckdb.
OK thanks, I've updated the example to sort by index afterwards, so that the input order is preserved at the end for the user. But I'll keep sorting out of the benchmark
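The pattern is to tag each row with its position before handing the frame to the backend, then sort by the tag afterwards. A hypothetical helper (not the actual benchmark code, where the sort stays outside the timed section):

```python
import pandas as pd

def with_stable_order(df: pd.DataFrame, compute) -> pd.DataFrame:
    # Engines like DuckDB don't guarantee output row order, so carry an
    # explicit index column through the computation and sort by it at the end.
    tagged = df.assign(_idx=range(len(df)))
    result = compute(tagged)  # the (possibly order-scrambling) backend step
    return result.sort_values("_idx").drop(columns="_idx").reset_index(drop=True)
```

Here `compute` stands in for the Ibis/DuckDB round trip.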
On Zulip you wrote
That last row doesn't look right. Can you open an issue about it with a fully reproducible example?
Is that still the case? Does this require investigation on your end, or is it just the kind of operation which one might expect to be slow?
It requires some investigation.
Can you show some concrete numbers for the 10,000x?
I see about 280x using this script:
Also the polars performance varies by almost a factor of ~~10~~ 5 between runs, not sure how to factor that in here!
I suspect I am now hitting the OS page cache, and the first run wasn't :)
Can you show some concrete numbers for the 10,000x?
They're in the notebook https://www.kaggle.com/code/marcogorelli/narwhals-could-you-just-use-ibis-instead, which you can fork and run if you like
The timings come from running it on Kaggle, running it locally may differ
There, I report on the minimum, maximum, and average times:
Polars native
min: 0.0007532539999980751
max: 0.044216634333338334
0.006996648095236782 +/- 0.005743185736385091
Polars via Narwhals
min: 0.0009312336666577418
max: 0.00101599099999324
0.0009707789999966041 +/- 9.109105595347205e-06
Polars dataframe, Ibis with duckdb backend
min: 13.30594426766667
max: 13.773238605666657
13.551140615047618 +/- 0.05954104203620545
pandas native
min: 1.535507855999981
max: 1.6537581420000151
1.5739719158095309 +/- 0.01732459334764224
pandas via Narwhals
min: 0.5989258073333303
max: 0.8326186953333187
0.6507151033333359 +/- 0.0285118691222816
pandas dataframe, Ibis pandas backend
min: 2.6033352826666487
max: 3.050078681666681
2.681404633904755 +/- 0.05706747288901051
The numbers in the table at the bottom are the minimum, as my understanding is that that's the best practice when benchmarking https://docs.python.org/3/library/timeit.html#module-timeit
Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
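Concretely, with `timeit.repeat` that looks like this (an illustrative snippet, not the benchmark itself):

```python
import timeit

# Run the snippet 5 times x 100 loops each and report the per-loop minimum:
# the lower bound is the number least affected by other processes.
raw = timeit.repeat("sum(range(1000))", repeat=5, number=100)
best = min(raw) / 100  # best per-call time, in seconds
print(f"min per-call time: {best:.2e}s")
```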
In general when making these comparisons it would be really helpful if you could compare with the duckdb SQL API, so that we can rule out Ibis as the source of the problem here and funnel the report upstream.
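For example, the raw window query one might hand to the DuckDB SQL API would be something like this (a sketch: it assumes the frame carries an explicit `idx` column to order by, and uses `LEAD` to match the next-row pairing shown in the outputs above):

```python
# Query string for the DuckDB SQL API, e.g. duckdb.sql(query).df() with a
# pandas DataFrame named `df` in scope. (duckdb itself isn't imported here;
# this only builds the query string for comparison against Ibis's output.)
query = """
SELECT a, b, c,
       LEAD(a, 1) OVER (ORDER BY idx) AS a_1
FROM df
ORDER BY idx
"""
```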
Fascinating results I'm seeing:
I gave duckdb every advantage here:
I'm going to repurpose this issue for the ibis/duckdb discrepancy and open up an issue on the duckdb tracker for the lag performance.
That said, I think this is a weird benchmark, it's not really representative of any particular workload. Data are hardly ever normally distributed randomly generated numbers.
Thanks for your response 🙏
not really representative of any particular workload
The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/163684
Data are hardly ever normally distributed randomly generated numbers
It's the efficiency of the shift operator that makes the difference here, rather than the distribution of the input values.
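A shift over a contiguous column is essentially one bulk copy plus a null fill, with no per-row window-function machinery, which is why columnar dataframes are so fast at it. A sketch with NumPy (illustrative, not how any of the libraries above implement it internally):

```python
import numpy as np

def shift_array(a: np.ndarray, n: int) -> np.ndarray:
    # Lag (n > 0) or lead (n < 0) in O(len(a)): one slice copy + NaN fill.
    out = np.full(a.shape, np.nan)
    if n >= 0:
        out[n:] = a[:len(a) - n]
    else:
        out[:n] = a[-n:]
    return out
```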
As noted on Zulip, I posted this in order to answer the question which was posed to me by someone at Voltron: "why don't you just use Ibis?", after I'd presented Narwhals and how scikit-lego is using it at PyCon Italy. I tried to answer that question by taking a scikit-lego function, making some random data, and timing the relative overheads of Narwhals vs Ibis.
Personally, I think this answers the question. If you'd like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too 🤗
Or we can loop in the person who asked the question if they meant something different by "why don't you just use Ibis?" - the ball was thrown in my court, I'm just trying to respond 😄
The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting
Indeed, lagging itself is a common operation for sure!
Personally, I think this answers the question. If you like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too
I still think framing this as an Ibis vs Narwhals comparison is misleading.
The performance difference observed has basically nothing to do with Ibis, it's concentrated in DuckDB, so this is about Polars and DuckDB, not Ibis and Narwhals.
That said, it's up to you. I'd just ask that you link to this issue so folks get the full context.
The purpose of Narwhals is to enable library maintainers to write functions such as:
When I was asked "why not use Ibis", they presumably meant "why not use Ibis for step 2"? Would you agree?
If so, do you have a better suggestion for how to answer your colleague's question?
I'd just ask that you link to this issue so folks get the full context.
Sure 😇
Presumably if narwhals had duckdb support then this question wouldn't be that interesting to answer, because you'd see basically the same thing you'd see with Ibis.
The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"
DuckDB is out-of-scope for Narwhals, happy to leave that to you!
The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"
that, and that it makes pandas noticeably slower, so it would have an impact on the existing pandas user base 😉
I won't publish this in public anyway (at least, not the duckdb part)
that, and that it makes pandas noticeably slower
Yep, and we've already addressed that elsewhere.
Thanks! Like you, I'm a fan of DuckDB, so if the end result is that things improve in DuckDB, that's a win for everyone
so this is about Polars and DuckDB
I'll just keep this result in my backpocket then in case a "Ibis is faster than Polars"-style post gets published then 😉
Thanks for engaging, have a nice day! 🌞
Closing this out as it's being tracked in https://github.com/duckdb/duckdb/issues/12600.
What happened?
I reported this here, and was asked to open an issue
What version of ibis are you using?
9.1.0
What backend(s) are you using, if any?
duckdb
Relevant log output
The notebook where I noticed this is here
Here's a smaller snippet to reproduce:
and then just run
and