The M5 Forecasting Competition was held on Kaggle in 2020, and top solutions generally featured a lot of heavy feature engineering.
Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much better Polars would have been at that task.
I think this is good to benchmark, as it reflects the kinds of gains that people doing applied data science can expect from using Polars.
Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook
Run with SMALL=True for testing, then SMALL=False to run with the original dataset (full size).
Anyone fancy translating to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte - which is fine, DuckDB is incredibly good at many other things - but I think a friendly comparison involving this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"