Time Series benchmark

The M5 Forecasting Competition was held on Kaggle in 2020, and the top solutions generally relied on heavy feature engineering.

Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much better Polars would have been at that task.

I think this is good to benchmark, as:

- the competition was run on real-world Walmart data
- the operations we're benchmarking are from the winning solution, so evidently they were doing something right

I think this reflects the kinds of gains that people doing applied data science can expect from using Polars on feature-engineering workloads like this one.

Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook

Run with SMALL=True for testing, then with SMALL=False to run on the original, full-size dataset.

Anyone fancy translating the queries to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte, which is fine: DuckDB is incredibly good at many other things. I think a friendly comparison involving this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"

pola-rs / polars-benchmark

Time Series benchmark #135