ritchie46 opened this issue 3 years ago
Benchmarks are most certainly the plan! Any preference on how, though? Part of me likes the idea of running them on GitHub Actions, but I'm wondering whether they provide consistent hardware. There's also the question of which datasets to use and where to host them. I can also imagine we may want to version the functions; after all, there may be multiple ways to implement "sessionize".
Come to think of it, do we really want to download large datasets and run potentially long-running benchmarks in Github Actions?
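Picking up the earlier point that there may be multiple ways to implement "sessionize": one common approach is a gap-based cumulative sum. A sketch in plain pandas (the function name and the 30-minute threshold are illustrative assumptions, not anything decided here):

```python
import pandas as pd

def sessionize(ts: pd.Series, gap: str = "30min") -> pd.Series:
    """Assign a session id that increments whenever the gap between
    consecutive (sorted) timestamps exceeds `gap`."""
    ts = ts.sort_values()
    new_session = ts.diff() > pd.Timedelta(gap)
    # First row's diff is NaT, which compares False, so it starts session 0.
    return new_session.cumsum()

stamps = pd.to_datetime([
    "2021-01-01 10:00", "2021-01-01 10:10",  # session 0
    "2021-01-01 11:00", "2021-01-01 11:05",  # session 1 (50-minute gap)
])
print(sessionize(pd.Series(stamps)).tolist())  # → [0, 0, 1, 1]
```

A window-function or group-by variant would give different performance characteristics, which is exactly why versioning the implementations could matter.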
We could create a CI job that only runs on manual triggers. In the polars repo, we generate the datasets instead of downloading them.
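A manually triggered workflow could look roughly like this, using GitHub Actions' `workflow_dispatch` event (file name and script paths are hypothetical):

```yaml
# .github/workflows/benchmark.yml (hypothetical)
name: benchmarks
on:
  workflow_dispatch:  # runs only when triggered manually from the Actions tab
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate datasets
        run: python scripts/generate_datasets.py  # hypothetical script
      - name: Run benchmarks
        run: python scripts/run_benchmarks.py     # hypothetical script
```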
The VMs are shared, but I do think that within a pipeline we get the same compute (not entirely sure, though), which would still make relative comparisons sensible within one run.
Fair enough. Let's start with GitHub CI just to keep things simple.
Where would you want to store the data from the benchmark results? Do we want to store the results of the runs in git?
Hmm.. that's maybe a good idea yes. We could store it in a separate clean branch. The whole benchmarking is a large todo still.
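If results do end up committed to a branch, a stable serialization format would help keep them diffable. A minimal sketch; every field name here is an assumption, not an agreed schema:

```python
import datetime
import json
import platform

def record_result(tool: str, query: str, seconds: float) -> str:
    """Serialize one benchmark result as a JSON line, ready to append
    to a file on a dedicated results branch."""
    return json.dumps({
        "tool": tool,
        "query": query,
        "seconds": round(seconds, 4),
        "python": platform.python_version(),
        "date": datetime.date.today().isoformat(),
    }, sort_keys=True)

line = record_result("polars", "sessionize", 0.1234)
```

One JSON object per line keeps commits append-only, so the results branch never produces merge conflicts between runs.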
I also want to run TPC-H benchmarks in the polars repo, which would need dedicated compute. I can imagine eventually setting up a database, etc.
I just got a base thing goin' on my local multiple dispatch branch.
What does TPC-H stand for?
Also, simulating some of these datasets is tricky. How might we properly simulate a session dataset?
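One way to fake session structure is a two-scale gap model: gaps within a session are on the order of seconds, gaps between sessions on the order of hours. A sketch with NumPy, where every distribution and constant is an illustrative assumption:

```python
import numpy as np

def simulate_sessions(n_users: int, sessions_per_user: int, rng=None):
    """Return (user_id, timestamp) rows with session structure baked in:
    seconds-scale gaps inside a session, hours-scale gaps between them."""
    if rng is None:
        rng = np.random.default_rng(42)
    rows = []
    for user in range(n_users):
        t = 0.0
        for _ in range(sessions_per_user):
            t += rng.exponential(3600 * 4)        # ~hours between sessions
            for _ in range(rng.integers(2, 20)):  # events inside one session
                t += rng.exponential(30)          # ~seconds within a session
                rows.append((user, t))
    return rows

events = simulate_sessions(n_users=3, sessions_per_user=5)
```

Because the two gap scales are far apart, any reasonable sessionize implementation should recover the planted sessions, which makes the output verifiable.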
I was thinking about building a memo script, but I'm open to other ideas too. It kind of depends on how accurate you'd like these numbers to be. There's also stuff like measuring parquet vs. csv and/or the number of CPUs.
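On accuracy: best-of-n wall-clock timing is a cheap way to reduce noise on shared runners. A minimal harness sketch; the parquet/csv calls in the comment are hypothetical usage, not part of any existing script:

```python
import time

def time_it(fn, repeat: int = 3) -> float:
    """Return the best-of-n wall time of calling fn() with no arguments.
    Best-of-n discards runs slowed by unrelated load on a shared VM."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage for the parquet vs. csv comparison:
#   csv_s     = time_it(lambda: pd.read_csv("data.csv"))
#   parquet_s = time_it(lambda: pd.read_parquet("data.parquet"))
elapsed = time_it(lambda: sum(range(10_000)))
```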
As we compare different tools here, it would be cool to run the benchmarks from this repo.
Maybe in CI, and later maybe even a dedicated runner.
These could then be shown on the website. I am already assuming here that polars does great. :smile: