ritchie46 opened this issue 3 years ago
Benchmarks are most certainly the plan! Any preference on how, though? Part of me likes the idea of running them on GitHub Actions, but I'm wondering whether they provide consistent hardware. There's also the question of which datasets to use and where to host them. I can also imagine we may want to version the functions; after all, there may be multiple ways to implement "sessionize".
Come to think of it, do we really want to download large datasets and run potentially long-running benchmarks in Github Actions?
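Picking up the earlier point that there may be multiple ways to implement "sessionize": one common approach is a gap-based cumulative sum. A sketch in plain pandas (the function name and the 30-minute threshold are illustrative assumptions, not anything decided here):

```python
import pandas as pd

def sessionize(ts: pd.Series, gap: str = "30min") -> pd.Series:
    """Assign a session id that increments whenever the gap between
    consecutive (sorted) timestamps exceeds `gap`."""
    ts = ts.sort_values()
    new_session = ts.diff() > pd.Timedelta(gap)
    # First row's diff is NaT, which compares False, so it starts session 0.
    return new_session.cumsum()

stamps = pd.to_datetime([
    "2021-01-01 10:00", "2021-01-01 10:10",  # session 0
    "2021-01-01 11:00", "2021-01-01 11:05",  # session 1 (50-minute gap)
])
print(sessionize(pd.Series(stamps)).tolist())  # → [0, 0, 1, 1]
```

A window-function or group-by variant would give different performance characteristics, which is exactly why versioning the implementations could matter.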
We could create a CI job that only runs on manual triggers. In the polars repo, we generate the datasets instead of downloading them.
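A manually triggered workflow could look roughly like this, using GitHub Actions' `workflow_dispatch` event (file name and script paths are hypothetical):

```yaml
# .github/workflows/benchmark.yml (hypothetical)
name: benchmarks
on:
  workflow_dispatch:  # runs only when triggered manually from the Actions tab
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate datasets
        run: python scripts/generate_datasets.py  # hypothetical script
      - name: Run benchmarks
        run: python scripts/run_benchmarks.py     # hypothetical script
```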
The VMs are shared, but I do think that within a pipeline we get the same compute (not entirely sure, though), which would still make relative comparisons sensible within one run.
Fair enough. Let's start with GitHub CI just to keep things simple.
Where would you want to store the data from the benchmark results? Do we want to store the results of the runs in git?
Hmm.. that's maybe a good idea yes. We could store it in a separate clean branch. The whole benchmarking is a large todo still.
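If results do end up committed to a branch, a stable serialization format would help keep them diffable. A minimal sketch; every field name here is an assumption, not an agreed schema:

```python
import datetime
import json
import platform

def record_result(tool: str, query: str, seconds: float) -> str:
    """Serialize one benchmark result as a JSON line, ready to append
    to a file on a dedicated results branch."""
    return json.dumps({
        "tool": tool,
        "query": query,
        "seconds": round(seconds, 4),
        "python": platform.python_version(),
        "date": datetime.date.today().isoformat(),
    }, sort_keys=True)

line = record_result("polars", "sessionize", 0.1234)
```

One JSON object per line keeps commits append-only, so the results branch never produces merge conflicts between runs.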
I also want to run TPC-H benchmarks in the polars repo, which would need dedicated compute. I can imagine eventually setting up a database, etc.
I just got a base thing goin' on my local multiple dispatch branch.
What does TPC-H stand for?
Also, simulating some of these datasets is tricky. How might we properly simulate a session dataset?
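One way to fake session structure is a two-scale gap model: gaps within a session are on the order of seconds, gaps between sessions on the order of hours. A sketch with NumPy, where every distribution and constant is an illustrative assumption:

```python
import numpy as np

def simulate_sessions(n_users: int, sessions_per_user: int, rng=None):
    """Return (user_id, timestamp) rows with session structure baked in:
    seconds-scale gaps inside a session, hours-scale gaps between them."""
    if rng is None:
        rng = np.random.default_rng(42)
    rows = []
    for user in range(n_users):
        t = 0.0
        for _ in range(sessions_per_user):
            t += rng.exponential(3600 * 4)        # ~hours between sessions
            for _ in range(rng.integers(2, 20)):  # events inside one session
                t += rng.exponential(30)          # ~seconds within a session
                rows.append((user, t))
    return rows

events = simulate_sessions(n_users=3, sessions_per_user=5)
```

Because the two gap scales are far apart, any reasonable sessionize implementation should recover the planted sessions, which makes the output verifiable.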
I was thinking about building a memo script, but I'm open to other ideas too. It kind of depends on how accurate you'd like these numbers to be. There's also stuff like measuring parquet vs. csv and/or the number of CPUs.
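On accuracy: best-of-n wall-clock timing is a cheap way to reduce noise on shared runners. A minimal harness sketch; the parquet/csv calls in the comment are hypothetical usage, not part of any existing script:

```python
import time

def time_it(fn, repeat: int = 3) -> float:
    """Return the best-of-n wall time of calling fn() with no arguments.
    Best-of-n discards runs slowed by unrelated load on a shared VM."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage for the parquet vs. csv comparison:
#   csv_s     = time_it(lambda: pd.read_csv("data.csv"))
#   parquet_s = time_it(lambda: pd.read_parquet("data.parquet"))
elapsed = time_it(lambda: sum(range(10_000)))
```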
As we compare different tools here, it would be cool to run the benchmarks from this repo.
Maybe in CI, and later maybe even a dedicated runner.
These could then be shown on the website. I am already assuming here that polars does great. :smile: