Open FirefoxMetzger opened 1 year ago
I've done a bit of research on benchmarking and found it hard to come up with a good strategy. What do you propose?
Have a look at https://github.com/python/pyperformance which is used for benchmarking Python itself. It allows you to create a suite of test to regularly run.
The biggest challenge comes from having a properly configured benchmark machine, as a lot of factors can contribute to different benchmark results that have nothing to do with the actual code changes. But I still think it's worthwhile to check for unintended regression that are big enough to be caught by noisy benchmarking.
As long as you use the same hardware and follow some best practices for benchmarking, I think you'll be fine. Having some numbers may be better than none at all. We can always improve the benchmark setup once the noise starts to dominate.
Scikit-Image also maintains a benchmark suite. At the time I was contributing it was still being finalized, so I can't say how much it is actually enforced, but it could serve as inspiration:
Docs: https://scikit-image.org/docs/stable/contribute.html#benchmarks CI: https://github.com/scikit-image/scikit-image/blob/main/.github/workflows/benchmarks.yml
I really like the airspeed velocity tool that sk-image uses! Like Ivo indicates, to me the core issue has always been to compare results from one benchmark run to another, between machines and even the same machine over time. The concept that av brings to the table - just run the benchmark for older commits here and now on the same machine, and compare - is a pretty clever solution!
I've been working on a continuous benchmarking tool called Bencher that supports both use cases, either tracking benchmarks over time for comparison or using relative benchmarking (similar to asv
) : https://github.com/bencherdev/bencher
The idea is for it to be like a pyperformance
for your application code. Would that be helpful here?
Since we constantly talk about improving the performance of the routines provided by pylinalg, I was wondering if there are any plans on formally tracking benchmarks of various routines to make sure that we don't regress in terms of speed.