psf / pyperf

Toolkit to run Python benchmarks
http://pyperf.readthedocs.io/
MIT License

Add Hierarchical Performance Testing (HPT) technique to `compare_to`? #168

Open mdboom opened 1 year ago

mdboom commented 1 year ago

I recently came across a technique for distilling benchmark measurements into a single number that takes into account the fact that some benchmarks are more consistent/reliable than others, called Hierarchical Performance Testing (HPT). There is an implementation (in bash!!!) for the PARSEC benchmark suite. I ported it to Python and ran it over the big Faster CPython data set.
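
To give a flavor of the approach, here is a heavily simplified sketch of the two-level idea as I understand it from the paper -- not my actual port. The names are made up, and scipy stands in for the rank tests that a dependency-free version would implement by hand: each benchmark gets its own rank-sum test comparing the two sets of timings, and a signed-rank test across the per-benchmark median speedups produces the overall reliability.

```python
# Heavily simplified illustration of the two-level HPT idea -- hypothetical
# names; scipy stands in for the rank tests a dependency-free port would
# implement by hand.
from statistics import median
from scipy.stats import mannwhitneyu, wilcoxon

def hpt_reliability(baseline, contender):
    """baseline/contender map benchmark name -> list of raw timings.

    Returns the overall reliability that `contender` is faster than
    `baseline`, plus a per-benchmark confidence.
    """
    speedups = []
    confidences = {}
    for name, base_times in baseline.items():
        new_times = contender[name]
        # Level 1: per-benchmark rank-sum test ("is this benchmark faster?").
        _, p = mannwhitneyu(new_times, base_times, alternative="less")
        confidences[name] = 1.0 - p
        speedups.append(median(base_times) / median(new_times))
    # Level 2: signed-rank test across benchmarks on the median speedups,
    # asking whether the typical speedup exceeds 1.0.
    _, p = wilcoxon([s - 1.0 for s in speedups], alternative="greater")
    return 1.0 - p, confidences
```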

The results are pretty useful -- for example, while a lot of the main specialization work in 3.11 has a reliability of 100%, some recent changes to the GC show a speed improvement with lower reliability, reflecting the fact that GC changes involve a lot more randomness (more moving parts and interactions with other things happening in the OS). I think this reliability number, along with the more stable "expected speedup at the 99th percentile", is a lot more useful for evaluating a change (especially a small one) than the geometric mean. I did not, however, see the massive 3.5x discrepancy between the 99th-percentile number and the geometric mean that was reported in the paper (on a different dataset).

Is there interest in adding this metric to the output of pyperf's compare_to command?

vstinner commented 1 year ago

In pyperf, I tried to give users the choice of how to display the data rather than making that decision for them. That's why it stores all timings, not just min/avg/max. If there is a way to render the data differently without losing the old way, why not. The implementation looks quite complicated, though.

mdboom commented 1 year ago

Yes, to be clear, this wouldn't change how the raw data is stored in the .json files at all -- in fact, it's because all of the raw data is retained that this can easily be computed after data collection.
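
For example, the per-benchmark samples that HPT needs can be pulled straight out of existing result files with the public API (file names here are just placeholders):

```python
import pyperf

# Every raw timing is already in the .json files, so the samples HPT needs
# can be extracted after the fact.
suite = pyperf.BenchmarkSuite.load("baseline.json")
samples = {
    bench.get_name(): list(bench.get_values())
    for bench in suite.get_benchmarks()
}
```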

I would suggest adding a flag (e.g. --hpt) to the compare_to command that would add the values from HPT to the bottom of the report. Does that make sense? If so, I'll work up a PR. My current implementation uses Numpy, but for pyperf it's probably best not to add that as a dependency. A pure Python implementation shouldn't be unusably slow (it's not very heavy computation).
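
On the CLI side, something like the following is what I have in mind -- purely illustrative, a stand-alone argparse sketch rather than pyperf's actual option wiring:

```python
import argparse

# Stand-alone sketch of an opt-in flag; pyperf's real compare_to wiring
# differs, this just shows the intended user-facing behavior.
parser = argparse.ArgumentParser(prog="pyperf compare_to")
parser.add_argument("baseline")
parser.add_argument("changed")
parser.add_argument(
    "--hpt",
    action="store_true",
    help="append HPT statistics (reliability and speedup at the "
         "99th percentile) below the normal comparison",
)

args = parser.parse_args(["base.json", "new.json", "--hpt"])
if args.hpt:
    # Compute and render the HPT section here (e.g. with something like
    # hpt_reliability() from the earlier sketch) and append it to the report.
    print("HPT section would be appended here")
```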

vstinner commented 1 year ago

If it's a new option and it doesn't change the default, I'm fine with it. The problem is just how to explain it in the doc, briefly and with simple words 😬