mdboom opened this issue 1 year ago
In pyperf, I tried to give the user the choice of how to display the data rather than making decisions for them. That's why it stores all timings, not just min / avg / max. If there is a way to render the data differently without losing the old way, why not? The implementation looks quite complicated, though.
Yes, to be clear, this wouldn't change how the raw data is stored in the .json files at all -- in fact, it's because all of the raw data is retained that this can easily be computed after data collection.
I would suggest adding a flag (e.g. `--hpt`) to the `compare_to` command that would add the HPT values to the bottom of the report. Does that make sense? If so, I'll work up a PR. My current implementation uses NumPy, but for pyperf it's probably best not to add that as a dependency. A pure Python implementation shouldn't be unusably slow (it's not very heavy computation).
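For a sense of scale, the per-benchmark piece is essentially a pairwise comparison of the two sets of timings. Something along these lines would work in pure Python (this is just a sketch with placeholder names, not the actual code from my branch -- a Wilcoxon rank-sum style test with a normal approximation and no tie correction):

```python
import math

def confidence_a_faster(timings_a, timings_b):
    """Approximate confidence that A is faster than B for one benchmark.

    Counts, over all pairs of runs, how often a run of A beat a run of B
    (ties counted as half), then converts that Mann-Whitney U statistic
    to a probability with the usual normal approximation.
    """
    n1, n2 = len(timings_a), len(timings_b)
    u = 0.0
    for a in timings_a:
        for b in timings_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    mean = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if sigma == 0.0:
        return 0.5  # not enough data to say anything
    z = (u - mean) / sigma
    # Standard normal CDF via erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

It's quadratic in the number of runs, but pyperf's run counts are small enough that this is nowhere near a bottleneck.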
If it's a new option, and it doesn't change the default, I'm fine with it. The problem is just how to explain it in the docs, briefly and in simple words 😬
I recently came across a technique called Hierarchical Performance Testing (HPT) for distilling benchmark measurements into a single number while taking into account the fact that some benchmarks are more consistent/reliable than others. There is an implementation (in bash!!!) for the PARSEC benchmark suite. I ported it to Python and ran it over the big Faster CPython data set.
The results are pretty useful -- for example, while a lot of the main specialization work in 3.11 has a reliability of 100%, some recent changes to the GC show a speed improvement with lower reliability, reflecting the fact that GC changes involve a lot more randomness (more moving parts and more interaction with other things happening in the OS). I think this reliability number, along with the more stable "expected speedup at the 99th percentile", is a lot more useful for evaluating a change (especially a small one) than the geometric mean. I did not, however, see the massive 3.5x discrepancy between the 99th-percentile number and the geometric mean that the paper reports (on a different dataset).
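To make the cross-benchmark step concrete, here is a rough sketch of the kind of aggregation HPT describes: a Wilcoxon signed-rank test over per-benchmark median speedups, plus a simple search for the largest uniform speedup that still holds at a given confidence. The function names, the normal approximation, and the linear search are my own simplification for illustration, not the PARSEC scripts or my port:

```python
import math

def _signed_rank_confidence(speedups):
    """Confidence that the new build is faster overall, given one median
    speedup per benchmark (speedup > 1.0 means faster).  Wilcoxon
    signed-rank test with a normal approximation; zeros dropped, no tie
    correction."""
    diffs = [s - 1.0 for s in speedups if s != 1.0]
    n = len(diffs)
    if n == 0:
        return 0.5
    ranked = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    mean = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def speedup_at_confidence(speedups, confidence=0.99, step=0.001):
    """Largest uniform speedup gamma such that 'at least gamma times
    faster' still holds at the requested confidence.  Returns 1.0 if no
    speedup can be claimed at that confidence."""
    gamma = 1.0
    while _signed_rank_confidence([s / (gamma + step) for s in speedups]) >= confidence:
        gamma += step
    return gamma
```

The reliability number in the report would be the overall confidence at gamma = 1.0, and the "speedup at the 99th percentile" the value returned by the search.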
Is there interest in adding this metric to the output of pyperf's `compare_to` command?