ionelmc / pytest-benchmark

pytest fixture for benchmarking code

Check for statistical significance of changes #108

Open JustinTervala opened 6 years ago

JustinTervala commented 6 years ago

I think it would be nice not only to know whether the mean and other related stats changed between one version and another, but also whether that difference is statistically significant. Due to the highly skewed nature of most timing data, a standard t-test would be invalid, so I think a Mann-Whitney U test would be appropriate. I just found this project and I'm still looking through the code base to figure out where I could implement the test, but I would need access to all the timing data from the previous run. Some pointers on where to start would be helpful.

The Mann-Whitney U test is implemented in scipy, but that would be a non-trivial dependency to add to the project. It's not particularly difficult to implement, but writing our own would, of course, be more code to maintain.
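
For reference, a minimal sketch of what the comparison could look like with scipy (the timing values here are made up, purely to illustrate the call):

```python
# Illustrative sketch only: applies scipy's Mann-Whitney U test to two sets
# of raw timings; the sample arrays below are invented, not real benchmark data.
from scipy.stats import mannwhitneyu

# Hypothetical raw timings (in seconds) from two runs of the same benchmark.
timings_old = [0.0102, 0.0100, 0.0105, 0.0098, 0.0131, 0.0101]
timings_new = [0.0094, 0.0092, 0.0095, 0.0090, 0.0118, 0.0093]

# Two-sided test: are the two samples drawn from different distributions?
statistic, p_value = mannwhitneyu(timings_old, timings_new, alternative="two-sided")

if p_value < 0.05:
    print(f"difference is statistically significant (p={p_value:.4f})")
else:
    print(f"no significant difference detected (p={p_value:.4f})")
```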

ionelmc commented 5 years ago

As I understood the page, the U test is designed to compare samples that have the same distribution. This seems like it would be better served by a hook (e.g. you could implement a hook in your conftest.py and use scipy, and other people could use a t-test or whatever they prefer).

The tricky part would be how to display the results - this plugin compares aggregates like min/max/avg and not the raw stats.

Also, there's the constraint that you can only compare exactly two runs - this plugin allows comparing any number of runs.

@JustinTervala I wonder, since you have the JSON data (and you can tell the plugin to dump all the raw timings with --benchmark-save-data), why don't you just implement this specific comparison yourself?
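
Something like the following rough sketch could work as a standalone script, assuming the JSON saved with --benchmark-save-data keeps the raw timings under each benchmark's stats["data"] (the file paths below are placeholders for two saved runs):

```python
# Rough out-of-band comparison sketch. Assumes the JSON produced with
# --benchmark-save-data stores the raw timings under stats["data"];
# the file paths are placeholders, not real saved runs.
import json
from scipy.stats import mannwhitneyu


def load_timings(path):
    """Map benchmark name -> list of raw timings from one saved JSON file."""
    with open(path) as f:
        saved = json.load(f)
    return {bench["name"]: bench["stats"]["data"] for bench in saved["benchmarks"]}


old = load_timings(".benchmarks/Linux-CPython-3.11-64bit/0001_before.json")
new = load_timings(".benchmarks/Linux-CPython-3.11-64bit/0002_after.json")

# Compare only the benchmarks present in both runs.
for name in sorted(old.keys() & new.keys()):
    stat, p = mannwhitneyu(old[name], new[name], alternative="two-sided")
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: U={stat:.1f}  p={p:.4f}  ({verdict})")
```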