ordo-one / package-benchmark

Swift benchmark runner with many performance metrics and great CI support
Apache License 2.0

Incorrect CI PR benchmark comparison #240

Closed · ptoffy closed this issue 7 months ago

ptoffy commented 8 months ago

Hi,

while trying to set up benchmarking in CI for Vapor's JWTKit, I'm running into some issues I can't get to the bottom of. The main one is that the benchmark comparison between branches is measured wrongly; you can see an example here (it's a fork of the repo, to keep the main repo "clean" until I find a solution). That PR makes absolutely no changes to the benchmarked code, it's just a test README update, and as you can see the measurements are completely off. The same happens on the main repo too: https://github.com/vapor/jwt-kit/actions/runs/8492609046/job/23265790197?pr=152

The result is always either a benchmarkThresholdRegression or a benchmarkThresholdImprovement error, which is why I tried to increase the thresholds for the benchmarks over here; however, that doesn't seem to work, as the threshold in the results (see the first linked PR) is always 5. I also tried setting the thresholds on the individual benchmarks directly, but no luck.

It's not a caching issue or similar: I tried enabling and disabling various metrics and that part works (as you can see, the only thing enabled right now is throughput), so something else must be going wrong when the metrics are updated. Finally, here is the workflow we're using to benchmark PRs; we tried to follow the one in this repo.
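For concreteness, setting a threshold on a single benchmark looks roughly like this (a sketch based on package-benchmark's Benchmark.Configuration API; the benchmark name and the 15% relative threshold are illustrative, not what JWTKit actually uses):

// Sketch: loosen the default 5% relative p50 threshold to 15% for one benchmark.
Benchmark(
    "Signing",
    configuration: .init(
        metrics: [.throughput],
        thresholds: [
            .throughput: .init(relative: [.p50: 15.0])
        ]
    )
) { benchmark in
    for _ in benchmark.scaledIterations {
        // ... code under test ...
    }
}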

I've also considered that it might be solved by a private (self-hosted) GitHub runner. What do you think?

Thanks!

hassila commented 8 months ago

Thanks for the report. Unfortunately I'm away for a couple of weeks and won't have time to look into this until then; sorry for the delay.

vsarunas commented 8 months ago

Hi @ptoffy,

Yes, I believe it is mostly due to running on GitHub's CI infrastructure and its "noisy neighbours".

You can verify this locally by creating three baseline runs on the machine where development was done:

swift package --disable-sandbox benchmark baseline update run1 --target Signing
swift package --disable-sandbox benchmark baseline update run2 --target Signing
swift package --disable-sandbox benchmark baseline update run3 --target Signing

And comparing them:

swift package benchmark baseline compare run1 run2
swift package benchmark baseline compare run1 run3
swift package benchmark baseline compare run2 run3

For me, the metrics on macOS seem to vary less than on Linux, but in both cases the variation is far smaller than in the linked workflow results.

You may also find that increasing the number of samples, by increasing the maximum test duration, gives more stable results:

Benchmark.defaultConfiguration.maxDuration = .seconds(10)
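A fuller variant of the same idea (a sketch; the parameter names follow package-benchmark's Benchmark.Configuration initializer, and the values are illustrative, so tune them per target):

// Sketch: raise both the duration and iteration caps so more samples are taken.
Benchmark.defaultConfiguration = .init(
    metrics: [.throughput],
    warmupIterations: 10,
    maxDuration: .seconds(10),
    maxIterations: 100_000
)
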
hassila commented 7 months ago

Hi @ptoffy, yes, @vsarunas is correct: running time/throughput measurements on GitHub infrastructure isn't a good proposition, because the environment is simply not controlled enough (which is why we recommend a standalone runner for such metrics if you want them in CI). We are considering adding e.g. instruction count (in https://github.com/ordo-one/package-benchmark/issues/242), which could be a better fit, as it is in practice a proxy for runtime; whether we can get those stats on the runner remains to be seen. Other metrics such as malloc count and syscalls work fine on GitHub CI, though.

Closing this for now; feel free to open follow-up issues as needed.
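For example, restricting a benchmark to those CI-stable metrics would look roughly like this (a sketch; the benchmark name is hypothetical and the metric names follow package-benchmark's BenchmarkMetric):

// Sketch: count-based metrics such as mallocs and syscalls are deterministic
// enough to compare reliably on shared GitHub-hosted runners.
Benchmark(
    "Signing (CI-stable metrics)",
    configuration: .init(metrics: [.mallocCountTotal, .syscalls])
) { benchmark in
    for _ in benchmark.scaledIterations {
        // ... code under test ...
    }
}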