make benchmarks more stable

rust-lang / rust

make benchmarks more stable #77661

Closed GopherJ closed 2 years ago

GopherJ commented 3 years ago

Describe the problem you are trying to solve

Currently cargo bench isn't very stable: it doesn't run long enough, and the results can vary a lot (20-30%), which makes it hard to know whether there really is a regression.

Describe the solution you'd like

No, sorry.

Notes

Eh2406 commented 3 years ago

cargo bench is a wrapper around functionality in rustc. If you want to change the behavior, https://github.com/rust-lang/rust is probably a better place to discuss it. On the other hand, the bench feature is unstable precisely because it is not robust or flexible enough. There is work in rustc to make it more of a plugin system. The recommendation at this time is to use https://crates.io/crates/criterion.

ehuss commented 3 years ago

Transferred to the rust-lang/rust repository, as that is where the libtest harness lives. Unfortunately, I don't think it is likely that much work will be done on libtest's benchmarking, as its future is currently uncertain (see #29553 and #66287). You will likely get better support from external benchmarking frameworks like criterion.

the8472 commented 3 years ago

#[bench] measures iterations per wall-time interval, more or less. So if you don't want to switch to a different benchmark crate that supports instruction counting or does more sophisticated analysis, you'll have to bring your system into a state that causes less variance, i.e. shut down background tasks, disable CPU clock boosting, and check for thermal throttling, which is often a problem when benchmarking on laptops.
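To make the wall-time point concrete, here is a minimal std-only sketch of that style of measurement. It is not libtest's actual implementation; the helper names (`sample_ns`, `median`, `measure`) and the sample and iteration counts are made up for illustration. It times a closure in batches, takes the median, and reports how far apart the fastest and slowest samples are.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `iters` calls of `f` and return the average wall time per call in ns.
/// (Hypothetical helper, not part of libtest.)
fn sample_ns<F: FnMut()>(f: &mut F, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Median of the samples; sorts the slice in place.
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_by(|a, b| a.total_cmp(b));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 0 {
        (samples[mid - 1] + samples[mid]) / 2.0
    } else {
        samples[mid]
    }
}

/// Collect `n_samples` timings and return (median ns/iter, relative spread),
/// where spread = (slowest - fastest) / median.
fn measure<F: FnMut()>(mut f: F, n_samples: usize, iters: u32) -> (f64, f64) {
    let mut samples: Vec<f64> = (0..n_samples)
        .map(|_| sample_ns(&mut f, iters))
        .collect();
    let med = median(&mut samples); // `samples` is sorted after this call
    let spread = (samples[samples.len() - 1] - samples[0]) / med;
    (med, spread)
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    // black_box keeps the optimizer from deleting the work being timed.
    let (med, spread) = measure(
        || {
            black_box(black_box(&data).iter().sum::<u64>());
        },
        30,    // number of samples (arbitrary choice)
        1_000, // iterations per sample (arbitrary choice)
    );
    println!("median: {:.1} ns/iter, spread: {:.1}%", med, spread * 100.0);
}
```

Running this a few times on a busy laptop versus an idle machine with clock boosting disabled should show the spread moving in the way described above, since every source of scheduling or frequency noise lands directly in the wall-clock numbers.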

GopherJ commented 3 years ago

@the8472 even with that, the results can change a lot :)

the8472 commented 3 years ago

At least in the Vec-related things I have been working on recently, I have seen variances for a null run in the 2-10% range, with two outliers around 20% (among dozens of benchmarks). But those are pure CPU/memory throughput benchmarks. If you start doing syscalls or even randomized allocations, things will become noisier.
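One way to use null-run numbers like these is as a noise floor: run the same code twice, record how much the results differ, and only treat a change as real if it exceeds that. The sketch below assumes this approach; the names (`relative_change`, `noise_floor`, `classify`, `Verdict`) and all timings in `main` are hypothetical, not measurements from this thread.

```rust
/// Relative change of `new_ns` against `old_ns` (0.10 means 10% slower).
fn relative_change(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns
}

/// Largest absolute relative change seen across null runs, i.e. pairs of
/// timings taken from the *same* code. (Hypothetical helper.)
fn noise_floor(null_runs: &[(f64, f64)]) -> f64 {
    null_runs
        .iter()
        .map(|&(a, b)| relative_change(a, b).abs())
        .fold(0.0, f64::max)
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Regression,
    Improvement,
    WithinNoise,
}

/// Classify a before/after pair against a previously measured noise floor.
fn classify(old_ns: f64, new_ns: f64, floor: f64) -> Verdict {
    let change = relative_change(old_ns, new_ns);
    if change > floor {
        Verdict::Regression
    } else if change < -floor {
        Verdict::Improvement
    } else {
        Verdict::WithinNoise
    }
}

fn main() {
    // Made-up null-run timings in ns/iter, for illustration only.
    let null_runs = [(100.0, 102.0), (200.0, 190.0), (50.0, 54.0)];
    let floor = noise_floor(&null_runs);
    println!("noise floor: {:.1}%", floor * 100.0);

    // Made-up before/after timings for two benchmarks.
    println!("{:?}", classify(100.0, 105.0, floor));
    println!("{:?}", classify(100.0, 125.0, floor));
}
```

Taking the maximum over null runs is a deliberately conservative choice: it lets a single outlier raise the bar for everything, which costs sensitivity but avoids chasing regressions that are only noise.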

Mark-Simulacrum commented 2 years ago

I'm going to go ahead and close this issue, as it seems to me that it's largely a consequence of the overall bench design (wall time rather than instruction counts, for example), which seems unlikely to get much more sophisticated inside the standard library. And, realistically, unless you're doing software emulation of some kind, most larger benchmarks will have some amount of uncertainty, especially if they involve syscalls or the like.