nvzqz / divan

Fast and simple benchmarking for Rust projects
https://nikolaivazquez.com/blog/divan/
Apache License 2.0

Fluctuating CPU MHz makes `tsc` profiling lie #23

Closed: nakedible closed this issue 7 months ago

nakedible commented 8 months ago

This isn't a fully debugged issue report, nor do I have a clean reproduction recipe yet...

Sometimes when I benchmark while forcing divan to use the TSC, I get results where the reported timer precision is 20 ns (this is GitHub Codespaces, so it's extremely heavily virtualized):

$ cargo bench --bench basic -- --timer tsc --sample-size 81920
    Finished bench [optimized] target(s) in 0.02s
     Running benches/basic.rs (target/release/deps/basic-2115956f1e3e5fd7)
Timer precision: 20.03 ns
basic               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_bounds       0.306 ns      │ 0.564 ns      │ 0.306 ns      │ 0.322 ns      │ 100     │ 8192000
├─ get_cold         0.306 ns      │ 0.505 ns      │ 0.306 ns      │ 0.313 ns      │ 100     │ 8192000
├─ get_sat          0.307 ns      │ 0.697 ns      │ 0.307 ns      │ 0.315 ns      │ 100     │ 8192000
├─ get_unreachable  0.306 ns      │ 0.924 ns      │ 0.338 ns      │ 0.407 ns      │ 100     │ 8192000
├─ get_unsafe       0.306 ns      │ 0.468 ns      │ 0.306 ns      │ 0.308 ns      │ 100     │ 8192000
├─ harp             1.847 ns      │ 3.775 ns      │ 1.848 ns      │ 2.088 ns      │ 100     │ 8192000
╰─ wrapping_add     1.848 ns      │ 2.068 ns      │ 1.848 ns      │ 1.862 ns      │ 100     │ 8192000
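
The actual benches/basic.rs isn't attached here, so for context, a minimal divan bench of the same shape might look roughly like the sketch below (real divan API; the function body is a hypothetical stand-in, since the real bench bodies such as get_bounds and harp aren't shown in this issue):

use divan::black_box;

fn main() {
    // Collect and run all registered #[divan::bench] functions.
    divan::main();
}

// Hypothetical nanosecond-scale body; divan black-boxes the returned value.
#[divan::bench]
fn wrapping_add() -> u64 {
    black_box(1u64).wrapping_add(black_box(2u64))
}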

However, every now and then, divan determines the timer precision to be 29.85 ns, in which case I see wildly differing values:

$ cargo bench --bench basic -- --timer tsc --sample-size 81920
    Finished bench [optimized] target(s) in 0.05s
     Running benches/basic.rs (target/release/deps/basic-2115956f1e3e5fd7)
Timer precision: 29.85 ns
basic               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_bounds       0.228 ns      │ 0.364 ns      │ 0.228 ns      │ 0.237 ns      │ 100     │ 8192000
├─ get_cold         0.228 ns      │ 0.823 ns      │ 0.228 ns      │ 0.238 ns      │ 100     │ 8192000
├─ get_sat          0.229 ns      │ 0.354 ns      │ 0.229 ns      │ 0.232 ns      │ 100     │ 8192000
├─ get_unreachable  0.228 ns      │ 0.502 ns      │ 0.26 ns       │ 0.26 ns       │ 100     │ 8192000
├─ get_unsafe       0.228 ns      │ 1.162 ns      │ 0.228 ns      │ 0.269 ns      │ 100     │ 8192000
├─ harp             1.769 ns      │ 3.097 ns      │ 1.892 ns      │ 1.972 ns      │ 100     │ 8192000
╰─ wrapping_add     1.77 ns       │ 3.783 ns      │ 1.77 ns       │ 1.97 ns       │ 100     │ 8192000

This also happens if I don't specify a sample size, but I pinned the sample size for this example so that it's clear the difference isn't caused by a different iters value.

This is an extreme microbenchmark, but I think I was seeing something similar on larger benchmarks as well.

Actually, while writing this, I think I might've figured out the issue – different CPUs on this system run at different MHz, and the frequencies seem to be constantly changing.

Not sure what divan can do about that :-)

$ cat /proc/cpuinfo  | grep MHz
cpu MHz         : 3242.971
cpu MHz         : 2785.335
cpu MHz         : 3234.546
cpu MHz         : 3245.406
cpu MHz         : 2913.887
cpu MHz         : 2933.662
cpu MHz         : 3096.559
cpu MHz         : 2445.426
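
To watch the frequencies changing over time rather than taking a single snapshot, standard Linux tooling is enough (nothing divan-specific):

$ watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"
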
vlovich commented 7 months ago

Yes, when benchmarking or profiling code it's not uncommon to override the CPU frequency governor to the performance governor to remove the "random" clock changes that normal governors make. That's less about the TSC and more about the fact that your code is actually taking a variable amount of time. And honestly, +/- 30% for something taking < 1 ns isn't surprising. At that point, the loop counter used for the repeated iteration within the timing code is going to be a meaningful part of the measurement. Running within a VM is going to be another contributing factor for measurement error.
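
For reference, on a bare-metal Linux box where you control the governor (likely not possible inside Codespaces), pinning it looks like this:

$ sudo cpupower frequency-set -g performance
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance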

nakedible commented 7 months ago

Ultimately I spent quite a while testing different configurations for this. What I need is benchmarks for something that runs in the nanosecond range, where the benchmark reliably tells whether a new version is faster or slower than the old one; by reliably I mean at most 1 mistake in 20, but preferably 1 in 100.

I tried a lot of options with divan, also on a quiet desktop PC, but I was unable to achieve such reliable benchmarks. In the end, criterion ended up being the better choice for me, even though the benchmarks are really slow.
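
For comparison, a minimal criterion benchmark of the same kind might look roughly like the sketch below (standard criterion API; the body is a hypothetical stand-in). Criterion's slowness comes largely from the several seconds of warm-up and measurement it spends per benchmark by default, which is also what feeds its statistical analysis:

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical nanosecond-scale body, similar in spirit to the divan benches above.
fn wrapping_add(c: &mut Criterion) {
    c.bench_function("wrapping_add", |b| {
        b.iter(|| black_box(1u64).wrapping_add(black_box(2u64)))
    });
}

criterion_group!(benches, wrapping_add);
criterion_main!(benches);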

I don't know if there's any sense in keeping this "bug" open, so I'll close it.

nvzqz commented 7 months ago

I tried a lot of options with divan, also on a quiet desktop PC, but I was unable to achieve such reliable benchmarks.

So is this issue not just with the TSC then?

vlovich commented 7 months ago

I don't think so. For what it's worth, I have observed result instability as well. I suspect criterion has similar behavior, and the main differentiating factor is the statistical analysis done within that crate more than the fact that it iterates so much.

nakedible commented 7 months ago

Yes, it also happens without the TSC. But I need to spend a bit more time on this so I can actually give reasonable comments and not just guesswork.