nvzqz / divan

Fast and simple benchmarking for Rust projects
https://nikolaivazquez.com/blog/divan/

Estimating timer accuracy #1


vlovich commented 9 months ago

Had a thought about this and was wondering what you think:

> Divan does not use timer accuracy because it wasn’t clear how accuracy can be obtained without a more accurate reference timer, when `Instant` is usually implemented with the most accurate timer. I’m open to making sample size scaling smarter, but the current approach works well enough.

I don't have the idea fully realized, but I'm wondering if this approach might work:

  1. Benchmark how long it takes to do roughly a clock frequency's worth of additions (e.g. if the CPU runs at 6 GHz, do 6 billion additions). You don't actually need to retrieve the clock frequency unless you want to provide better bounds on how long the measurement takes. This gives us an estimate for how long a single addition takes. By picking a large number, we can make the overhead of the coarser clock arbitrarily insignificant (e.g. if its overhead is 1 µs and we measured 1 s worth of additions, our estimate of 1 addition is accurate to within 0.0001%).
  2. Sample the inaccurate clock only every N iterations of the loop, choosing N such that the time for N additions keeps you within the accuracy you want out of that clock (e.g. if an addition takes 1 ns and RDTSC takes 41 ns, 1 million iterations through the loop should get us to within 1.68 ns of clock accuracy). That way you're only sampling the timer infrequently while still getting a higher-resolution estimate across more samples; see the sketch after this list.
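Here's a minimal sketch of both steps, assuming `Instant` is the trusted reference clock; the function names, the `black_box` trick, and the error model (timer cost amortized over the batch) are my own illustration, not anything divan implements today:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Step 1: estimate the cost of one dependent addition by timing a large
/// batch with the trusted (but comparatively coarse) `Instant` clock.
/// For large `n`, the clock's fixed overhead becomes arbitrarily small
/// relative to the total.
fn estimate_add_cost(n: u64) -> f64 {
    let mut acc = 0u64;
    let start = Instant::now();
    for i in 0..n {
        // `black_box` keeps the compiler from folding the loop away;
        // `wrapping_add` keeps each addition dependent on the last.
        acc = black_box(acc.wrapping_add(i));
    }
    let per_add = start.elapsed().as_secs_f64() / n as f64;
    black_box(acc);
    per_add
}

/// Step 2: pick how many iterations to run between timer samples so the
/// timer's own cost, amortized over the batch, stays below `rel_error`
/// of the measured work (e.g. 0.01 for 1%).
fn iters_per_sample(add_cost_secs: f64, timer_cost_secs: f64, rel_error: f64) -> u64 {
    (timer_cost_secs / (add_cost_secs * rel_error)).ceil() as u64
}
```

With the numbers above (addition ≈ 1 ns, RDTSC ≈ 41 ns), `iters_per_sample(1e-9, 41e-9, 0.01)` suggests sampling about every 4,100 iterations for a 1% amortized overhead.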

Measuring RDTSC cost accurately in terms of `Instant` uses a similar approach: read `Instant::now`, read RDTSC, then nanosleep for 1 second. The `Instant` elapsed gives you 1 s + sleep jitter + `Instant` overhead + 2 × RDTSC overhead. Since nanosleep has a granularity of 100 ns, you should be able to reverse-compute the most likely values of the overheads from the actual value (i.e. round down to the nearest 100 ns and see if the answers make sense; if they're too fast for the RDTSC/`Instant` overhead estimates, subtract another 100 ns).
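A rough sketch of that measurement, assuming x86_64 (for `_rdtsc`) and using `std::thread::sleep` in place of a raw nanosleep call; how to split the residue into the individual overheads is left open, per the rounding idea above:

```rust
use std::time::{Duration, Instant};

#[cfg(target_arch = "x86_64")]
fn sleep_residue() -> Duration {
    use std::arch::x86_64::_rdtsc;

    let start = Instant::now();
    let tsc_before = unsafe { _rdtsc() }; // first RDTSC read
    std::thread::sleep(Duration::from_secs(1));
    let tsc_after = unsafe { _rdtsc() }; // second RDTSC read
    let elapsed = start.elapsed();

    // elapsed ≈ 1 s + sleep jitter + Instant overhead + 2 × RDTSC overhead,
    // so everything past the nominal 1 s is the residue to decompose
    // (e.g. by rounding down to the nearest 100 ns, as suggested above).
    std::hint::black_box(tsc_after.wrapping_sub(tsc_before));
    elapsed.saturating_sub(Duration::from_secs(1))
}
```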

It may be a good idea to pin the CPU affinity to a single core when doing this, to remove Linux kernel scheduling vagaries. Also, P/E cores might complicate things, but I think that needs to be solved at a different level (i.e. controlling the affinity of benchmarks running on these cores & knowing which core a benchmark is pinned to). For benchmarks, only using P cores may be a good default, with the user explicitly opting in if they want a benchmark run on E cores instead (or run on both P & E).
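Pinning could look like this sketch using the third-party `core_affinity` crate (my choice for illustration; divan doesn't depend on it), which simply pins the current thread to the first reported core:

```rust
/// Pin the current thread to one core before calibrating. Which core to
/// pick (P vs. E) is exactly the open question discussed above.
fn pin_to_one_core() {
    if let Some(ids) = core_affinity::get_core_ids() {
        if let Some(&first) = ids.first() {
            core_affinity::set_for_current(first);
        }
    }
}
```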

I'm not 100% sure about the math/logic here, but it seems to me like this approach should work. Also, I'll note that additions don't actually take 1 ns unless you have blocked the pipeline. For example, my CPU has 12 execution ports, which means it can actually do a dependency-free addition in 83 picoseconds. That's something else to be careful with when reasoning about benchmarking additions & counting loops; in the former, I think you want to benchmark dependent additions, because the loop around the benchmark is typically going to be a similarly dependent addition on the previous value of the loop counter. Obviously this stuff only matters for functions that take only a few nanoseconds to begin with.
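To make the latency-vs-throughput distinction concrete, here's an illustrative sketch; the actual speedup of the independent version depends entirely on the CPU's port count:

```rust
use std::hint::black_box;

/// Dependent chain: each addition needs the previous result, so this
/// measures addition *latency* (roughly one add per cycle).
fn dependent_adds(n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        acc = black_box(acc.wrapping_add(i));
    }
    acc
}

/// Four independent accumulators: the CPU can issue these in parallel
/// across its execution ports, so per-addition time here measures
/// *throughput* and can be well under one cycle.
fn independent_adds(n: u64) -> u64 {
    let (mut a, mut b, mut c, mut d) = (0u64, 0u64, 0u64, 0u64);
    for i in 0..n / 4 {
        let i = black_box(i); // blocks closed-form folding of the loop
        a = a.wrapping_add(i);
        b = b.wrapping_add(i);
        c = c.wrapping_add(i);
        d = d.wrapping_add(i);
    }
    black_box(a ^ b ^ c ^ d)
}
```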