Handle benchmarking warmup runs better

antiagainst commented 2 years ago

Right now for each benchmark, we run 10 repetitions to collect statistics. For each repetition, we use Google Benchmark's warmup mechanisms to deduce how many iterations should run, which gives Google Benchmark confidence that the numbers are stable and representative. However, Google Benchmarks are more geared towards microbenchmarks. For full model benchmarks we have, it can cause Google Benchmark to believe one iteration will suffice. Only a single iteration can be problematic as it might just be a cold start run that happens to take longer time. So even if we have 10 repetitions, they can all be cold start runs, which skews the result. So we might want to handle warmups more explicitly.

GMNGeoffrey commented 2 years ago

How could 10 repetitions all be cold starts? I'd think that any warmup should last through repetitions?

benvanik commented 2 years ago

I haven't looked at a trace with repetitions > 1, but does the benchmark state get reinitialized each repetition? I'd assume yes, as otherwise there'd be no difference between repetitions and more iterations - in which case Lei is right: not all warmup benefits (especially in GPU cases) last across full initializations (when everything gets torn down all the caches get torn down as well, the GPU resets all its internal program state, etc).

benvanik commented 2 years ago

(I'm now somehow curious - going to lazily test this now by just running the bots with repetitions in traces :)

antiagainst commented 2 years ago

Good question, Geoffrey! I haven't looked into all the details of how Google Benchmark's handles repetitions and iterations; I assume they are different as Ben said regarding initialization and what's shared between runs. But I could be wrong there. I'll also dig a bit more on this. :)

antiagainst commented 2 years ago

Though even if only the very first is a cold run, it can still skew the result if we only have 10 runs in total. The extent depends on how "cold" it is for sure.

antiagainst commented 2 years ago

I did a quick iree-benchmark-module run with export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump, which dumps all Vulkan API calls. So with repetitions 10, I'm only seeing one vkCreateDevice/vkDestroyDevice pair. So at least we aren't tearing down the whole Vulkan instance/device each time. Will wait Ben to tell how the traces say. :)

benvanik commented 2 years ago

aww the benchmarks timed out after 1hr ;( will try with fewer reps

GMNGeoffrey commented 2 years ago

aww the benchmarks timed out after 1hr ;( will try with fewer reps

You can increase the timeout in the http://google3/third_party/iree/build_tools/buildkite/cmake/android/arm64-v8a/benchmark2.yml file. Could also delete a bunch of the benchmark configurations from https://github.com/google/iree/blob/main/benchmarks/TFLite/CMakeLists.txt so it doesn't run those

benvanik commented 2 years ago

Results here: https://buildkite.com/iree/iree-benchmark/builds/1639#00ac1aa6-b799-463a-bb95-6b2209a65e56 Looks like repetitions are fine - no need to rewarm! (the center divide - or the "frames" if lookin in tracy, are the splits between repetitions)

benvanik commented 2 years ago

this also helped me find a bug in the benchmark tool that causes the benchmark name to die before the second repetition:

whenever I get around to switching it to the C API I'll make sure that's fixed - it's currently passing std::string storage to benchmark::RegisterBenchmark, and unfortunately benchmark doesn't retain the strings: https://github.com/google/iree/blob/89e9530b79e7946e04ccde9b0ac226a3f98dbffa/iree/tools/iree-benchmark-module-main.cc#L111

GMNGeoffrey commented 2 years ago

Thanks for the investigation! That does mean that the traced run won't be a warm run though. Is that ok?

Another thing we should consider to deal with outlier slow runs is reporting percentiles instead/as well as the mean. Sean made a compelling case for using the minimum time for benchmarking, since that's the one that's most reliable and includes the least noise (CPU decides to just make you wait for 10ms). I would be a bit leery of only measuring that, since we care about tail latency as well, but it seems reasonable to report that (and it doesn't cost us anything extra). Note that we can really only do the minimum average over 1 seconds-worth of runs. Could also use 10th percentile 50th percentile (median), and 90th percentile. Benchmark gives us mean, median, stddev, and coefficient of variation by default, but we could add other statistics pretty easily (https://github.com/google/benchmark/blob/main/docs/user_guide.md#custom-statistics). I think I like that better than trying to figure out how much warmup to do. Doing a single warmup run should be fine, but anything beyond that and I think we should turn to statistics instead.

benvanik commented 2 years ago

I think it'd definitely be worth having 2 reps in the tracy ones given how wildly the warmup can differ from sustained: I only had 7mins until the timeout on that last run with 2 reps though so maybe after some of the benchmarks have been pruned :)

Also big thumbs up to having min time as well as percentiles knowing what a device can do is really useful (alongside what it often does) - depending on the phase of the moon there can be swings of 10's of % or more between min and mean.

antiagainst commented 2 years ago

Thanks for the details! Interesting. So repetitions and iterations are basically the same thing under the hood, other than the fact that repetitions are used for collecting statistics..

Also +1 to min time. For other percentiles, I think we may want to defer until that's really proven needed. To make them useful we need more iterations (like in 10 repetitions each 10% percentile only accounts for 1 sample); we are already having issues with each benchmark job takes too long right now. This does mean another x2 over the current benchmark series though in Dana. Fun time. :)

antiagainst commented 2 years ago

Actually even for min time, I'm wondering how much it differs from average time after we pin down both CPU/GPU frequency. My feeling is for the GPU side the numbers are quite stable after that. Worth evaluating once we have Pixel 6 set up. :)

benvanik commented 2 years ago

which devices? from what I see now the variance can still be quite high for across every iteration, but I can't keep straight what y'all are fiddling with :)

antiagainst commented 2 years ago

I should reorder these two sentences:

My feeling is for the GPU side the numbers are quite stable after that. Worth evaluating once we have Pixel 6 set up.

into

Worth evaluating once we have Pixel 6 set up. My feeling is for the GPU side the numbers are quite stable after that.

Sorry, messed up with availability / visibility. ;-P

benvanik commented 2 years ago

Gotcha. My hope is that we have more than just one or two android devices at some point, and not everything we want to run will have the same configuration ability (like desktop GPUs, or Apple devices, etc) - so I wouldn't rule out min time :)

GMNGeoffrey commented 2 years ago

Yeah the main issue with collecting these is the UI organization. This is something that would be very natural to collect if everything went first into a database and then we could decide what to actually surface and do regression analysis on based on usefulness. Right now those are too tightly coupled.

Actually even for min time, I'm wondering how much it differs from average time after we pin down both CPU/GPU frequency

Just spot checking I actually haven't seen significant stability improvements in the CPU numbers after I pinned down the frequency. e.g. here's MobileBert:

mobilebert

That doesn't look like any difference to me. I guess it's not implausible that we were already effectively hitting max frequency naturally by giving it a lot of work and having it on a cooling plate.

iree-org / iree

Handle benchmarking warmup runs better #7795