(Micro)Benchmarking reliability and consistency

NthPortal commented 3 years ago

Motivation

A frequent request for scala/scala PRs (particularly collections changes) is that the changes be benchmarked; however, many obstacles exist for contributors running benchmarks on their personal computers, to the extent that many or perhaps most results would generously be classified as "questionable".

Background

The following are some common causes of performance/timing variance, and whether a particular type of machine avoids it.

	Laptop	Overclocked Desktop	Normally-clocked Desktop
Boost clock speed change	❌¹	❌	✔
Thermal throttling	❌	❌	❓²
Background tasks	❌	❌	❌

¹ While theoretically possible to turn off overclocking/boost-clocking on a laptop, the CPU may clock down due to even brief changes in battery/power state as well. ² A normally-clocked desktop with good ventilation and cooling shouldn't thermally throttle, but neither of those is a guarantee in a person's home (sometimes cats sit on computers, for example).

The only type of machine that avoids any of these issues is a normally-clocked desktop, and not everyone has one of those (many of us only have laptops).

Additionally, all personal computers suffer from the problem that there are almost certainly background tasks (if not foreground tasks) running on them at all times. Benchmarks can take a long time to run, and even if someone can manage to not use their computer for an hour or two while benchmarks run, they probably don't want to have to close their web browser, 3+ chat applications (that are all electron, so basically also web browsers), and half a dozen other running programs and services. If they can't spare potentially multiple hours of their computer being tied up, it's even worse, with foreground tasks taking arbitrary and inconsistent CPU time.

Ideal Setup

To have benchmarking be reliable, it should be done on a dedicated machine running nothing else, and where cron/scheduled jobs are never running while a benchmark is running.

How do we reliably benchmark library changes?

lrytz commented 3 years ago

Thanks for bringing this up! For more technical aspects, see also https://github.com/scala/scala-dev/issues/338.

We have one machine that we use for compiler benchmarks. It's not that busy, maybe we can find a good way to make it available to contributors; allow them to take the machine offline in Jenkins and ssh to it.

lrytz commented 3 years ago

@retronym says he'll look into this.

retronym commented 3 years ago

I spend some time trying to get our Jenkins instance to have a parameterized Job that could run specified benchmarks on our benchmarking server. Jenkins seems to actively resist this and wouldn't save my job configs and I had to park the attempt. I'll try again...

retronym commented 3 years ago

https://issues.jenkins.io/browse/JENKINS-64454 perhaps.

retronym commented 3 years ago

@adriaanm That Jenkins ticket mentions disabling the notifications plugin as a workaround -- after that the save/apply button actions on the job config UI worked again. I notice we're running 1.13 of the plugin but previous were running a custom build you'd created:

Can you provide context for that custom version? Is this something that you're working on now or something you worked on previously?

retronym commented 3 years ago

Better Jenkins ticket: https://issues.jenkins.io/browse/JENKINS-64254

https://issues.jenkins.io/browse/JENKINS-64072?focusedCommentId=401604&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-401604

Ichoran commented 3 years ago

There are a variety of ways to solve this experimentally instead of with quiet hardware. For instance, you can halve the number of iterations and run the whole thing twice. If any of the head-to-head comparisons aren't stable, you disbelieve the whole lot and do it all again. Usually, in my experience, they are pretty stable even on a laptop as long as you're not doing a million other things at the same time. (Watching video + compiling + benchmarking is probably a bad idea. Editing code and benchmarking is probably fine.)

You do always have to run the benchmarks head-to-head at roughly the same time and not expect them to be stable over days/months/whatever. If you're trying to search for performance regressions then you do want the quiet machine approach. But for regular PRs, I don't think it's necessary.

Note that a bigger problem is different architectures. It's often the case that code that is faster on one architecture is slower on the other. So you can have different people making different decisions about high-performance code based on accurate microbenchmarking on different hardware.

SethTisue commented 3 years ago

an older ticket with a bunch of benchmarking advice: https://github.com/scala/scala-dev/issues/606

scala / scala-dev