microsoft / xunit-performance

Provides extensions over xUnit to author performance tests.

Discuss: iteration "philosophy" #8

Closed · ericeil closed this issue 8 years ago

ericeil commented 9 years ago

Several other issues have mentioned how iterations should be handled, so I thought this warranted a dedicated issue.

The basic question is, for any given test, how many times should it be run? My thinking on this is as follows:

It should be trivial to author a simple perf test that yields useful results.

I should be able to write something like:

```csharp
[Benchmark]
void MyTest() { SomeMethod(); }
```

...and get a useful measure of that method's performance.

A useful result typically requires multiple iterations.

Performance, particularly in .NET, is inherently non-deterministic. The only way to get a useful idea of the performance of a given piece of code is to run it multiple times, and then use statistical techniques to characterize the data gathered.
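As a rough illustration of what "statistical techniques" can mean here (a minimal sketch, not xunit-performance's implementation): time the method repeatedly and summarize the samples with a mean and a confidence interval.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class SampleStats
{
    // Illustrative only: run `action` N times and report the mean elapsed time
    // with a rough 95% confidence interval (normal approximation).
    public static void Characterize(Action action, int iterations)
    {
        var samples = new double[iterations];
        var sw = new Stopwatch();

        for (int i = 0; i < iterations; i++)
        {
            sw.Restart();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        double mean = samples.Average();
        double variance = samples.Sum(s => (s - mean) * (s - mean)) / (iterations - 1);
        double stdErr = Math.Sqrt(variance / iterations);

        Console.WriteLine($"mean = {mean:F3} ms ± {1.96 * stdErr:F3} ms (95% CI, n = {iterations})");
    }
}
```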

It's not generally possible to predict the number of iterations required, ahead of time.

The statistical methods used determine how many "sample" measurements are needed, and this will vary depending on several factors:

a) The particular mix of operations in the test itself. Some methods are simple, single-threaded, CPU-bound, and therefore are not inherently non-deterministic, but most will end up allocating (involving the GC, which is non-deterministic), interacting with other threads, interacting with I/O devices, etc.

b) The state of the machine on which the test is running. It's nearly impossible to isolate the execution of a single process from everything else happening on/to the machine. For example, a sudden surge in network traffic may consume CPU resources for interrupt processing, making the test code seem to take longer for the duration of the surge. Or, there may be more physical memory usage in other processes on one machine vs. another, or at one time vs. another, causing the GC to collect more aggressively, temporarily.

The effects in (a) are largely predictable, though it may require a fair amount of experimentation to find an iteration count that consistently leads to useful results. However, the effects in (b) are completely unpredictable, especially in a system that needs to work reliably outside of the controlled environment of a dedicated performance lab.

Performance testing should be fast.

On the CLR team, we've traditionally compensated for the unpredictability in perf tests by hard-wiring them for a large number of iterations. Much of the time this ends up being a waste of (computer) time. If we have to run every test enough times that we could get a useful result if we were running in a noisy environment, we limit the number of tests we can run in a clean environment, in a given amount of time. Also, we've ended up running many relatively stable tests as if they were unstable, simply because it's a pain (and time-consuming) to go analyze every test case and determine the optimal number of iterations; so we run even the stable, predictable tests more than needed, further wasting machine time on every run.

The wasted machine time becomes a much larger issue when we start asking OSS contributors to run these tests on their own machines prior to submitting changes.

The above, to me, argues for a dynamic system that compensates for unpredictability at run-time, by default. We should be running the test until we have some confidence that we have a useful result, for that particular run. I think the above also argues that it's harmful to offer any kind of mechanism for hard-wiring particular iteration counts, as they will inevitably lead to tests that either don't produce useful results, or waste time producing overly-accurate results. I can see configuration of extreme upper limits, as "back-stops" against the run-time algorithms going astray, but I worry that those would be quickly, and quite accidentally, abused. A backstop might be better implemented as an overall execution time limit; if the test run exceeds that limit, it simply fails, rather than producing useless, but seemingly valid, results.
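To make the "run until we have confidence" idea concrete, here is a hedged sketch (the names, thresholds, and stopping rule are invented for illustration, not the library's algorithm): keep sampling until the confidence interval is tight relative to the mean, and treat the overall time budget as the back-stop that fails the run rather than returning questionable numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class AdaptiveRunner
{
    // Illustrative stopping rule: iterate until the 95% CI half-width for the
    // mean is within `relativeTolerance` of the mean, or the overall budget
    // (the "back-stop") runs out, in which case the run fails.
    public static IReadOnlyList<double> Run(Action benchmark,
                                            double relativeTolerance,
                                            TimeSpan backstop,
                                            int minIterations = 10)
    {
        var samples = new List<double>();
        var total = Stopwatch.StartNew();
        var sw = new Stopwatch();

        while (true)
        {
            sw.Restart();
            benchmark();
            sw.Stop();
            samples.Add(sw.Elapsed.TotalMilliseconds);

            if (samples.Count >= minIterations)
            {
                double mean = samples.Average();
                double variance = samples.Sum(s => (s - mean) * (s - mean)) / (samples.Count - 1);
                double halfWidth = 1.96 * Math.Sqrt(variance / samples.Count);
                if (halfWidth <= relativeTolerance * mean)
                    return samples;                    // confident enough for this run
            }

            if (total.Elapsed > backstop)
                throw new TimeoutException("Back-stop exceeded without reaching a stable result.");
        }
    }
}
```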

Of course, that's just one point of view. I'm eager to hear others.

AndyAyersMS commented 9 years ago

If you are going to start in on statistics to deal with noisy metrics, I would recommend looking at something like a bootstrap to estimate metrics and their confidence intervals from limited-size samples. For instance, you could increase iterations until the confidence interval of your key metric (median time, say) from repeated runs at that number of iterations is small enough for your liking. And if this doesn't work out, you'd know you have a noisy test with some uncontrolled influence.

Bootstrap can also be used to compute p-scores when comparing two sets of runs to see if there is a statistically significant regression or improvement in the metric. I have some sample code I could donate if you're interested.
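A compact sketch of both ideas (illustrative only, not the donated code mentioned above): a percentile bootstrap gives a confidence interval for the median of one set of runs, and bootstrapping the difference in medians between two sets of runs indicates whether a regression or improvement is meaningful, expressed here as a confidence interval on the difference rather than a p-score.

```csharp
using System;
using System.Linq;

static class Bootstrap
{
    // Percentile-bootstrap 95% confidence interval for the median of one sample set.
    public static (double Low, double High) MedianCI(double[] samples,
                                                     int resamples = 10_000, int seed = 1)
    {
        var rng = new Random(seed);
        var medians = new double[resamples];
        for (int r = 0; r < resamples; r++)
            medians[r] = Median(Resample(samples, rng));
        Array.Sort(medians);
        return (medians[(int)(0.025 * resamples)], medians[(int)(0.975 * resamples)]);
    }

    // Bootstrap 95% CI for the difference in medians (current - baseline).
    // If the interval excludes zero, the change is likely a real regression or improvement.
    public static (double Low, double High) MedianDifferenceCI(double[] baseline, double[] current,
                                                               int resamples = 10_000, int seed = 1)
    {
        var rng = new Random(seed);
        var diffs = new double[resamples];
        for (int r = 0; r < resamples; r++)
            diffs[r] = Median(Resample(current, rng)) - Median(Resample(baseline, rng));
        Array.Sort(diffs);
        return (diffs[(int)(0.025 * resamples)], diffs[(int)(0.975 * resamples)]);
    }

    static double[] Resample(double[] xs, Random rng) =>
        Enumerable.Range(0, xs.Length).Select(_ => xs[rng.Next(xs.Length)]).ToArray();

    static double Median(double[] xs)
    {
        var sorted = xs.OrderBy(x => x).ToArray();
        return sorted[sorted.Length / 2];
    }
}
```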

There has also been some good work on randomization to support statistically sound performance measurements, see for instance Curtsinger and Berger's STABILIZER, but this may be a bit more to bite off than you all had in mind.

pharring commented 9 years ago

Really interesting links, @AndyAyersMS. Thanks.

ericeil commented 9 years ago

Update on the current status of this: After playing around with the (very naïve) mechanism I'd built, I'm convinced that it's worse than a simple "run N times, or for X milliseconds, whichever is shorter" loop. I've changed it to just that for now.
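That loop is simple enough to sketch (illustrative, not the exact code now in the repo):

```csharp
using System;
using System.Diagnostics;

static class SimpleIterationLoop
{
    // "Run N times, or for X milliseconds, whichever is shorter."
    public static int Run(Action benchmark, int maxIterations, TimeSpan maxDuration)
    {
        var elapsed = Stopwatch.StartNew();
        int count = 0;

        while (count < maxIterations && elapsed.Elapsed < maxDuration)
        {
            benchmark();
            count++;
        }

        return count; // number of iterations that actually ran
    }
}
```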

The more I think and read about this (how many samples are "enough"), the more I'm realizing that the harness's inner loop really is not the place for this decision to be made. Nor should the tests themselves have direct control over the iteration strategy. This is properly a function of the goal of a particular test run. For example, if the purpose of a run is to characterize the absolute values of various Benchmark metrics, then the run will need to ensure that we have enough samples of each metric to yield high confidence in each metric/statistic being considered. In addition, the global state of the machine will need to be carefully controlled to reduce noise, bias, etc.

However, the more common case (for us anyway) is to do a run with the goal of comparing the performance of two different builds. In that case, we don't care much about the absolute performance of either build; we just want to know whether the changes in one build make it faster or slower than the other. The "experiment" in this case might be designed very differently. For example, we might make a series of rough comparisons between builds, alternating between the baseline and current builds, collecting only a few data points each time, which would allow analysis that minimizes the influence of other things happening on the machine. Any given "run" of the tests would, in this case, only be collecting a subset of the total samples used to make the final evaluation. And the total number of samples needed from one build would likely depend on the behavior of the other build, and vice versa. The inner loop of the test harness is unlikely to be able to make good decisions, on its own, about how much is "enough" in these cases. And the individual test case is of course completely clueless.
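As a sketch of that experiment design (the delegates standing in for "measure the baseline build" and "measure the current build" are hypothetical placeholders):

```csharp
using System;
using System.Collections.Generic;

static class InterleavedComparison
{
    // Hypothetical interleaved collection: alternate between the two builds,
    // taking a small batch of samples from each on every pass, so that
    // machine-wide noise affects both sample sets roughly equally.
    public static (List<double> Baseline, List<double> Current) Collect(
        Func<double> measureBaseline, Func<double> measureCurrent,
        int passes, int samplesPerPass)
    {
        var baseline = new List<double>();
        var current = new List<double>();

        for (int p = 0; p < passes; p++)
        {
            for (int i = 0; i < samplesPerPass; i++) baseline.Add(measureBaseline());
            for (int i = 0; i < samplesPerPass; i++) current.Add(measureCurrent());
        }

        return (baseline, current); // pooled samples for later analysis
    }
}
```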

From an API point of view, I think this implies that individual tests should not have tight control over iteration counts, strategies, etc. I do think it makes sense for them to contain declarative metadata about themselves, communicating things that may be useful to the engine that's driving the overall "experiment." For example, a test may declare that it allocates from the GC heap (which may or may not end up being useful information), but not that it should be run at least 1000 times.
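In API terms that might look something like the following; the AllocatesGCHeap attribute is invented here purely for illustration (it is not part of xunit-performance), and the Microsoft.Xunit.Performance namespace for [Benchmark] is assumed.

```csharp
using System;
using Microsoft.Xunit.Performance;

// Hypothetical marker attribute: the test declares a fact about itself
// (it allocates from the GC heap) without dictating an iteration count.
[AttributeUsage(AttributeTargets.Method)]
public sealed class AllocatesGCHeapAttribute : Attribute { }

public class MyBenchmarks
{
    [Benchmark]          // xunit-performance's attribute, as in the example above
    [AllocatesGCHeap]    // declarative hint for whatever engine drives the "experiment"
    public void MyTest()
    {
        SomeMethod();
    }

    static void SomeMethod() { /* code under measurement */ }
}
```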

pharring commented 9 years ago

Thanks for the write-up, @ericeil. I like treating a performance run as an experiment trying to validate or disprove a hypothesis, e.g. "these two builds have identical performance characteristics."

So, where does that leave us? What does the outer loop look like? So far, we've been thinking of it as a passive, fixed thing ("run test x, run test y") with post-run analysis. Now it needs to become more dynamic, analyzing and responding to outputs (a feedback loop) as well as inputs.

sajayantony commented 9 years ago

@pharring
@dmetzgar was recommending something like path length. Do we have the capability to look at CPU cycles or CPU time between two events in a process? I understand this would not help with GC variation, but it can definitely help identify whether the number of instructions has gone up for a given number of iterations.

dmetzgar commented 9 years ago

We used path length to overcome some of the unpredictability in Azure when performance testing services. Disk and network I/O could easily be affected by neighbors, and it was impossible to detect regressions from build to build. Also, in some data centers you get a mix of different CPU clock speeds on your VMs. By path length I mean we sampled CPU cycles for the target process before and after the test and divided that by the number of iterations. Native processes work pretty well; for managed processes you have to account for the GC, as was pointed out above.
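For anyone wanting to try this, one way to sample process CPU cycles on Windows is the kernel32 QueryProcessCycleTime API; a rough P/Invoke sketch of the "cycles before and after, divided by iterations" calculation (not the code referred to above):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class PathLength
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool QueryProcessCycleTime(IntPtr processHandle, out ulong cycleTime);

    // Approximate "path length": CPU cycles consumed by this process per iteration.
    public static double CyclesPerIteration(Action benchmark, int iterations)
    {
        IntPtr self = Process.GetCurrentProcess().Handle;

        QueryProcessCycleTime(self, out ulong before);
        for (int i = 0; i < iterations; i++)
            benchmark();
        QueryProcessCycleTime(self, out ulong after);

        return (after - before) / (double)iterations;
    }
}
```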

pharring commented 9 years ago

Thanks for the info, @dmetzgar. Can you share (offline if you like) the code you used to query CPU cycles accurately?