statechannels / statechannels

Development monorepo for the State Channels project.
https://statechannels.org

Inconsistent stress test results. #2606

Open · andrewgordstewart opened this issue 4 years ago

andrewgordstewart commented 4 years ago

Performance of post-review commits is a bit puzzling:

17f06b8 2441ms

bbecf7d 1231ms

(the numbers for 25 makePayment calls are very close to 25x the single makePayment call in both cases)

Originally posted by @geoknee in https://github.com/statechannels/statechannels/pull/2590#issuecomment-697271681
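For reference, the shape of such a benchmark is just repeated timed calls. A minimal sketch in TypeScript, where `makePayment` is a stand-in for the real client API and the simulated timings are illustrative, not the repo's actual benchmark code:

```typescript
// Time a single call, then 25 sequential calls, and compare the ratio.
async function timeIt(label: string, fn: () => Promise<void>): Promise<number> {
  const start = Date.now();
  await fn();
  const elapsed = Date.now() - start;
  console.log(`${label}: ${elapsed}ms`);
  return elapsed;
}

async function makePayment(): Promise<void> {
  // Placeholder for the real state-channel payment; simulate some async work.
  await new Promise<void>((resolve) => setTimeout(resolve, 50));
}

async function main(): Promise<void> {
  const single = await timeIt('1 makePayment', () => makePayment());
  const batch = await timeIt('25 makePayments', async () => {
    for (let i = 0; i < 25; i++) await makePayment();
  });
  // The observation above: batch time is very close to 25x the single call.
  console.log(`ratio: ${(batch / single).toFixed(1)}x`);
}

main();
```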

andrewgordstewart commented 4 years ago

My take on this is that 17f06b8 might have been an anomaly.

There appears to be significant variation in the run time of the stress test on CircleCI. When we ask CircleCI to run a benchmark, we have no control over the underlying resources, and it's not clear what guarantees CircleCI makes about what is available. Even though we pay for 8 cores, CircleCI might not always let us fully utilize those 8 cores; one explanation would be the use of spot instances, which lower prices at the cost of variable execution times.

Some circumstantial evidence to back up this theory:

  1. There is a huge variance in, e.g., the "Spin up environment" step of the test job.

  2. When I ran the stress test locally (while plugged in) before and after reverting 17f06b87e332216d71a9ca2e903f3f07f658b484, I saw a consistent time of ~30s to run the stress test.

  3. When I ran the stress test on @lalexgap's AWS instance (when comparing different numbers of worker threads), I observed repeatable times reported in the stress test.
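One way to put numbers on this theory: repeat the same benchmark several times on a given executor and compare the spread. A rough sketch, with `runStressTest` as a hypothetical wrapper around the actual stress test:

```typescript
// Placeholder: time one full run of the stress test.
async function runStressTest(): Promise<number> {
  const start = Date.now();
  // ... run the actual stress test here ...
  return Date.now() - start;
}

// Repeat the benchmark and report mean and standard deviation. On a noisy
// CI executor the spread should be large relative to the mean; on dedicated
// hardware it should be small.
async function measureVariance(runs: number): Promise<void> {
  const times: number[] = [];
  for (let i = 0; i < runs; i++) times.push(await runStressTest());

  const mean = times.reduce((a, b) => a + b, 0) / times.length;
  const variance = times.reduce((a, b) => a + (b - mean) ** 2, 0) / times.length;
  const stddev = Math.sqrt(variance);

  console.log(`mean: ${mean.toFixed(0)}ms, stddev: ${stddev.toFixed(0)}ms`);
  // A coefficient of variation above roughly 10% would suggest the executor,
  // not the code under test, dominates run-to-run differences.
  console.log(`coefficient of variation: ${((stddev / mean) * 100).toFixed(1)}%`);
}

measureVariance(10);
```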

geoknee commented 4 years ago

I think you accidentally linked to the same job.

Regardless, what you say seems very plausible to me. Therefore, for changes such as these, we should be running the stress test (et al.) on more predictable hardware (just a developer's laptop, say).

andrewgordstewart commented 4 years ago

One thing I have been thinking about since the web3torrent flickering tests is integrating some statistical analysis into our test suite. Two examples:

  1. Run a time-series regression on stress test performance to detect statistically likely performance regressions.
  2. Fit a simple time-series model to, e.g., an integration test job, so that if the job flickers, the model can suggest the commits most likely to have introduced the flickering bug.

(1) is much better resolved by ensuring that we have a consistent set of resources available to run the benchmark. In that case, a benchmark that runs for ~1 minute can generally be trusted to be accurate enough to enforce a hard cut-off for detecting performance regressions.
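A sketch of what that hard cut-off could look like, assuming consistent hardware; the baseline value and the 10% tolerance below are illustrative, not anything we've agreed on:

```typescript
const BASELINE_MS = 60_000; // checked-in baseline for the ~1 minute benchmark
const TOLERANCE = 0.1;      // allow 10% headroom for residual noise

// Fail the CI job if the measured time exceeds the baseline by more
// than the tolerance.
function checkRegression(measuredMs: number): void {
  const limit = BASELINE_MS * (1 + TOLERANCE);
  if (measuredMs > limit) {
    console.error(`regression: ${measuredMs}ms exceeds ${limit}ms cut-off`);
    process.exit(1);
  }
  console.log(`ok: ${measuredMs}ms within ${limit}ms cut-off`);
}
```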

(2) is probably too narrow a problem to be worth spending time on, but I wonder whether anyone uses this strategy.
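For what it's worth, a bare-bones version of (2) could treat the per-commit pass/fail history as a Bernoulli sequence and pick the single changepoint that best explains a shift in failure rate. A sketch (plain maximum likelihood, nothing from our codebase):

```typescript
// Log-likelihood of observing `fails` failures in `total` Bernoulli trials
// at the maximum-likelihood failure rate p = fails / total.
function logLikelihood(fails: number, total: number): number {
  if (total === 0) return 0;
  const p = fails / total;
  if (p === 0 || p === 1) return 0; // 0 * log(0) = 0 by convention
  return fails * Math.log(p) + (total - fails) * Math.log(1 - p);
}

// results[i] = true if the test failed (flickered) on commit i. Returns the
// index of the first commit after the most likely changepoint.
function likelyCulprit(results: boolean[]): number {
  let bestK = 0;
  let bestLL = -Infinity;
  for (let k = 1; k < results.length; k++) {
    const failsBefore = results.slice(0, k).filter(Boolean).length;
    const failsAfter = results.slice(k).filter(Boolean).length;
    const ll =
      logLikelihood(failsBefore, k) +
      logLikelihood(failsAfter, results.length - k);
    if (ll > bestLL) {
      bestLL = ll;
      bestK = k;
    }
  }
  return bestK;
}

// Example: flickering starts around the 6th commit (index 5).
console.log(likelyCulprit([false, false, false, false, false, true, false, true, true]));
```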

andrewgordstewart commented 4 years ago

> Regardless, what you say seems very plausible to me. Therefore, for changes such as these, we should be running the stress test (et al.) on more predictable hardware (just a developer's laptop, say).

It is probably worth requesting a budget for something like an appropriate, dedicated, on-demand EC2 instance for running benchmarking jobs.