shotover / shotover-proxy

L7 data-layer proxy
https://docs.shotover.io
Apache License 2.0

windsock: get stable results #1274

Open rukai opened 1 year ago

rukai commented 1 year ago

Running the same bench twice does not give stable results, neither locally nor on AWS. I'm focusing on AWS at the moment because that feels more important.

Network and disk IO do not go above 5MB, so it seems unlikely that we are hitting limits there.

I've noticed that shotover benches will be off by a certain % for the entire bench, while non-shotover benches will be based around 0.0% and then go up or down from there before returning to 0.0%.
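To make the comparison above concrete, here is a minimal sketch (my own illustration, not windsock's actual code) of expressing a measured throughput as a percent offset from a baseline run, so that repeated runs of a stable bench should cluster around 0.0%:

```rust
// Hypothetical helper: percent offset of a measured result from a baseline.
// A stable bench hovers near 0.0%; a bench with a constant bias sits at a
// fixed non-zero offset for the whole run.
fn percent_offset(baseline_ops: f64, measured_ops: f64) -> f64 {
    (measured_ops - baseline_ops) / baseline_ops * 100.0
}

fn main() {
    // A non-shotover-style bench hovering around its baseline:
    assert!(percent_offset(50_000.0, 50_250.0).abs() < 1.0);
    // A shotover-style bench offset by a constant amount:
    assert!((percent_offset(50_000.0, 47_500.0) - (-5.0)).abs() < 1e-9);
    println!("offsets computed");
}
```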

So it looks like shotover is introducing a second kind of noise. We should first address the noise that exists without shotover.

The cassandra benches have the bencher set up to use only 1 thread. Using 2 threads on an m6a.large instance seems to make the noise worse, but maybe using more threads on a larger instance would help?

Maybe I need a better idea of what people have historically found to be stable.

cassandra,compression=none,driver=scylla,operation=read_i64,protocol=v4,shotover=none,topology=single observations

I've now observed that latte is more consistent than windsock. Windsock seemed to consistently drop performance at 42-44s into the benchmark (resolved). Not sure if there were other differences in consistency observed.

I attempted to rewrite windsock's bencher to be more like latte's, but it did not help. Either I need to profile the bencher to find out what's going on, or I need to blindly try copying more logic from latte.

cassandra,compression=none,driver=scylla,operation=read_i64,protocol=v4,shotover=standard,topology=cluster3 observations

In its default configuration, latte has 10x more throughput than windsock: latte gets ~60000 OPS while windsock gets ~5000 OPS. Increasing latte's thread count drops latte's performance. Surprisingly, I can get numbers similar to latte by setting --operations-per-second 50000, but as soon as I set --operations-per-second 55000, actual OPS drops to 5000.

If I set the bencher OPS to 50000, shotover will meet exactly 50000 OPS. However, if I set the bencher OPS to 55000 then, depending on the run, shotover may reach 55000 or it may get stuck at a much lower OPS; I've seen as low as 5000. If I then set OPS to unlimited, it pretty much always runs at 5000 OPS. Latte doesn't seem to experience this same cliff; it does seem to max out at about the same point that shotover can reach (60000) in its default configuration. But if I increase the number of concurrent messages to 500, it can hit 80000.
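One relevant difference between rate limiters is how they schedule operations. A sketch of a deadline-based fixed-rate limiter (similar in spirit to what fixed-rate benchmarkers like latte use; this is my own illustration, not windsock's or latte's actual code): operation i is due at start + i * interval, and the bencher sleeps until that absolute deadline rather than sleeping a fixed interval after each op completes. Sleeping per-op instead silently lowers the achieved rate whenever per-op latency rises, which is one possible contributor to the cliff described above:

```rust
use std::time::{Duration, Instant};

// Hypothetical deadline-based rate limiter. Each operation has an absolute
// deadline derived from the start time, so slow operations do not push back
// the schedule; the bencher just fires immediately when it is behind.
struct RateLimiter {
    start: Instant,
    interval: Duration,
    issued: u64,
}

impl RateLimiter {
    fn new(ops_per_second: u64) -> Self {
        RateLimiter {
            start: Instant::now(),
            interval: Duration::from_nanos(1_000_000_000 / ops_per_second),
            issued: 0,
        }
    }

    // How long to wait before issuing the next operation
    // (zero if we are already behind schedule).
    fn next_delay(&mut self, now: Instant) -> Duration {
        let deadline = self.start + self.interval * self.issued as u32;
        self.issued += 1;
        deadline.saturating_duration_since(now)
    }
}

fn main() {
    let mut limiter = RateLimiter::new(50_000); // 50k OPS target => 20us interval
    let t0 = limiter.start;
    // The first op is due immediately; the third op is due 2 intervals in.
    assert_eq!(limiter.next_delay(t0), Duration::ZERO);
    let _ = limiter.next_delay(t0);
    assert_eq!(limiter.next_delay(t0), limiter.interval * 2);
    println!("interval = {:?}", limiter.interval);
}
```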

shotover=none gives similar throughputs for latte and windsock, though latte is still a bit higher. Here, increasing thread count does actually improve latte's performance.

Things to try:

rukai commented 9 months ago

This PR has shown promise: https://github.com/shotover/shotover-proxy/pull/1360

However, I think the next step is to add functionality to windsock that allows reusing EC2 instances. This will eliminate the noise caused by differences between EC2 instances. I am thinking of an API like this:

> # Create the resources required to run the benches specified in FILTER and then store the information required to access those instances to disk
> cargo windsock --store-cloud-resources-to-disk FILTER
Creating AWS resources: CloudResourcesRequired {
    shotover_instance_count: 1,
    docker_instance_count: 3,
    include_shotover_in_docker_instance: false,
}
> # Run the benches once, using the instances created in the previous command.
> cargo windsock --use-cloud-resources-from-disk FILTER
Running "kafka,shotover=standard,size=100KB,topology=cluster3"
...
> # Run the benches a second time reusing the same instances
> cargo windsock --use-cloud-resources-from-disk FILTER
Running "kafka,shotover=standard,size=100KB,topology=cluster3"
...
> # Cleanup resources, and also remove the resources-to-disk file to ensure that a later `--use-cloud-resources-from-disk` command fails early.
> cargo windsock --cleanup-cloud-resources
All AWS throwaway resources have been deleted
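The persistence step behind this flow could be sketched as follows. This is a minimal illustration under my own assumptions (struct fields, file name, and key=value format are hypothetical, not windsock's actual implementation): store the resource description to disk, load it on subsequent runs, and delete the file on cleanup so a later `--use-cloud-resources-from-disk` fails early:

```rust
use std::fs;
use std::path::Path;

// Hypothetical resource description; field names mirror the
// CloudResourcesRequired output shown above but are assumptions.
#[derive(Debug, PartialEq)]
struct CloudResources {
    shotover_instance_count: u32,
    docker_instance_count: u32,
}

impl CloudResources {
    // Persist as simple key=value lines (format is an assumption).
    fn store(&self, path: &Path) -> std::io::Result<()> {
        fs::write(
            path,
            format!(
                "shotover_instance_count={}\ndocker_instance_count={}\n",
                self.shotover_instance_count, self.docker_instance_count
            ),
        )
    }

    // Errors if the file is missing, e.g. after cleanup has run.
    fn load(path: &Path) -> std::io::Result<Self> {
        let text = fs::read_to_string(path)?;
        let mut shotover = 0;
        let mut docker = 0;
        for line in text.lines() {
            match line.split_once('=') {
                Some(("shotover_instance_count", v)) => shotover = v.parse().unwrap_or(0),
                Some(("docker_instance_count", v)) => docker = v.parse().unwrap_or(0),
                _ => {}
            }
        }
        Ok(CloudResources {
            shotover_instance_count: shotover,
            docker_instance_count: docker,
        })
    }

    // Cleanup removes the file so a later load fails early.
    fn cleanup(path: &Path) -> std::io::Result<()> {
        fs::remove_file(path)
    }
}

fn main() -> std::io::Result<()> {
    let path = Path::new("cloud_resources.txt");
    let resources = CloudResources { shotover_instance_count: 1, docker_instance_count: 3 };
    resources.store(path)?;
    assert_eq!(CloudResources::load(path)?, resources);
    CloudResources::cleanup(path)?;
    assert!(CloudResources::load(path).is_err());
    Ok(())
}
```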

After that is implemented it should be easier to evaluate #1360