opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Occasional variance seen in running benchmarks on OpenSearch on same hardware over time #14192

Open IanHoang opened 2 months ago

IanHoang commented 2 months ago

Describe the bug

Recently, we carried out a set of runs with OpenSearch Benchmark against different OpenSearch versions (2.13 and 2.14). While overall performance was consistent, there have been occasions where the variance exceeded the acceptable threshold.

We investigated whether the variance was caused by OpenSearch Benchmark (the client side), the OpenSearch cluster (the server side), or the hardware used by either. We also inspected the Lucene-level segments to see whether the occasional variance stemmed from discrepancies in segment count (caused by force merges and random merges), and switched to restoring the data directory so that the number of segments remained the same between runs.
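One way to verify that segment counts really do stay constant between runs is to poll the `_cat/segments` API before and after each iteration. The sketch below is a minimal illustration, not the tooling we actually used; the host, port, and the index name `big5` are assumptions.

```python
# Minimal sketch (assumptions noted above): count Lucene segments per shard
# via the _cat/segments API and compare the counts before and after a run.
import requests

def segment_counts(host="http://localhost:9200", index="big5"):
    """Return the number of Lucene segments per shard for the given index."""
    resp = requests.get(f"{host}/_cat/segments/{index}",
                        params={"format": "json"}, timeout=10)
    resp.raise_for_status()
    counts = {}
    for seg in resp.json():
        shard = (seg["index"], seg["shard"])
        counts[shard] = counts.get(shard, 0) + 1
    return counts

if __name__ == "__main__":
    before = segment_counts()
    # ... run one benchmark iteration here ...
    after = segment_counts()
    print("segment counts unchanged:", before == after)
```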

Opening this issue in this repository until we discover the root cause. Please let me know if there's a better tag for this. Looping in @jed326, as he has been looking into variance seen in concurrent and non-concurrent scenarios for some term queries, and there seems to be some overlap in our theories.

More Details on Testing Setup and Results

In May, we provisioned a single-node OpenSearch 2.13 cluster, ran OSB for four iterations, took the arithmetic mean across the iterations, and calculated the relative standard deviation (RSD) between iterations. About a month later, we ran another round of four iterations on the same cluster. We discovered that the same set of test configurations on the same cluster produced different RSD values.

For example, here we are comparing the variance of one operation between two separate runs -- one from May and one from June -- on the same cluster.

| Operation | May Service Time RSD (%) | June Service Time RSD (%) |
| --- | --- | --- |
| multi_terms-keyword | 1.69 | 27.67 |

Related component

Search:Performance

To Reproduce

  1. Create load generation hosts (with enough resources that they do not bottleneck the tests)
  2. Install OpenSearch Benchmark onto the load generation hosts
  3. Run a workload with a diverse set of queries (such as the Big5 workload) for several iterations
  4. Calculate arithmetic means and relative standard deviations (RSD) for all operations across all iterations. To do this, you can use scripts such as the ones I've used here; a minimal sketch of the calculation is shown below the list.
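The following sketch shows the calculation in step 4 under stated assumptions; it is not the referenced scripts, and the service-time numbers are placeholders, not measured results.

```python
# Minimal sketch: given each operation's mean service time from every iteration,
# compute the arithmetic mean and the relative standard deviation
# (RSD = stdev / mean * 100).
from statistics import mean, stdev

def rsd(samples):
    """Relative standard deviation, in percent, of one operation's service times."""
    m = mean(samples)
    return stdev(samples) / m * 100 if m else float("nan")

# Hypothetical numbers: service time (ms) of one operation across four iterations.
iterations = {"multi_terms-keyword": [312.0, 318.5, 305.2, 309.8]}

for operation, samples in iterations.items():
    print(f"{operation}: mean={mean(samples):.1f} ms, RSD={rsd(samples):.2f}%")
```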

Expected behavior

Because we reuse the same hardware rather than spinning up new hardware for each test, we expect results to be consistent from run to run. However, we are seeing occasional variance both between iterations and between runs.

Host/Environment (please complete the following information):

jed326 commented 2 months ago

Thanks for opening this @IanHoang!

In the specific scenario you're describing, it sounds like you're talking about an increase in RSD over 4 data points compared ~1 month apart. I'm not sure 4 data points is enough to determine whether the variance is truly different between those time periods, and obviously we can't go back in time to take more samples.

That being said, just looking at the recently published Big5 nightly performance numbers, we can definitely see a lot of what you're describing with respect to variance. Even if we look only at the 2.14.0 numbers, where the underlying code is not changing, we can still see ~20% variance in service time day to day for some of the operations. This is quite strange, as the only thing that should be changing between these runs is the underlying hardware the cluster is running on! Moreover, if you compare the p90 vs. p50 of these operations on those days, they track each other quite closely, which means the variance on any given day is not that high, yet somehow the variance across days is quite high.
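A minimal sketch of how that observation could be quantified, assuming we only have per-day p50/p90 numbers (the values below are placeholders, and this is not the actual nightly analysis): compare the spread within a day (how far p90 sits from p50) against the spread across days (RSD of the daily p50s).

```python
# Minimal sketch: within-day spread (p90 relative to p50) vs. across-day spread
# (RSD of daily p50 values) for one operation.
from statistics import mean, stdev

# Hypothetical nightly results for one operation: (p50_ms, p90_ms) per day.
daily = [(40.1, 42.0), (51.3, 53.5), (39.8, 41.5), (48.9, 50.7)]

p50s = [p50 for p50, _ in daily]
within_day = mean((p90 - p50) / p50 * 100 for p50, p90 in daily)
across_days = stdev(p50s) / mean(p50s) * 100

print(f"avg p90-over-p50 spread within a day: {within_day:.1f}%")
print(f"RSD of p50 across days:               {across_days:.1f}%")
```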

I think in order to investigate this we would probably need some platform-side enhancements. Off the top of my head I'm thinking: