Open IanHoang opened 5 months ago
Thanks for opening this @IanHoang!
In the specific scenario you're describing, it sounds like you're talking about an increase in RSD over 4 data points when compared ~1 month apart. I'm not sure 4 data points is enough to determine whether the variance is truly different between those time periods, and obviously we can't go back in time to take more samples.
That being said, just looking at the recently published big5 nightly performance numbers, we can definitely see a lot of what you're describing with respect to variance. Even if we look only at the 2.14.0 numbers, where the underlying code is not changing, we can still see ~20% variance in service time day to day for some of the operations. This is quite strange, as the only thing that should be changing between these runs is the underlying hardware the cluster is running on! Moreover, if you compare the p90 vs. p50 of these operations on those days, they track each other quite closely, which means the variance within any given day is not that high, yet somehow the variance across days is quite high.
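To make the within-day vs. across-day comparison concrete, here is a minimal Python sketch. The nightly values are invented placeholders, not numbers from the dashboard, and the helper names `within_day_spread` and `across_day_cv` are hypothetical. It assumes the p90/p50 gap is a reasonable proxy for within-day spread and uses the sample standard deviation for the across-day figure.

```python
from statistics import mean, stdev

# Hypothetical nightly results for one operation: (p50, p90) service time in ms.
# These are made-up placeholders, not values from the big5 dashboard.
nightly = [(100.0, 105.0), (120.0, 126.0), (90.0, 94.5), (110.0, 115.5)]


def within_day_spread(p50, p90):
    """Relative gap between p90 and p50 for a single night."""
    return (p90 - p50) / p50


def across_day_cv(values):
    """Coefficient of variation (sample stdev / mean) across nights."""
    return stdev(values) / mean(values)


spreads = [within_day_spread(p50, p90) for p50, p90 in nightly]
p50_cv = across_day_cv([p50 for p50, _ in nightly])

print("within-day spread per night:", [round(s, 3) for s in spreads])
print("across-day CV of p50:", round(p50_cv, 3))
```

With placeholder data shaped like this, the p90 stays close to the p50 every night while the p50 itself moves noticeably from night to night, which is the pattern described above.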
I think in order to investigate this we would probably need some platform side enhancements. Off the top of my head I'm thinking:
Describe the bug
Recently, we carried out some runs with OpenSearch Benchmark against various OpenSearch versions (2.13 and 2.14). While the overall performance was consistent, on some occasions the variance in performance exceeded the acceptable threshold.
We explored whether the variance was caused by OpenSearch Benchmark (the client side), the OpenSearch cluster (the server side), or the hardware used by both. Additionally, we inspected the Lucene-level segments to see whether the occasional variance was caused by discrepancies in segment count (caused by force-merges and random merges), and switched to restoring the data directory so that the number of segments remained the same.
Opening this issue in this repository until we discover what the root cause might be. Please let me know if there's a better tag for this. Looping in @jed326 as he has been looking into variance seen in concurrent & non-concurrent scenarios for some term queries and there seems to be overlap in theories.
More Details on Testing Setup and Results
In May, we provisioned a single-node OpenSearch 2.13 cluster and ran OSB for four iterations, took the arithmetic mean across the iterations, and calculated the relative standard deviation (RSD) between iterations. About a month later, on the same cluster, we ran another round of four iterations. We discovered that the same set of test configurations on the same cluster produced different RSD values.
For example, here we are comparing the variance between two operations from two separate runs -- one from May and one from June -- on the same cluster.
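For reference, the RSD calculation described above can be sketched in Python as follows. The service times are invented placeholders rather than the measurements from these runs, and `relative_std_dev` is a hypothetical helper name. The sketch assumes the sample standard deviation (n - 1 denominator); the original calculation may have used the population form instead.

```python
from statistics import mean, stdev


def relative_std_dev(samples):
    """Return the RSD (sample stdev / mean) of the iterations, as a percentage."""
    if len(samples) < 2:
        raise ValueError("need at least two iterations")
    avg = mean(samples)
    if avg == 0:
        raise ValueError("mean is zero; RSD is undefined")
    return stdev(samples) / avg * 100.0


# Placeholder service times (ms) for one operation, four iterations per round.
# These are illustrative only, not the values measured in May and June.
may_run = [100.0, 102.0, 98.0, 100.0]
june_run = [100.0, 110.0, 90.0, 100.0]

print(f"May  mean={mean(may_run):.1f} ms  RSD={relative_std_dev(may_run):.2f}%")
print(f"June mean={mean(june_run):.1f} ms  RSD={relative_std_dev(june_run):.2f}%")
```

Both placeholder rounds have the same mean, so comparing means alone would hide the difference; the RSD is what exposes the change in spread between rounds.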
Related component
Search:Performance
To Reproduce
Expected behavior
As opposed to spinning up new hardware each time we run a test, we're using the same hardware and should expect to see the same variance. However, we're seeing occasional variance between iterations and between runs.
Host/Environment (please complete the following information):