opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.64k stars 1.77k forks source link

[Performance] Adaptive Replica Selection + Shard Idle introduces inconsistent search performance #1536

Open nknize opened 2 years ago

nknize commented 2 years ago

Adaptive Replica Selection (ARS) is a default feature where search requests are routed to an optimal node based on a computed "Node Score" (function of queue size, response time, and service time). The intent of the behavior is to avoid routing search requests to busy nodes with the idea of improving search response time.

Shard Idle is a default feature where shards that have not received a search request after 30s transition to an "idle" state and all future refreshes are skipped unless there are any registered refresh listeners.

Ironically, putting the two "optimization" features together (which are both enabled by default) introduces an interesting periodic degradation in search performance.

If a node is experiencing heavy resource usage such that the ARS computed "score" prevents search requests from being routed longer than 30s then indexes will transition to "search idle" and future refreshes will be skipped. Once the node's resources are finally freed and a search request is received the search request may be queued until the refresh completes again. If an index has received a significant amount of updates (inserts, deletes, etc) this search request could be put on hold for quite some time. If the search request time_out is set below the actual refresh completion time, then the shard may time out. Either way, this situation introduces a non-negligible performance hit.

This issue is open to determine if there is an improvement that can be made in these scenarios or if the best course of action is to update the documentation to simply recommend users explicitly set the index.refresh_interval to prevent shards from ever entering the idle state if "real time" search is a QoS requirement.

While this has been seen before the user never fully understood the reasoning for seemingly "random" search degradation. What's more concerning are the number of users that may be unaware of this degradation.

andrross commented 2 years ago

I'll note that I'm very new to these features, but the issue is tagged with the "discuss" label so I'd figure I'd share my thoughts :)

Given that the stated goal of Shard Idle is to "automatically optimize bulk indexing in the default case when no searches are performed" then it seems like the root of the problem here is that the node made the decision to transition its shards to "search idle" based purely off of local information when in fact those shards were still taking search requests (they were just going to other nodes). Is that right? Is there a way for a node to know or have a heuristic to suggest that it is consistently losing requests due to its Adaptive Replica Selection score and therefore defer any Shard Idle transitions?

That being said, it seems like you will inherently trade off some amount of performance consistency by enabling Shard Idle, so the suggestion to recommend disabling it when consistency is important doesn't seem unreasonable.

nknize commented 2 years ago

Is there a way for a node to know or have a heuristic to suggest that it is consistently losing requests due to its Adaptive Replica Selection score and therefore defer any Shard Idle transitions?

Hmm.. this would need to be added to the coordinating node logic.

It would be interesting to add something like lastSearchRequestNodeTime to the routing table and when computing ARS send a "ping" to all nodes pending idle transition nodes if lastSearchRequestTime - lastSearchRequestNodeTime is between some time window (e.g., < 30s > 15s). Of course this adds overhead when recommending an explicit refresh_interval achieves something similar. ¯\(ツ)

aabukhalil commented 2 years ago

I'm looking into this.

aabukhalil commented 2 years ago

I think that Shard Idle feature by itself is causing inconsistent search performance even without the adaptive replica selection (when cluster has low search traffic or when _routing is used and that happened to me in the past and I had hard time figuring out that explicitly setting refresh_interval solve high percentile latency ).

With combination of Adaptive Replica Selection, it is causing same effect even when cluster has consistent traffic so it is a another side effect of Shard Idle feature which was created to enhance indexing throughput and trade it off with search latency.

Since Shard Idle is a trade off feature where some users need it and others don't, I think that explicitly documenting this trade off will work for all and let users decide based on their requirements. I think optimizing Adaptive Replica Selection to take into consideration latest search time per node per shard globally (across all coordinator nodes) or locally (each coordinator node has it's view on latest search time which will still keep room for latency randomness depending on which coordinator node serving the request? ) will help reducing the effect of Shard Idle when cluster is not really "idle" but won't help when the cluster has low traffic so I'm not sure if we should invest in this optimization. What do you think ?

Additionally, Shard Idle feature will affect the freshness of data and the randomness of latency in the ongoing effort of Segment Replication, which in a nutshell uses primary shards refreshes to trigger the segment replication to replicas. When segment replication is enabled then only the refresh of the primary shard is what matter and if shard idle is enabled and primary shard goes into idle, we might have these cases:

Enabling Adaptive Replica Selection while Shard Idle feature is on and segment replication is on might cause more interesting effect because Adaptive Replica Selection takes into consideration queueSize, responseDuration and serviceTimeEWMA when computing ranks and these factors might be affected for node which holds the primary shard because it has more overhead ? that need to be confirmed. If this is the case then adaptive replica selection will route search requests away from primary shard which will impact data freshness because refresh might not be triggered until OpenSearch's flush threshold is crossed. This might be still beneficial for indexing throughput and some users might need it.

So here is my suggestion:

What do you think @nknize, @andrross and @mch2 ?