Open kkewwei opened 1 week ago
Thanks, @kkewwei. @kkhatua, you mentioned you have some more information here about ARS.
@kkewwei Adaptive Replica Selection (ARS) uses the average search latency per node as the basis for routing shard-level traffic. As an example, assume there are 3 nodes (A, B and C) and 2 indices: X (1 primary / 2 replicas) and Y (1 primary, no replica). Node A hosts the only shard of index Y, which typically serves heavy queries, while X has a copy of its single shard on all three nodes.
Any traffic for index Y will skew node A's average latency higher than the average latencies of B and C.
Assuming request traffic is evenly routed to a separate set of nodes, P (which act as coordinators), ARS will show less preference for routing shard requests to node A because of its higher average latency compared to B and C. The algorithm doesn't look at CPU; it uses latency as a proxy for whether a node is overloaded. Hence, I wouldn't call this a bug, though there are potential (and unrelated) enhancements for improving ARS.
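To make the example above concrete, here is a toy sketch of latency-based routing. It is not the actual ARS ranking formula, which is more involved; it only assumes that a coordinator tracks a moving average of observed latency per node and sends each index X request to the node with the lowest average. The smoothing factor `ALPHA`, the latency values, and the `simulate` function are all made up for illustration.

```python
from collections import Counter

ALPHA = 0.3  # assumed smoothing factor for the moving average


def simulate(rounds=300, light_ms=10.0, heavy_ms=200.0):
    """Toy model: count how many index X requests each node receives."""
    # Average latency per node as seen by the coordinator (hypothetical start values).
    avg = {"A": light_ms, "B": light_ms, "C": light_ms}
    x_routed = Counter({"A": 0, "B": 0, "C": 0})

    def observe(node, latency_ms):
        # Exponentially weighted moving average of observed latency.
        avg[node] = ALPHA * latency_ms + (1 - ALPHA) * avg[node]

    for _ in range(rounds):
        # Index Y has its only shard on node A, so the heavy query must go there.
        observe("A", heavy_ms)
        # Index X has a copy on every node: prefer the lowest average latency,
        # breaking ties by the number of requests already sent to the node.
        target = min(avg, key=lambda n: (avg[n], x_routed[n]))
        x_routed[target] += 1
        observe(target, light_ms)
    return x_routed


if __name__ == "__main__":
    print(simulate())
```

In this toy model node A's average stays well above that of B and C, so the index X traffic lands on B and C. That matches the reduced preference for A described above, even though A's CPU is never consulted.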
If you do a similar comparison of latency across the nodes, you should see that with ARS enabled the search latencies are more closely aligned across all the nodes. Could you share that for comparison as well?
@kkhatua, thank you for your reply.
In your example, if I disable ARS, the latency and CPU usage of A will be higher. I can't get detailed metrics from our production environment, so I will test on my own test cluster and collect more metrics.
Describe the bug
In our production environment, we found that CPU usage across the nodes is very uneven even though shard allocation is relatively uniform. When we disable ARS, CPU usage becomes relatively stable.
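For reference, ARS is toggled through the dynamic cluster setting `cluster.routing.use_adaptive_replica_selection`, sent to `PUT /_cluster/settings`. The snippet below is only a sketch of how such a request body could be built; the helper name `ars_settings_body` is hypothetical, the choice of `persistent` over `transient` is an assumption, and the exact settings applied in this report are not shown here.

```python
import json


def ars_settings_body(enabled: bool) -> str:
    """Build a JSON body for PUT /_cluster/settings that toggles ARS."""
    return json.dumps({
        "persistent": {
            # Real cluster setting name; "persistent" scope is an assumption.
            "cluster.routing.use_adaptive_replica_selection": enabled
        }
    })


if __name__ == "__main__":
    # Body to send when disabling ARS.
    print(ars_settings_body(False))
```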
After applying the settings, the result is as follows:
Related component
Search:Relevance
To Reproduce
There seems to be something wrong with search ARS. I can't reproduce it yet and will continue to investigate the cause.
Expected behavior
no