opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Backpressure in Search Path #1329

Open tushar-kharbanda72 opened 2 years ago

tushar-kharbanda72 commented 2 years ago

Introduction

Backpressure in the search path aims to enhance the overall resiliency of OpenSearch. The current protection mechanisms on OpenSearch nodes, such as ThreadPoolQueueSize and CircuitBreakers, are not sufficient to protect the cluster against a traffic surge, partial failures, a slow node, or a single rogue (resource-guzzling) query. Search backpressure aims to introduce constructs for fair rejections, minimal wastage of useful work already done, search request cost estimation, and the ability to stabilise the cluster when under duress.

Problem Statement

In OpenSearch there are only a few gating mechanisms that prevent a node from melting down when serving search requests under duress: essentially queue rejections and circuit breakers. However, these gating mechanisms have static limits, make purely local decisions, and often act too late because of configuration issues.

Problem with existing constructs

Other problems which OS users can run into

Proposed Solution

Summary

We propose to introduce a backpressure framework in the search path to address the above concerns and improve the overall resiliency of the OpenSearch cluster. For this we'll introduce a resource tracking framework for end-to-end memory/CPU tracking of a search request, covering the different phases of execution across all shard requests. Adaptive replica selection will consider these resource metrics from the different nodes while ranking them, to reduce the load on target nodes already running into resource contention. The framework will also use the same resource utilisation metrics to decide whether to reject or cancel an in-flight request, with added fairness.

Task-level prioritization based on the current state of a request can also be achieved, such as preferring requests already in the fetch phase over those yet to be picked up for the query phase. This helps stabilise a node under duress by quickly completing requests that already have useful work done. In addition, the framework would include a search request cost-estimation model to support a proactive form of backpressure, which decides for each incoming request/task whether the node will be able to serve a query of that cost. We have divided this proposal into 3 milestones as follows:

Milestone 1

Goals

1.1 Resource Tracking Framework

Build a resource tracking framework for search requests at both the REST level (client requests) and the transport level (shard search requests), which tracks resource consumption on OpenSearch nodes for various search operations at different levels of granularity:

Characteristics:
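As a rough illustration of the kind of per-task accounting such a framework needs, the sketch below accumulates CPU time and heap allocations per search task. All class and method names here are hypothetical, not the actual OpenSearch implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of per-task resource accounting for search requests.
 * A real implementation would hook into task lifecycle events on both the
 * REST (coordinator) and transport (shard) layers.
 */
public class SearchTaskResourceTracker {

    /** Resource usage accumulated by a single search task. */
    public static class TaskResourceUsage {
        final AtomicLong cpuTimeNanos = new AtomicLong();
        final AtomicLong heapBytesAllocated = new AtomicLong();
    }

    private final Map<Long, TaskResourceUsage> usageByTaskId = new ConcurrentHashMap<>();

    /** Called when a search task starts executing on this node. */
    public void onTaskStart(long taskId) {
        usageByTaskId.putIfAbsent(taskId, new TaskResourceUsage());
    }

    /** Called at phase checkpoints (e.g. end of the query phase) with deltas
     *  measured on the executing thread. */
    public void recordUsage(long taskId, long cpuDeltaNanos, long heapDeltaBytes) {
        TaskResourceUsage usage = usageByTaskId.get(taskId);
        if (usage != null) {
            usage.cpuTimeNanos.addAndGet(cpuDeltaNanos);
            usage.heapBytesAllocated.addAndGet(heapDeltaBytes);
        }
    }

    /** Total CPU time attributed to in-flight search tasks on this node. */
    public long totalCpuTimeNanos() {
        return usageByTaskId.values().stream().mapToLong(u -> u.cpuTimeNanos.get()).sum();
    }

    /** Called when the task completes or is cancelled. */
    public void onTaskComplete(long taskId) {
        usageByTaskId.remove(taskId);
    }
}
```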

1.2 Adaptive replica Selection - Factor in node resource utilisation

Currently, adaptive replica selection factors in the number of tasks in the search thread pool queue when ranking the target shards. Because the cost of each search request can vary significantly, the queue count does not accurately reflect how much more search work a node can accept and complete. To be more accurate, we can also factor in the resource utilisation of search requests on these nodes and make better routing decisions on the coordinator.
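For intuition, the node score used for replica selection could blend the existing queue-depth signal with the tracked CPU/heap utilisation, so that a short queue of very expensive queries ranks no better than a long queue of cheap ones. A hypothetical weighting, not the actual adaptive replica selection formula:

```java
/**
 * Hypothetical ranking sketch: lower score = more preferred replica.
 * Real adaptive replica selection ranks nodes using moving averages of
 * response time, service time, and queue size; this only illustrates
 * folding node resource utilisation into the same decision.
 */
public final class ReplicaRank {

    /**
     * @param queuedTasks     tasks waiting in the node's search thread pool
     * @param cpuUtilization  fraction of node CPU consumed by search tasks, 0..1
     * @param heapUtilization fraction of heap consumed by search tasks, 0..1
     */
    public static double score(int queuedTasks, double cpuUtilization, double heapUtilization) {
        double queuePenalty = Math.log1p(queuedTasks);        // diminishing weight for raw count
        double resourcePenalty = Math.max(cpuUtilization, heapUtilization);
        // A node close to saturation is heavily penalised regardless of queue depth.
        return (1.0 + queuePenalty) * (1.0 + 4.0 * resourcePenalty * resourcePenalty);
    }

    private ReplicaRank() {}
}
```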

Milestone 2

Goals

2.1 Server Side rejection of in-coming search requests

Currently, search rejections are based solely on the number of tasks in the queue of the search thread pool. That doesn't provide fairness in rejections: many small queries can exhaust this limit even though they are not resource-intensive and the node could take in many more requests, and vice versa. Essentially, the count is not a reflection of actual work.

Hence, based on the metrics in point 1.1 above, we want to build a framework which can perform more informed rejections based on point-in-time resource utilisation. The new model will make the admission decision for search requests on the node. These admission decisions or rejection limits can operate at different levels:

This can be further evolved to support a shard-level priority model, where users can set a priority on an index or on every request, so that the framework can consume it when making admission/rejection decisions.

If the user has configured partial results to be allowed, then these rejections, combined with the coordinator's inability to retry the request on another shard copy on a different node, might result in the user getting a partial response.

The above will provide the required isolation of accounting and fairness in rejections, which is currently missing. This is still a reactive backpressure mechanism, as it only focuses on current consumption and does not estimate the future work still to be done for these search requests.
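As a minimal sketch of what such a node-level admission check could look like, assuming the tracked search resource usage from point 1.1 is available as simple fractions (the thresholds and names are illustrative, not shipped settings):

```java
/**
 * Hypothetical admission check: reject new shard search requests based on
 * the point-in-time resource footprint of in-flight search tasks rather
 * than the number of queued tasks.
 */
public class SearchAdmissionController {

    private final double maxSearchCpuFraction;   // e.g. 0.75 of node CPU
    private final double maxSearchHeapFraction;  // e.g. 0.05 of heap

    public SearchAdmissionController(double maxSearchCpuFraction, double maxSearchHeapFraction) {
        this.maxSearchCpuFraction = maxSearchCpuFraction;
        this.maxSearchHeapFraction = maxSearchHeapFraction;
    }

    /**
     * @param searchCpuFraction  CPU fraction currently attributed to search tasks
     * @param searchHeapFraction heap fraction currently attributed to search tasks
     * @return true if the request may proceed, false if it should be rejected
     */
    public boolean admit(double searchCpuFraction, double searchHeapFraction) {
        // Count-based gating becomes work-based: many cheap queries remain
        // admitted, while a few expensive ones are enough to trigger rejections.
        return searchCpuFraction < maxSearchCpuFraction
            && searchHeapFraction < maxSearchHeapFraction;
    }
}
```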

2.2 Server side Cancellation of in-flight search requests based on resource consumption

This is the third level, which kicks in after we are already rejecting all new search requests coming to a node. Here, we decide to cancel ongoing requests if resource usage for that shard/node has breached the assigned limits (point 2.1) and shows no recovery within a certain time threshold. The backpressure model should support identifying the queries which are the most resource-guzzling and whose cancellation wastes the least useful work; these can then be cancelled to recover a node under load while it continues doing useful work.
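One way to pick cancellation victims under that goal (free the most resources while discarding the least completed work) is sketched below; the heuristic and names are illustrative, not the merged design:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical victim selection for cancelling in-flight search tasks. */
public final class CancellationPolicy {

    /** Minimal view of an in-flight task for the purposes of this sketch. */
    public record InFlightTask(long taskId, long cpuNanosConsumed, double fractionComplete) {}

    /**
     * Prefer the task that holds the most resources but has done the least
     * useful work, so cancelling it frees capacity with minimal waste.
     */
    public static Optional<InFlightTask> selectVictim(List<InFlightTask> tasks) {
        return tasks.stream()
            .max(Comparator.comparingDouble(
                t -> t.cpuNanosConsumed() * (1.0 - t.fractionComplete())));
    }

    private CancellationPolicy() {}
}
```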

Milestone 3

Goals

3.1 Query resource consumption estimation

Improve the framework added as part of point 2 to also estimate the cost of new search queries, based on their potential resource utilisation. This can be achieved by looking at past query patterns with similar query constructs, together with the actual data on the shard. It will help build a proactive backpressure model for search requests, where estimates are compared against the available resources during the admission decision, for more granular control.
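A first-cut estimator could, for example, key recent per-query resource measurements by a coarse description of the query's structure and use a running average as the predicted cost of a similar incoming query. Purely illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical cost estimator: predicts the resource cost of an incoming
 * query from past measurements of queries with a similar structure.
 */
public class QueryCostEstimator {

    /** Running [count, mean CPU nanos] of observed cost, keyed by query "shape". */
    private final Map<String, double[]> stats = new ConcurrentHashMap<>();

    /** queryShape is a coarse feature key, e.g. "bool+terms_agg" (illustrative). */
    public void recordObservation(String queryShape, long cpuNanos) {
        stats.merge(queryShape, new double[] {1, cpuNanos}, (old, fresh) -> {
            double count = old[0] + 1;
            double mean = old[1] + (cpuNanos - old[1]) / count;   // incremental mean
            return new double[] {count, mean};
        });
    }

    /** Predicted CPU cost, or a conservative default when the shape is unseen. */
    public double estimateCpuNanos(String queryShape, double defaultCpuNanos) {
        double[] s = stats.get(queryShape);
        return s == null ? defaultCpuNanos : s[1];
    }
}
```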

reta commented 2 years ago

Thanks for bringing this up, @tushar-kharbanda72, there is no doubt that applying back pressure on the search flows would lead to more stable clusters. There are quite a few successful implementations related to this proposal, e.g. Akka's Adaptive Load Balancing (https://doc.akka.io/docs/akka/current/cluster-metrics.html#adaptive-load-balancing), which to some extent addresses the same problem (it could be a good case study).

For this we’ll introduce resource tracking framework for end to end memory/cpu tracking of Search request, covering different phases of execution across all shard requests.

I assume the only way to claim that a node uses all its resources for search is when it has a single data role (or alike). In many deployments this is not the case, and the same nodes may be involved in ingestion and/or cluster coordination, etc. How would the backpressure / resource tracking framework deal with that?

Also, should the resource tracking framework consult the overall node load (memory/CPU at least)? Co-located deployments sadly are not that rare :(

Task level prioritizations based on current state of request can be achieved, such as preferring requests already in fetch phase over those yet to be picked for query phase.

I think in general it makes a lot of sense. But here is a counterexample (seen in the wild): the query phase takes a very long time, followed by the fetch phase. At that moment, the client has already lost any hope of getting the response (it may even be gone), but the server is still working on it, and live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weigh its completion relevance to the client?

Improve the framework added as part of (point 2) to also estimate the cost of new search queries, based on their potential resource utilisation.

The approach(es) you have mentioned make sense. Probably it would be good to invest in a full-fledged query planner which will assign costs to each query element and to the overall query at the end? It may not be easy, for sure, but an iterative approach could be taken to refine it along the way.

Thank you.

asafm commented 2 years ago

I think in general it makes a lot of sense. But here is a counterexample (seen in the wild): the query phase takes a very long time, followed by the fetch phase. At that moment, the client has already lost any hope of getting the response (it may even be gone), but the server is still working on it, and live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weigh its completion relevance to the client?

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.

reta commented 2 years ago

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.

@asafm I think something along these lines, for example interpreting client disconnects as task cancellations and propagating that across the nodes (since the coordinator is aware of the client state and the tasks).
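As a sketch of that idea, assuming hook points for channel-close notifications and task cancellation exist (the interfaces below stand in for the real plumbing and are not the actual task management API):

```java
/**
 * Hypothetical coordinator-side hook: when the HTTP channel backing a
 * search request closes, cancel the root task, which propagates to the
 * child shard tasks on the data nodes.
 */
public class DisconnectAwareSearchHandler {

    /** Stand-ins for the real channel/task plumbing. */
    public interface HttpChannel { void addCloseListener(Runnable onClose); }
    public interface TaskManager { void cancelTaskAndDescendants(long rootTaskId, String reason); }

    private final TaskManager taskManager;

    public DisconnectAwareSearchHandler(TaskManager taskManager) {
        this.taskManager = taskManager;
    }

    public void onSearchStarted(long rootTaskId, HttpChannel channel) {
        // Interpret a client disconnect as a cancellation request and fan it
        // out to the data nodes so no further work is wasted on a dead query.
        channel.addCloseListener(() ->
            taskManager.cancelTaskAndDescendants(rootTaskId, "client disconnected"));
    }
}
```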

dblock commented 2 years ago

This is a good proposal 👍. Resource consumption tracking is great, but it's a lot of work, so I would consider simpler approaches in parallel.

sruti1312 commented 2 years ago

Hi @CEHENKLE, this issue requires active collaboration from multiple developers and performance testing before merging to the main branch. Can you help with creating a feature branch for this, feature/task-resource-tracking? Related issue: https://github.com/opensearch-project/OpenSearch/issues/1009

CEHENKLE commented 2 years ago

Done! https://github.com/opensearch-project/OpenSearch/tree/feature/task-resource-tracking

tushar-kharbanda72 commented 2 years ago

Thanks @dblock for going through the proposal. We have the PRs related to task resource tracking in progress and that should complete soon. Targeting March for that.

Soon we'll fast-follow that with the initial low-effort rejection strategies which you and others mentioned, which we'll include as part of Milestone 2.

Milestone 3 is a lot of work and we'll have to evaluate the need and criticality after we have initial improvements merged in.

getsaurabh02 commented 2 years ago

Adding Meta Issue and the child issues opened previously to this RFC in order to track deliverables: https://github.com/opensearch-project/OpenSearch/issues/1042

anasalkouz commented 2 years ago

@tushar-kharbanda72 @getsaurabh02 since you have a meta issue to track the progress of this initiative, could you close the RFC? And what is the target release for this project?

elfisher commented 2 years ago

Is this still on track for 2.2?

rramachand21 commented 2 years ago

This (milestone 2) will come in 2.3 - we are merging in the changes for resource tracking framework in 2.2 (milestone 1)

elfisher commented 2 years ago

Thanks @rramachand21!

elfisher commented 2 years ago

@rramachand21 is there an issue cut for milestone 2? I'd like to sync it up with the related documentation issue.

JeffHuss commented 2 years ago

@rramachand21 please provide a feature document for this so we can get it documented if it's still on track for 2.3. The docs issue is https://github.com/opensearch-project/documentation-website/issues/795

Currently I'm blocked.

anasalkouz commented 1 year ago

@tushar-kharbanda72 is this still on track for the 2.4 release? Code freeze is on 11/3. Is there anything pending? Otherwise, feel free to close it.

anasalkouz commented 1 year ago

@rramachand21 is this on track for 2.4 release? code freeze today

minalsha commented 1 year ago

@rramachand21 Is there anything pending on this? If not, can we close this issue, since 2.4 is already in production, and update it with the latest version that you are tracking?

kartg commented 1 year ago

@rramachand21 is this on track for 2.6.0 release? The code freeze for this is today (Feb 21, 2023)

anasalkouz commented 1 year ago

Hey @tushar-kharbanda72, is this feature still on track for 2.7? The code freeze date is April 17, 2023.

rramachand21 commented 1 year ago

@ketanv3 could you please update the latest on this?

DarshitChanpura commented 1 year ago

Hi @tushar-kharbanda72, this issue will be marked for the next release, v2.8.0, on Apr 17, as that is the code-freeze date for v2.7.0. Please let me know if otherwise.

DarshitChanpura commented 1 year ago

Tagging it to next release: v2.8.0