opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Backpressure in Search Path #1329

Open tushar-kharbanda72 opened 2 years ago

tushar-kharbanda72 commented 2 years ago

Introduction

Backpressure in the search path aims to enhance the overall resiliency of OpenSearch. The current protection mechanisms on OpenSearch nodes, such as ThreadPoolQueueSize and CircuitBreakers, are not sufficient to protect the cluster against a traffic surge, partial failures, a slow node, or a single rogue (resource-guzzling) query. Search backpressure aims to introduce constructs for fair rejections, minimal wastage of useful work already done, search request cost estimation, and the ability to stabilise the cluster when under duress.

Problem Statement

In OpenSearch there are only a few gating mechanisms that prevent a node from melting down when serving search requests under duress: essentially queue rejections and circuit breakers. However, these gating mechanisms have static limits, make purely local decisions, and often act too late because of configuration issues.

Problem with existing constructs

Other problems which OS users can run into

Proposed Solution

Summary

We propose to introduce a backpressure framework in the search path to address the above concerns and improve the overall resiliency of the OpenSearch cluster. For this we'll introduce a resource tracking framework for end-to-end memory/CPU tracking of a search request, covering the different phases of execution across all shard requests. Adaptive replica selection will consider these resource metrics from the different nodes while ranking them, to reduce the load on target nodes already running into resource contention. The framework will also use the same resource utilisation metrics to decide whether to reject or cancel an in-flight request, with added fairness.

Task-level prioritization based on the current state of a request can also be achieved, such as preferring requests already in the fetch phase over those yet to be picked up for the query phase. This helps stabilise a node under duress by quickly completing requests that already have useful work done. In addition, the framework would include a search request cost-estimation model to support a proactive form of backpressure, which decides for each incoming request/task whether the node will be able to serve a query of that cost. We have divided this proposal into 3 milestones as follows:

Milestone 1

Goals

1.1 Resource Tracking Framework

Build a resource tracking framework for search requests at both the REST level (client requests) and the transport level (shard search requests), which tracks resource consumption on OpenSearch nodes for various search operations at different levels of granularity:

Characteristics:
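As a rough illustration of the kind of per-task accounting such a framework needs, the sketch below accumulates CPU time and heap allocations per search task. All class and method names here are hypothetical, not the actual OpenSearch implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of per-task resource accounting for search requests.
 * A real implementation would hook into task lifecycle events on both the
 * REST (coordinator) and transport (shard) layers.
 */
public class SearchTaskResourceTracker {

    /** Resource usage accumulated by a single search task. */
    public static class TaskResourceUsage {
        final AtomicLong cpuTimeNanos = new AtomicLong();
        final AtomicLong heapBytesAllocated = new AtomicLong();
    }

    private final Map<Long, TaskResourceUsage> usageByTaskId = new ConcurrentHashMap<>();

    /** Called when a search task starts executing on this node. */
    public void onTaskStart(long taskId) {
        usageByTaskId.putIfAbsent(taskId, new TaskResourceUsage());
    }

    /** Called at phase checkpoints (e.g. end of the query phase) with deltas
     *  measured on the executing thread. */
    public void recordUsage(long taskId, long cpuDeltaNanos, long heapDeltaBytes) {
        TaskResourceUsage usage = usageByTaskId.get(taskId);
        if (usage != null) {
            usage.cpuTimeNanos.addAndGet(cpuDeltaNanos);
            usage.heapBytesAllocated.addAndGet(heapDeltaBytes);
        }
    }

    /** Total CPU time attributed to in-flight search tasks on this node. */
    public long totalCpuTimeNanos() {
        return usageByTaskId.values().stream().mapToLong(u -> u.cpuTimeNanos.get()).sum();
    }

    /** Called when the task completes or is cancelled. */
    public void onTaskComplete(long taskId) {
        usageByTaskId.remove(taskId);
    }
}
```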

1.2 Adaptive replica Selection - Factor in node resource utilisation

Currently, adaptive replica selection factors in the number of tasks in the search thread pool queue when ranking the target shards. Because the cost of each search request can vary significantly, the queue count does not accurately reflect how much more search work a node can accept and complete. To be more accurate, we can also factor in the resource utilisation of search requests on these nodes and make better routing decisions on the coordinator.
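For intuition, the node score used for replica selection could blend the existing queue-depth signal with the tracked CPU/heap utilisation, so that a short queue of very expensive queries ranks no better than a long queue of cheap ones. A hypothetical weighting, not the actual adaptive replica selection formula:

```java
/**
 * Hypothetical ranking sketch: lower score = more preferred replica.
 * Real adaptive replica selection ranks nodes using moving averages of
 * response time, service time, and queue size; this only illustrates
 * folding node resource utilisation into the same decision.
 */
public final class ReplicaRank {

    /**
     * @param queuedTasks     tasks waiting in the node's search thread pool
     * @param cpuUtilization  fraction of node CPU consumed by search tasks, 0..1
     * @param heapUtilization fraction of heap consumed by search tasks, 0..1
     */
    public static double score(int queuedTasks, double cpuUtilization, double heapUtilization) {
        double queuePenalty = Math.log1p(queuedTasks);        // diminishing weight for raw count
        double resourcePenalty = Math.max(cpuUtilization, heapUtilization);
        // A node close to saturation is heavily penalised regardless of queue depth.
        return (1.0 + queuePenalty) * (1.0 + 4.0 * resourcePenalty * resourcePenalty);
    }

    private ReplicaRank() {}
}
```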

Milestone 2

Goals

2.1 Server Side rejection of in-coming search requests

Currently, search rejections are based solely on the number of tasks in the queue of the search thread pool. That doesn't provide fairness in rejections: many small queries can exhaust this limit even though they are not resource-intensive and the node could take in many more requests, and vice versa. Essentially, the count is not a reflection of actual work.

Hence, based on the metrics in point 1.1 above, we want to build a framework which can perform more informed rejections based on point-in-time resource utilisation. The new model will make the admission decision for search requests on the node. These admission decisions or rejection limits can operate at different levels:

This can be further evolved to support a shard-level priority model, where users can set a priority on an index or on every request, so that the framework can consume it when making admission/rejection decisions.

If the user has configured partial results to be allowed, then these rejections, combined with the coordinator's inability to retry the request on another shard copy on a different node, might result in the user getting a partial response.

The above will provide the required isolation of accounting and fairness in rejections, which is currently missing. This is still a reactive backpressure mechanism, as it only focuses on current consumption and does not estimate the future work still to be done for these search requests.
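As a minimal sketch of what such a node-level admission check could look like, assuming the tracked search resource usage from point 1.1 is available as simple fractions (the thresholds and names are illustrative, not shipped settings):

```java
/**
 * Hypothetical admission check: reject new shard search requests based on
 * the point-in-time resource footprint of in-flight search tasks rather
 * than the number of queued tasks.
 */
public class SearchAdmissionController {

    private final double maxSearchCpuFraction;   // e.g. 0.75 of node CPU
    private final double maxSearchHeapFraction;  // e.g. 0.05 of heap

    public SearchAdmissionController(double maxSearchCpuFraction, double maxSearchHeapFraction) {
        this.maxSearchCpuFraction = maxSearchCpuFraction;
        this.maxSearchHeapFraction = maxSearchHeapFraction;
    }

    /**
     * @param searchCpuFraction  CPU fraction currently attributed to search tasks
     * @param searchHeapFraction heap fraction currently attributed to search tasks
     * @return true if the request may proceed, false if it should be rejected
     */
    public boolean admit(double searchCpuFraction, double searchHeapFraction) {
        // Count-based gating becomes work-based: many cheap queries remain
        // admitted, while a few expensive ones are enough to trigger rejections.
        return searchCpuFraction < maxSearchCpuFraction
            && searchHeapFraction < maxSearchHeapFraction;
    }
}
```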

2.2 Server side Cancellation of in-flight search requests based on resource consumption

This is the third level, which kicks in after we are already rejecting all new search requests coming to a node. Here, we decide to cancel ongoing requests if resource usage for that shard/node has breached the assigned limits (point 2.1) and shows no recovery within a certain time threshold. The backpressure model should support identifying the queries which are the most resource-guzzling and whose cancellation wastes the least useful work; these can then be cancelled to recover a node under load while it continues doing useful work.
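One way to pick cancellation victims under that goal (free the most resources while discarding the least completed work) is sketched below; the heuristic and names are illustrative, not the merged design:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical victim selection for cancelling in-flight search tasks. */
public final class CancellationPolicy {

    /** Minimal view of an in-flight task for the purposes of this sketch. */
    public record InFlightTask(long taskId, long cpuNanosConsumed, double fractionComplete) {}

    /**
     * Prefer the task that holds the most resources but has done the least
     * useful work, so cancelling it frees capacity with minimal waste.
     */
    public static Optional<InFlightTask> selectVictim(List<InFlightTask> tasks) {
        return tasks.stream()
            .max(Comparator.comparingDouble(
                t -> t.cpuNanosConsumed() * (1.0 - t.fractionComplete())));
    }

    private CancellationPolicy() {}
}
```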

Milestone 3

Goals

3.1 Query resource consumption estimation

Improve the framework added as part of point 2 to also estimate the cost of new search queries, based on their potential resource utilisation. This can be achieved by looking at past query patterns with similar query constructs, together with the actual data on the shard. It will help build a proactive backpressure model for search requests, where estimates are compared against the available resources during the admission decision, for more granular control.
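A first-cut estimator could, for example, key recent per-query resource measurements by a coarse description of the query's structure and use a running average as the predicted cost of a similar incoming query. Purely illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical cost estimator: predicts the resource cost of an incoming
 * query from past measurements of queries with a similar structure.
 */
public class QueryCostEstimator {

    /** Running [count, mean CPU nanos] of observed cost, keyed by query "shape". */
    private final Map<String, double[]> stats = new ConcurrentHashMap<>();

    /** queryShape is a coarse feature key, e.g. "bool+terms_agg" (illustrative). */
    public void recordObservation(String queryShape, long cpuNanos) {
        stats.merge(queryShape, new double[] {1, cpuNanos}, (old, fresh) -> {
            double count = old[0] + 1;
            double mean = old[1] + (cpuNanos - old[1]) / count;   // incremental mean
            return new double[] {count, mean};
        });
    }

    /** Predicted CPU cost, or a conservative default when the shape is unseen. */
    public double estimateCpuNanos(String queryShape, double defaultCpuNanos) {
        double[] s = stats.get(queryShape);
        return s == null ? defaultCpuNanos : s[1];
    }
}
```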

reta commented 2 years ago

Thanks for bringing this up, @tushar-kharbanda72, there is no doubt that applying back pressure on the search flows would lead to more stable clusters. There are quite a few successful implementations related to this proposal, e.g. Akka's Adaptive Load Balancing (https://doc.akka.io/docs/akka/current/cluster-metrics.html#adaptive-load-balancing), which to some extent addresses the same problem (it could be a good case study).

For this we’ll introduce resource tracking framework for end to end memory/cpu tracking of Search request, covering different phases of execution across all shard requests.

I assume the only way to claim that a node uses all its resources for search is when it has a single data role (or alike). In many deployments this is not the case, and the same nodes may be involved in ingestion and/or cluster coordination, etc. How would the backpressure / resource tracking framework deal with that?

Also, should the resource tracking framework consult the overall node load (memory/CPU at least)? Co-located deployments sadly are not that rare :(

Task level prioritizations based on current state of request can be achieved, such as preferring requests already in fetch phase over those yet to be picked for query phase.

I think in general it makes a lot of sense. But here is a counterexample (seen in the wild): the query phase takes a very long time, followed by the fetch phase. At that moment, the client has already lost any hope of getting the response (it may even be gone), but the server is still working on it, and live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weigh its completion relevance to the client?

Improve the framework added as part of (point 2) to also estimate the cost of new search queries, based on their potential resource utilisation.

The approach(es) you have mentioned make sense. Probably it would be good to invest in a full-fledged query planner which will assign costs to each query element and to the overall query at the end? It may not be easy, for sure, but an iterative approach could be taken to refine it along the way.

Thank you.

asafm commented 2 years ago

I think in general it makes a lot of sense. But here is a counterexample (seen in the wild): the query phase takes a very long time, followed by the fetch phase. At that moment, the client has already lost any hope of getting the response (it may even be gone), but the server is still working on it, and live queries are going to be rejected because the fetch phase for this "dead" query is still ongoing. Would it make sense to take into account the overall age of the query (when kicking off the fetch phase) to weigh its completion relevance to the client?

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.

reta commented 2 years ago

Perhaps we can check if the client connection is still active, and if not, propagate it to the data nodes to cancel that query.

@asafm I think something along these lines, for example interpreting client disconnects as task cancellations and propagating that across the nodes (since the coordinator is aware of the client state and the tasks).
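As a sketch of that idea, assuming hook points for channel-close notifications and task cancellation exist (the interfaces below stand in for the real plumbing and are not the actual task management API):

```java
/**
 * Hypothetical coordinator-side hook: when the HTTP channel backing a
 * search request closes, cancel the root task, which propagates to the
 * child shard tasks on the data nodes.
 */
public class DisconnectAwareSearchHandler {

    /** Stand-ins for the real channel/task plumbing. */
    public interface HttpChannel { void addCloseListener(Runnable onClose); }
    public interface TaskManager { void cancelTaskAndDescendants(long rootTaskId, String reason); }

    private final TaskManager taskManager;

    public DisconnectAwareSearchHandler(TaskManager taskManager) {
        this.taskManager = taskManager;
    }

    public void onSearchStarted(long rootTaskId, HttpChannel channel) {
        // Interpret a client disconnect as a cancellation request and fan it
        // out to the data nodes so no further work is wasted on a dead query.
        channel.addCloseListener(() ->
            taskManager.cancelTaskAndDescendants(rootTaskId, "client disconnected"));
    }
}
```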

dblock commented 2 years ago

This is a good proposal 👍. Resource consumption tracking is great, but it's a lot of work, so I would consider simpler approaches in parallel.

sruti1312 commented 2 years ago

Hi @CEHENKLE, this issue requires active collaboration from multiple developers and performance testing before merging to the main branch. Can you help with creating a feature branch for this, feature/task-resource-tracking? Related issue: https://github.com/opensearch-project/OpenSearch/issues/1009

CEHENKLE commented 2 years ago

Done! https://github.com/opensearch-project/OpenSearch/tree/feature/task-resource-tracking

tushar-kharbanda72 commented 2 years ago

Thanks @dblock for going through the proposal. We have the PRs related to task resource tracking in progress and that should complete soon. Targeting March for that.

Soon we'll fast-follow that with the initial low-effort rejection strategies which you and others mentioned, which we'll include as part of Milestone 2.

Milestone 3 is a lot of work and we'll have to evaluate the need and criticality after we have initial improvements merged in.

getsaurabh02 commented 2 years ago

Adding Meta Issue and the child issues opened previously to this RFC in order to track deliverables: https://github.com/opensearch-project/OpenSearch/issues/1042

anasalkouz commented 2 years ago

@tushar-kharbanda72 @getsaurabh02 since you have a meta issue to track the progress of this initiative, could you close the RFC? And what is the target release for this project?

elfisher commented 2 years ago

Is this still on track for 2.2?

rramachand21 commented 2 years ago

This (milestone 2) will come in 2.3 - we are merging in the changes for resource tracking framework in 2.2 (milestone 1)

elfisher commented 2 years ago

Thanks @rramachand21!

elfisher commented 2 years ago

@rramachand21 is there an issue cut for milestone 2? I'd like to sync it up with the related documentation issue.

JeffHuss commented 2 years ago

@rramachand21 please provide a feature document for this so we can get it documented if it's still on track for 2.3. The docs issue is https://github.com/opensearch-project/documentation-website/issues/795

Currently I'm blocked.

anasalkouz commented 1 year ago

@tushar-kharbanda72 is this still on track for the 2.4 release? Code freeze is on 11/3. Is there anything pending? Otherwise, feel free to close it.

anasalkouz commented 1 year ago

@rramachand21 is this on track for 2.4 release? code freeze today

minalsha commented 1 year ago

@rramachand21 Is there anything pending on this? If not, can we close this issue, since 2.4 is already in production, and update it with the latest version that you are tracking?

kartg commented 1 year ago

@rramachand21 is this on track for 2.6.0 release? The code freeze for this is today (Feb 21, 2023)

anasalkouz commented 1 year ago

Hey @tushar-kharbanda72, is this feature still on track for 2.7? The code freeze date is April 17, 2023.

rramachand21 commented 1 year ago

@ketanv3 could you please update the latest on this?

DarshitChanpura commented 1 year ago

Hi @tushar-kharbanda72, this issue will be marked for the next release, v2.8.0, on Apr 17, as that is the code-freeze date for v2.7.0. Please let me know if otherwise.

DarshitChanpura commented 1 year ago

Tagging it to next release: v2.8.0