opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Shard Indexing backpressure mechanism should also protect from any CPU contention on nodes #7638

Open CaptainDredge opened 1 year ago

CaptainDredge commented 1 year ago

Is your feature request related to a problem? Please describe. Currently, the shard indexing pressure rejection strategy is determined by the node- and shard-level memory occupancy limits, along with performance parameters such as shard-level throughput limits. Thus, the current indexing backpressure only accounts for memory bottlenecks and may not protect a system under duress due to CPU contention from a high indexing workload. This can lead to node drops and cluster unavailability over time, since there may not be enough CPU to fulfil essential cluster tasks such as responding to health checks. This calls for CPU utilization to be one of the trigger points of indexing backpressure.

Describe the solution you'd like Currently, shard indexing pressure has one hard limit parameter, the Node Heap Memory Occupancy Hard Limit - whenever the current memory occupancy at the node level reaches 100% of the assigned node memory, this parameter is considered breached.

The node heap memory hard limit protects the node from memory exhaustion; similarly, we should add a hard limit for CPU utilization to protect the node from sustained high CPU usage.

Node CPU Utilization Hard Limit - Whenever the currently tracked CPU utilization at the node level exceeds 95%, this parameter is considered breached.

Currently, indexing pressure has two primary parameters, both based on memory occupancy, at the node and shard level respectively.

To account for CPU contention on the node, we will add a third primary parameter.

The final rejection rules will become:

  1. Current Node Bytes is greater than Assigned Node Memory.
  2. Node CPU Utilization hard limit is breached.
  3. Shard Memory Occupancy is greater than 95% of assigned Shard Memory and Current Node Bytes is greater than 70% of assigned Node Memory. Then reject if the Last Successful Request Threshold is breached or Throughput Degradation is breached.
  4. CPU utilization soft limit is breached on the node. Then reject if the Last Successful Request Threshold is breached or Throughput Degradation is breached.
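The four rules above can be sketched as a single boolean check. This is a minimal illustration only, not the actual OpenSearch implementation; the method name, signature, and the CPU soft limit value (the RFC does not specify one, so 75% is assumed here) are all hypothetical:

```java
// Hypothetical sketch of the proposed rejection rules; names and the
// CPU soft limit threshold are illustrative assumptions, not real code.
public class RejectionRules {
    public static boolean shouldReject(
            long nodeBytes, long assignedNodeBytes,
            double nodeCpuPct,                        // sliding-window average
            long shardBytes, long assignedShardBytes,
            boolean lastSuccessfulRequestBreached,
            boolean throughputDegraded) {
        // Rule 1: node memory hard limit.
        if (nodeBytes > assignedNodeBytes) return true;
        // Rule 2: node CPU hard limit (95%).
        if (nodeCpuPct > 95.0) return true;
        boolean secondary = lastSuccessfulRequestBreached || throughputDegraded;
        // Rule 3: shard soft limit (95%) + node soft limit (70%), then secondary checks.
        if (shardBytes > 0.95 * assignedShardBytes
                && nodeBytes > 0.70 * assignedNodeBytes
                && secondary) return true;
        // Rule 4: node CPU soft limit (assumed 75% here), then secondary checks.
        return nodeCpuPct > 75.0 && secondary;
    }

    public static void main(String[] args) {
        // Node CPU at 96% breaches the hard limit regardless of memory state.
        System.out.println(shouldReject(10, 100, 96.0, 1, 10, false, false)); // true
        System.out.println(shouldReject(10, 100, 50.0, 1, 10, false, false)); // false
    }
}
```

Note that rules 1 and 2 are unconditional hard limits, while rules 3 and 4 only reject when one of the secondary performance signals also fires.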

For CPU usage tracking, we can have a component, CPUResourceWatcher, which will be a scheduled task responsible for taking snapshots at regular intervals (5 secs). It will update the average CPU % utilization over a configurable sliding window. The length of the sliding window determines how quickly indexing pressure reacts to changes in CPU utilization.

We can provide a dynamic setting, UtilizationResponsiveness, which can be set to Low, Medium or High. Each responsiveness level will have a default sliding window length, e.g. 10s, 15s and 30s for High, Medium and Low respectively. Exact values for the sliding window length will be decided after performance benchmarking. Lower responsiveness, i.e. a longer sliding window, responds more slowly to changes in utilization: only a sustained increase in utilization over a sufficiently long period will trigger indexing pressure. This assumes that brief spikes in utilization are benign and do not affect the overall health of the node. If instead we desire a fast response to any sudden increase in utilization, High responsiveness is the better choice.
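The sliding-window mechanism described above can be sketched as follows. This is a rough illustration under the RFC's stated numbers (5s snapshots; 10s/15s/30s windows); the class shape, method names, and scheduling are hypothetical, not the proposed implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the proposed CPUResourceWatcher: an external
// scheduler would call record() every 5 seconds, and the watcher keeps
// a sliding-window average whose length depends on responsiveness.
public class CpuResourceWatcher {
    public enum Responsiveness { HIGH, MEDIUM, LOW }

    static final long SNAPSHOT_INTERVAL_SECS = 5;
    private final int windowSize;                  // snapshots retained
    private final Deque<Double> snapshots = new ArrayDeque<>();
    private double sum = 0.0;

    public CpuResourceWatcher(Responsiveness r) {
        // Example window lengths from the RFC: 10s / 15s / 30s.
        long windowSecs = switch (r) {
            case HIGH -> 10;
            case MEDIUM -> 15;
            case LOW -> 30;
        };
        this.windowSize = (int) (windowSecs / SNAPSHOT_INTERVAL_SECS);
    }

    // Invoked by the scheduler every SNAPSHOT_INTERVAL_SECS.
    public void record(double cpuPct) {
        snapshots.addLast(cpuPct);
        sum += cpuPct;
        if (snapshots.size() > windowSize) {
            sum -= snapshots.removeFirst();       // evict oldest snapshot
        }
    }

    public double averageUtilization() {
        return snapshots.isEmpty() ? 0.0 : sum / snapshots.size();
    }

    public static void main(String[] args) {
        CpuResourceWatcher w = new CpuResourceWatcher(Responsiveness.HIGH); // 2 snapshots
        w.record(40.0);
        w.record(60.0);
        w.record(100.0);                           // 40.0 falls out of the window
        System.out.println(w.averageUtilization()); // 80.0
    }
}
```

The effect of the responsiveness setting is visible here: a HIGH window of two snapshots forgets the 40% reading quickly, while a LOW window of six snapshots would still be averaging it in.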

I am looking for feedback from the community to evolve this feature from an idea to concrete proposal.

CaptainDredge commented 1 year ago

Tagging relevant folks to get their thoughts on above issue and proposed solution @getsaurabh02 , @psychbot , @shwetathareja

Bukhtawar commented 1 year ago

For indexing, the payload bytes serve as a good mechanism to estimate the amount of work. What you are referring to is a mechanism to do admission control. Backpressure is a way to tell the higher layers to not accept more work if the last leg is slogging to catch up. To improve on backpressure we need a mechanism to get the CPU cost of the indexing request and apply backpressure accordingly. Note backpressure isn't about rejection point in time. Having said that, @bharath-techie and @ajaymovva are working on admission control at multiple levels (including the transport layer).

CaptainDredge commented 1 year ago

> Backpressure is a way to tell the higher layers to not accept more work if the last leg is slogging to catch up

Yes, that's exactly the intent here: if a downstream node (e.g. the replica is downstream of the primary in the bulk request lifecycle) is lagging due to CPU contention, then the upstream node can reject additional work until the downstream node is no longer under duress.

> To improve on backpressure we need a mechanism to get the CPU cost of the indexing request and apply backpressure

Is taking backpressure decisions solely on the CPU cost of the indexing request justified? Other tasks such as search, or background tasks such as merging, may be consuming a lot of CPU time, and in those cases as well we should apply backpressure so the upstream node does not send too much indexing load, allowing downstream nodes to recover. Even if the downstream node accepts the request, it will take a long time to complete indexing due to CPU contention, and the overall bulk requests will lead to 504s from the client's perspective.

I believe the CPU cost of an indexing request can be helpful as a secondary parameter to do shard-level rejections. Once we are sure the downstream node is under duress due to overall CPU, which will be a primary parameter, then we may look at the aggregated CPU cost per shard to reject indexing requests for those shards.
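The two-level idea above could look roughly like this. Everything here is a hypothetical sketch of the comment's proposal: node-level CPU as the primary gate, a shard's share of aggregated indexing CPU cost as the secondary signal; the names, units (CPU nanos), and the share threshold are all assumptions:

```java
// Hypothetical sketch: reject indexing for a shard only when the node is
// under CPU duress (primary) AND this shard accounts for an outsized
// share of the tracked indexing CPU cost (secondary). All thresholds
// and names are illustrative assumptions, not proposed values.
public class ShardCpuRejection {
    public static boolean rejectForShard(double nodeCpuPct, double nodeCpuSoftLimitPct,
                                         long shardCpuNanos, long totalIndexingCpuNanos,
                                         double shardShareThreshold) {
        if (nodeCpuPct <= nodeCpuSoftLimitPct) return false; // primary gate not breached
        if (totalIndexingCpuNanos == 0) return false;        // nothing tracked yet
        double share = (double) shardCpuNanos / totalIndexingCpuNanos;
        return share > shardShareThreshold;                  // secondary, per-shard check
    }

    public static void main(String[] args) {
        // Node at 90% CPU; this shard consumed 60% of tracked indexing CPU.
        System.out.println(rejectForShard(90.0, 75.0, 600, 1000, 0.5)); // true
        // Same shard share, but the node itself is healthy: no rejection.
        System.out.println(rejectForShard(50.0, 75.0, 600, 1000, 0.5)); // false
    }
}
```

The point of the gate ordering is that per-shard CPU accounting alone never triggers rejections on a healthy node; it only selects which shards to shed once the node-level signal fires.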

For calculating overall CPU time we may choose to devise an approach similar to the resource tracking framework introduced as part of search backpressure, although it isn't straightforward to do so, since multiple tasks are involved in the indexing request lifecycle.

> Note backpressure isn't about rejection point in time

Backpressure is still a point-in-time decision made by maintaining a view of the last few time units. To me, the distinction between backpressure and admission control is less about time and more about where, how and on which information they act. Backpressure on a particular node works by keeping an aggregated view of all downstream nodes and rejecting requests based on downstream node state, as opposed to admission control, which has only local knowledge on the node and makes rejections based on self metrics, i.e. self-rejections (similar to thread pool and circuit breaker rejections). In addition, admission control doesn't act on the transport layer, whereas backpressure protects both HTTP and transport.

> Having said that @bharath-techie and @ajaymovva are working on admission control at multiple levels

Do we have a GH issue/RFC for this?