opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Extensible Health check framework #4244

Open Gaganjuneja opened 2 years ago

Gaganjuneja commented 2 years ago

Is your feature request related to a problem? Please describe.
Feature Request #4052.

Problem Statement

The OpenSearch core framework performs a very basic ping-based health check. The leader and followers ping each other and rely on the fact that if a ping request is answered within a set time, the node is considered healthy. Along with the ping, it also checks disk health by writing a temp file locally on the pinged host. The ping-based health check works well, but it only returns a boolean answer to whether the node is up or not; it gives no signal about the deteriorating health of the host, e.g. a disk that has been performing slowly for the last X minutes, contributing to high latency, low throughput, etc., and affecting operations like replication, search queries, and ingestion. The health check framework would only learn about this once the node stops responding to ping requests. Such a faulty node may stay in the cluster for a very long time as long as it manages to respond to pings, all the while degrading system performance and availability. Scenarios:

  1. Slow/grey disk - A slow disk means the latency of read and write operations is increasing. OpenSearch follows a primary-backup synchronous replication model and is sensitive to impairments on the write path, so disk slowness may impact cluster performance. Slow I/O might not be caught by the ping-based node health check if the node answers the ping call in time. This calls for a better, proactive disk health check that evicts the node at the earliest opportunity and avoids any availability drop or degraded performance.
  2. Sluggish network - Similarly, a slow network or packet loss might not be caught by the ping-based health check and will lead to issues. We should be able to identify such nodes and take action in time.

Current State

At present, we perform the follower ping, read-only filesystem, and stuck I/O checks.

  1. Leader/Follower checks – The leader pings all the followers on a periodic basis to validate their availability and check their health. Followers likewise check their leader's health via ping requests.

  2. Read-only filesystem detection – While bootstrapping a new node, the system checks whether the filesystem is healthy.

  3. Stuck I/O detection – The system tries to write a temp file to see if the filesystem is writable (a minimal sketch of such a check follows this list). This is done at multiple places/times.

    1. Follower Checker Ping – While doing the follower checker ping, the request handler also detects stuck I/O on the follower node. A faulty-node list is built at the leader with all the failing nodes, and the leader removes these unhealthy nodes from the cluster.
    2. Leader Check Ping – All follower nodes check the leader's status by making a ping request; similarly, the request handler detects stuck I/O on the leader node and requests a leader failure if the leader is reported unhealthy.
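For illustration, a minimal sketch of this kind of temp-file writability check (hypothetical code, not the actual OpenSearch implementation; the class and method names are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of a stuck-I/O style check: try to write and sync a temp
// file on the data path and report UNHEALTHY if the write fails. A real
// implementation would also bound how long the write may take, to catch I/O
// that hangs rather than fails outright.
public final class TempFileWriteCheck {

    public enum Status { HEALTHY, UNHEALTHY }

    public static Status check(Path dataPath) {
        Path tempFile = dataPath.resolve(".health-check-" + System.nanoTime() + ".tmp");
        try {
            Files.write(tempFile, new byte[] { 0 },
                StandardOpenOption.CREATE_NEW, StandardOpenOption.SYNC);
            return Status.HEALTHY;
        } catch (IOException e) {
            return Status.UNHEALTHY;            // read-only or failing filesystem
        } finally {
            try {
                Files.deleteIfExists(tempFile); // best-effort cleanup
            } catch (IOException ignored) {
            }
        }
    }
}
```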

Once removed, the node is prevented from participating in any usual cluster operations by additionally handling the scenarios below:

  1. Joining of a new/previously removed node - While adding a new node, the system checks for stuck I/O to prevent an unhealthy node from joining back into the cluster.
  2. Pre-vote request - While handling pre-vote requests, this is checked to stop unhealthy nodes from participating in the voting process.

Describe the solution you'd like

Proposal

The idea is to introduce an extensible health check framework so that, instead of merely reacting to node failure, OpenSearch can proactively identify bad/sluggish nodes well in advance and take the necessary action. The health check framework should be generic enough to open the door for users to implement their own health checks as per their needs; there will be different checks for different cloud providers and underlying infrastructures. OpenSearch core should be resilient to all these health check implementations so that a wrong implementation cannot bring the entire cluster down.

Approach

The approach is to come up with an extension point for plugins to implement health checks (a minimal interface sketch follows the plugin responsibilities section below). These health check plugins will be called by a new HealthCheckService from the above-mentioned places/times as applicable. I can think of the following primary features of this framework.

Features

  1. Asynchronous – HealthCheckService should be able to call these plugins asynchronously on a periodic basis and keep the node health ready, so that any ping call or other health check call stays low-latency and doesn't wait for these plugins to run.
  2. Action – Each health check failure should be associated with a standard action. It could be node eviction or some other remedial action. For now we are focusing only on node eviction; other remedial actions would be local to the host anyway and handled separately. The system should still provide a generic interface and, for now, only support the node eviction action.
  3. Low blast radius – Primarily for the follower health check: if there is an issue with the health check plugin code, it might start sending a false UNHEALTHY status, which may eventually bring down all the nodes. Similarly for the cluster manager: if the LeaderChecker (performed by followers) starts reporting a false UNHEALTHY status, quorum formation may stop, leader election won't happen, and the cluster will be out of service. To avoid this we need a circuit breaker that opens if, within X minutes, more than Y% (or Z as an absolute count) of nodes report an unhealthy status, after which we stop trusting this health check (a rough sketch of this threshold logic follows this list). The cause could be either a plugin issue or a large-scale event; in both cases I believe the decision to evict nodes shouldn't be taken through the health check framework. Circuit breakers should be able to reset automatically after some time. It will be difficult to introduce a circuit breaker at the leader level since that requires centralized coordination.
  4. Configurability
    1. Enable/Disable HealthCheckPlugin - There should be a mechanism to disable/exclude a faulty plugin without a cluster restart.
    2. Circuit breaker – Circuit breaker threshold settings such as the X-minute window and the number/percentage of nodes.
    3. Timeouts – Plugin call timeouts, etc.
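As referenced in the blast-radius feature above, a rough sketch of the circuit-breaker threshold logic in Java (the class name, the sliding-window counting of reports, and the self-reset behaviour are illustrative assumptions, not a committed design):

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative circuit breaker: opens when, within a sliding window of X minutes,
// more than Y% of nodes (or Z nodes in absolute terms) report UNHEALTHY, and
// self-resets after a cool-down period. For simplicity it counts unhealthy
// reports rather than deduplicating by node id.
public final class HealthCheckCircuitBreaker {

    private final Duration window;          // "X minutes"
    private final double maxUnhealthyRatio; // "Y%"
    private final int maxUnhealthyCount;    // "Z"
    private final Duration coolDown;
    private final Clock clock;

    private final Deque<Instant> unhealthyReports = new ArrayDeque<>();
    private Instant openedAt = null;

    public HealthCheckCircuitBreaker(Duration window, double maxUnhealthyRatio,
                                     int maxUnhealthyCount, Duration coolDown, Clock clock) {
        this.window = window;
        this.maxUnhealthyRatio = maxUnhealthyRatio;
        this.maxUnhealthyCount = maxUnhealthyCount;
        this.coolDown = coolDown;
        this.clock = clock;
    }

    public synchronized void recordUnhealthy() {
        unhealthyReports.addLast(clock.instant());
    }

    /** True if plugin-reported health should be ignored for now. */
    public synchronized boolean isOpen(int totalNodes) {
        Instant now = clock.instant();
        if (openedAt != null) {
            if (Duration.between(openedAt, now).compareTo(coolDown) < 0) {
                return true;                 // still cooling down
            }
            openedAt = null;                 // self-reset
            unhealthyReports.clear();
        }
        // Drop reports that fell out of the sliding window.
        while (!unhealthyReports.isEmpty()
                && Duration.between(unhealthyReports.peekFirst(), now).compareTo(window) > 0) {
            unhealthyReports.removeFirst();
        }
        int unhealthy = unhealthyReports.size();
        if (unhealthy >= maxUnhealthyCount
                || (totalNodes > 0 && (double) unhealthy / totalNodes > maxUnhealthyRatio)) {
            openedAt = now;                  // stop trusting the health check
            return true;
        }
        return false;
    }
}
```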

Plugin Responsibilities

  1. Reliable – Plugin implementers must ensure the plugin is reliable and make a best effort to avoid returning false-positive and false-negative health updates. It should be able to reach a decision quickly, but at the same time the decision shouldn't be based on a single metric data point.
  2. Classification - Plugin implementers should be very careful in deciding which types of nodes should run this health check.
  3. Lightweight – The plugin should be lightweight and shouldn't use many node resources, to avoid unnecessary pressure on nodes.
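To make the extension point concrete, a minimal sketch of what a health check plugin contract could look like (HealthCheckPlugin, NodeHealthStatus, and HealthCheckResult are hypothetical names, not an existing OpenSearch API; a real version would hook into OpenSearch's plugin mechanism):

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical extension point for pluggable health checks.
public interface HealthCheckPlugin {

    enum NodeHealthStatus { HEALTHY, UNHEALTHY, UNKNOWN }

    /** Result of one health check run, with an optional reason for diagnostics. */
    record HealthCheckResult(NodeHealthStatus status, String reason) {}

    /** Unique name used for enabling/disabling the check via settings. */
    String name();

    /**
     * Runs the check asynchronously so the health check service can cache the
     * result and ping handlers never block on plugin code.
     */
    CompletableFuture<HealthCheckResult> checkHealth();
}
```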

Proposed Solution

Solution – 1 (HealthCheck status with PING call)

We should leverage the existing health check mechanism rather than making a new health check call. The request handler calls a NodeHealthService implementation to get the node health. NodeHealthServiceImpl asynchronously calls all the health check plugins and keeps the node health cached. FollowersChecker caches the state of all the nodes and uses the circuit breaker to decide on node action. It is hard to implement a circuit breaker for the leader checker, because it runs on all the follower nodes and it is difficult to fetch cluster-level state there.
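A minimal sketch of the asynchronous caching behaviour described above, reusing the hypothetical HealthCheckPlugin interface sketched earlier (this is not the existing NodeHealthService API):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical: periodically refreshes node health from all plugins in the
// background and serves the cached value to ping request handlers.
public final class CachedNodeHealthService {

    private final List<HealthCheckPlugin> plugins;
    private final AtomicReference<HealthCheckPlugin.NodeHealthStatus> cached =
        new AtomicReference<>(HealthCheckPlugin.NodeHealthStatus.HEALTHY);
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public CachedNodeHealthService(List<HealthCheckPlugin> plugins, long periodSeconds) {
        this.plugins = plugins;
        scheduler.scheduleAtFixedRate(this::refresh, 0, periodSeconds, TimeUnit.SECONDS);
    }

    // Node is UNHEALTHY if any plugin reports UNHEALTHY; failures and timeouts
    // of a plugin are ignored so a misbehaving plugin cannot fail the ping path.
    private void refresh() {
        HealthCheckPlugin.NodeHealthStatus result = HealthCheckPlugin.NodeHealthStatus.HEALTHY;
        for (HealthCheckPlugin plugin : plugins) {
            try {
                HealthCheckPlugin.HealthCheckResult r =
                    plugin.checkHealth().get(5, TimeUnit.SECONDS); // plugin call timeout
                if (r.status() == HealthCheckPlugin.NodeHealthStatus.UNHEALTHY) {
                    result = HealthCheckPlugin.NodeHealthStatus.UNHEALTHY;
                }
            } catch (Exception e) {
                // Treat plugin errors as no signal rather than as unhealthy.
            }
        }
        cached.set(result);
    }

    /** Called from the ping request handler; never blocks on plugin code. */
    public HealthCheckPlugin.NodeHealthStatus currentHealth() {
        return cached.get();
    }
}
```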

[Diagram: Solution 1 architecture – health check status returned with the ping call]

Pros

  1. The health status check is done within the existing ping call; no new connection needs to be opened.
  2. Limited blast radius thanks to the circuit breaker.
  3. Extensible: it can support any number of different plugins.

Cons

  1. Tightly coupled with FollowersChecker and LeaderChecker.
  2. It is hard to implement a cluster-level health view, which could be used to implement the circuit breaker and stop trusting false positives when there is a bug in the health check plugin or an AZ-wide issue is going on.

PS - We can also consider using this extensible health check only for data/follower nodes and skipping the leader nodes. We can discuss the potential health check scenarios for leaders and decide.

Solution – 2 (Separate service that performs health checks and maintains node health status at the cluster level)

We can define a new service, ClusterHealthCheckService, which calls all the health check plugins for all the nodes and caches their health status. This service has a cluster-level view of the health of all nodes and helps decide on further action. We can easily detect whether 1) the faulty nodes belong to the same AZ, 2) the cluster state would remain viable if we evict the node, 3) the health check failures are due to a bug, 4) an entire AZ is down, 5) node eviction would lead to shard imbalance, etc. This service gives signals to the cluster manager for further node actions.
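A rough sketch of the kind of cluster-level decision logic such a service could apply before recommending eviction (class names, record fields, and the specific safety checks are illustrative assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical: given per-node health reports plus each node's availability zone,
// decide whether it is safe to act on the unhealthy reports or better to hold off.
public final class ClusterHealthDecision {

    public record NodeReport(String nodeId, String availabilityZone, boolean healthy) {}

    public static List<String> nodesSafeToEvict(List<NodeReport> reports,
                                                double maxEvictionRatio) {
        List<NodeReport> unhealthy = reports.stream()
            .filter(r -> !r.healthy())
            .collect(Collectors.toList());

        // 1) Too many unhealthy nodes at once: likely a plugin bug or a
        //    large-scale event, so do not evict through this framework.
        if (reports.isEmpty()
                || (double) unhealthy.size() / reports.size() > maxEvictionRatio) {
            return List.of();
        }

        // 2) All failures concentrated in one AZ: suspect an infrastructure issue
        //    rather than individual bad nodes, so hold off on automatic eviction.
        Map<String, Long> byZone = unhealthy.stream()
            .collect(Collectors.groupingBy(NodeReport::availabilityZone, Collectors.counting()));
        if (byZone.size() == 1 && unhealthy.size() > 1) {
            return List.of();
        }

        return unhealthy.stream().map(NodeReport::nodeId).collect(Collectors.toList());
    }
}
```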

[Diagram: Solution 2 architecture – ClusterHealthCheckService maintaining cluster-level node health]

Pros

  1. The circuit breaker can be implemented at the cluster level.
  2. Extensible: it can support any number of different plugins.
  3. The extensible health check need not be as aggressive as the ping check, so we can use a separate call frequency such as every 10/30 seconds.
  4. We can easily isolate the logic to identify false negatives/positives in the ClusterHealthCheckService.
  5. Easy to enable/disable the health check plugins, framework, etc.

Cons

  1. A separate call would be needed to fetch the health status, so there would be a slight overhead on the system. However, since this check need not be as aggressive as the ping check, we can reduce the number of calls by using a call frequency such as every 10/30 seconds.

Use case

Grey disk detection - Initially, we can start with grey disk detection as the first use case. OpenSearch already has a few metrics baked into the Performance Analyzer plugin (https://github.com/opensearch-project/performance-analyzer), and these metrics can be further leveraged by the Root Cause Analyzer (https://github.com/opensearch-project/performance-analyzer-rca) to detect a faulty node with grey I/O failures. We can write a plugin that calls the RCA framework to get disk-related metrics over a sufficient period of time and decides on the node's disk health.
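A sketch of what such a plugin's decision logic might look like, assuming a hypothetical metrics client that returns recent disk-latency samples (the real Performance Analyzer / RCA APIs differ; every name here is made up for illustration):

```java
import java.util.List;

// Hypothetical grey-disk check: flag the node as unhealthy only when disk write
// latency has stayed above a threshold for a sustained window, never on a single sample.
public final class GreyDiskHealthCheck {

    /** Assumed abstraction over Performance Analyzer / RCA metrics; not a real API. */
    public interface DiskMetricsClient {
        List<Double> writeLatencyMillisLastMinutes(int minutes);
    }

    private final DiskMetricsClient metrics;
    private final double latencyThresholdMillis;
    private final int windowMinutes;

    public GreyDiskHealthCheck(DiskMetricsClient metrics,
                               double latencyThresholdMillis, int windowMinutes) {
        this.metrics = metrics;
        this.latencyThresholdMillis = latencyThresholdMillis;
        this.windowMinutes = windowMinutes;
    }

    public boolean isDiskHealthy() {
        List<Double> samples = metrics.writeLatencyMillisLastMinutes(windowMinutes);
        if (samples.isEmpty()) {
            return true; // no data: avoid reporting a false UNHEALTHY
        }
        long breaches = samples.stream().filter(l -> l > latencyThresholdMillis).count();
        // Unhealthy only if the large majority of the window breached the threshold.
        return breaches < 0.9 * samples.size();
    }
}
```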



Gaganjuneja commented 2 years ago

@elfisher @muralikpbhat @reta + others, please provide your inputs. @Bukhtawar, please help me tag the relevant folks.

reta commented 2 years ago

@Gaganjuneja thanks for the summary; I think this would greatly improve visibility into the cluster as a whole.

Using circuit breakers as the decision maker for fault detection / node eviction (FollowerChecker / LeaderChecker) seems doubtful (not because they are useless, but because of the way circuit breakers are designed). We may look at the Phi Accrual Failure Detector algorithm [1]; as far as I know it is used by Akka Cluster and Cassandra for failure detection and is proven to work well.

I personally lean towards the 1st approach (afaik this is more or less how it works now, but with predefined checks), but I have difficulty understanding what you mean by:

Leader level circuit breaker is hard to implement.

Are you referring to the case where the follower cannot ping the leader (but the leader can ping the follower)? Wouldn't such nodes fail the follower check, since they won't be able to send a response to the leader?

[1] https://pdfs.semanticscholar.org/11ae/4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8.pdf

Gaganjuneja commented 2 years ago

Thanks @reta for your review.

  1. Let me clarify: the circuit breaker is not used here for failure detection; it is only there to reduce the overall blast radius and protect the cluster from going down because of a faulty health check plugin. Say there is a bug in the health check plugin and it starts falsely reporting UNHEALTHY; it may happen that every node starts sending a false negative health check. The circuit breaker will then kick in, disable this external health check framework for some time, and alert the user to take action. We could also remove the circuit breaker and assume these health check plugins are thoroughly tested and guarantee correctness, but this is just an additional safeguard to reduce the blast radius.
  2. I am also thinking about something along the lines of the Phi Accrual failure detection technique for the plugin implementation (a simplified sketch follows below). The plugin should decide the action based on data analysis, and OpenSearch core will execute the action; as of now the only supported action would be node eviction. I was thinking of using the RCA framework for time-series metric analysis and taking actions based on that. [Diagram: sample design of a grey disk detection plugin]
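For reference, a compact sketch of the Phi Accrual idea: suspicion of failure (phi) grows continuously with the time since the last heartbeat, based on the history of inter-arrival times. This is simplified; it uses a common logistic approximation of the normal CDF tail and omits the windowed tuning a production detector would need.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified Phi Accrual failure detector: the higher phi, the more suspicious we
// are that the monitored node is down, given its heartbeat inter-arrival history.
public final class PhiAccrualDetector {

    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private final int maxSamples;
    private long lastHeartbeatMillis = -1;

    public PhiAccrualDetector(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    public synchronized void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > maxSamples) {
                intervalsMillis.removeFirst();
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** phi = -log10(probability that a heartbeat arrives later than now, given history). */
    public synchronized double phi(long nowMillis) {
        if (intervalsMillis.size() < 2) {
            return 0.0; // not enough history to judge
        }
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = intervalsMillis.stream()
            .mapToDouble(i -> (i - mean) * (i - mean)).average().orElse(0);
        double stdDev = Math.max(Math.sqrt(variance), 1.0); // avoid division by zero
        double y = (nowMillis - lastHeartbeatMillis - mean) / stdDev;
        // Logistic approximation of the normal CDF tail, as used by common implementations.
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        return y > 0 ? -Math.log10(e / (1.0 + e)) : -Math.log10(1.0 - 1.0 / (1.0 + e));
    }
}
```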

Leader level circuit breaker is hard to implement.

Are you referring to the case where the follower cannot ping the leader (but the leader can ping the follower)? Wouldn't such nodes fail the follower check, since they won't be able to send a response to the leader?

Yes, true. But the idea here is to identify the leader's failure so that followers can trigger voting if the leader is unreachable.

shwetathareja commented 1 year ago

@Gaganjuneja Thanks for the proposal. This would make it easy to plug in more health checks and surface any system issues sooner, before things start breaking in the cluster.

In both proposals, the cluster manager (leader) maintains/caches the health status of each node and then makes a decision. The main difference I see is whether the cluster manager piggybacks this health status collection on FollowerChecker or makes another transport call. As long as the health status is cached and the payload is not too big, FollowerChecker can simply return it.

On the concern that the circuit breaker can't be executed for LeaderChecker: why is that needed? While generating the overall view, the leader would have the health status of the local node as well. Also, I don't see us doing anything special for the leader node in Proposal 2 anyway.

Another ask: there should be a health check API exposed to the user as well, so that the user can run the health check on demand for all nodes or for a specific node.

Gaganjuneja commented 1 year ago

@shwetathareja thanks for reviewing it. Please find below the responses to your queries.

In both proposals, the cluster manager (leader) maintains/caches the health status of each node and then makes a decision. The main difference I see is whether the cluster manager piggybacks this health status collection on FollowerChecker or makes another transport call. As long as the health status is cached and the payload is not too big, FollowerChecker can simply return it.

There are a couple more points: 1) FollowerChecker is very aggressive (it runs every second), but we don't want the pluggable health checks to be that aggressive. 2) If a full cluster-level view is needed, we want a plugin running on the cluster manager node to maintain the cluster-level state and just return the list of unhealthy nodes to the ClusterHealthCheckService, so that we don't need to maintain state at the ClusterHealthCheckService level for each node and plugin combination. 3) Circuit breakers – circuit breakers are needed to reduce the blast radius in case of false negatives. So, if we add all these functionalities to the FollowersChecker, it will be unnecessarily overloaded and doing a lot of things.

On the concern that the circuit breaker can't be executed for LeaderChecker: why is that needed? While generating the overall view, the leader would have the health status of the local node as well. Also, I don't see us doing anything special for the leader node in Proposal 2 anyway.

You are right. Let's take a scenario: every node runs the leader checker to know whether the leader is active and healthy; if not, they initiate leader replacement and voting activities. If there is a bug in the plugin and the leader starts reporting unhealthy all the time, then quorum will be lost and the cluster goes to the red state. Yes, this is not resolved in Proposal 2; we have actually skipped it for now and are focusing on data nodes only.

Another ask: there should be a health check API exposed to the user as well, so that the user can run the health check on demand for all nodes or for a specific node.

Definitely a good call-out. I need to check how it's done today so that we can easily plug in there as well. But today only a single action is supported, and that is node eviction. So if unhealthy nodes are going to be evicted, the health status would be the same as the node's available/unavailable status.

BlackMetalz commented 1 year ago

Waiting for this, since whole clusters of mine have died due to Stuck I/O detection xD

kartg commented 1 year ago

@Gaganjuneja @Bukhtawar I've changed the release version on this to v2.7.0 since the code freeze date for v2.6.0 is today (Feb 21). Please let me know if that's not accurate.

kotwanikunal commented 1 year ago

@Gaganjuneja @Bukhtawar Reaching out since this is marked as part of the v2.7.0 roadmap. Please let me know if this isn't going to be a part of the release.

DarshitChanpura commented 1 year ago

@Gaganjuneja This issue will be marked for the next release, v2.8.0, as Apr 17 is the code-freeze date for v2.7.0. Please let me know if otherwise.

DarshitChanpura commented 1 year ago

Tagging it for next release: v2.8.0

hdhalter commented 1 year ago

@DarshitChanpura - I assume this is not going out in 2.8? Do we have a target release?

DarshitChanpura commented 1 year ago

@DarshitChanpura - I assume this is not going out in 2.8? Do we have a target release?

I'm not sure. @Gaganjuneja Do you have an expected ETA for this?

cwperks commented 1 year ago

Is there any target release for this issue? I see that it's tagged as v2.8.0, but I don't see any associated PRs.

Gaganjuneja commented 1 year ago

@DarshitChanpura @cwperks I am not able to prioritise this for the 2.8 release.

cwperks commented 1 year ago

Thank you @Gaganjuneja. I am removing the version label on this issue. Please add back a version label when this is scheduled.