opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Finer Cluster Health Status #12606

Open muralikpbhat opened 7 months ago

muralikpbhat commented 7 months ago

Is your feature request related to a problem? Please describe

Currently, the _cluster/health API has a response field called 'status', which can be green, yellow, or red based on shard availability.
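For reference, an abbreviated response from the API today (many fields trimmed, values illustrative):

```json
GET _cluster/health

{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "active_primary_shards": 10,
  "active_shards": 20,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
```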

This can be misleading: one can interpret this status as the 'stability' of the cluster, whereas it actually reflects the availability of the shards. Real stability issues like out-of-memory conditions or high CPU can cause shards to become unassigned, so a yellow/red status makes sense there. But there are many other stability issues where the shards remain available, leaving the cluster status incorrectly 'green'.

The converse is also not always true: a yellow or red status doesn't necessarily mean a 'cluster health' issue. For example, a shard can be temporarily unavailable (yellow) during deployments, or shards can be initially unassigned (red) during index creation; these are benign and shouldn't ring an alarm bell.

Describe the solution you'd like

I suggest a more granular (finer) status to solve this problem.

Ideally, the current 'status' would indicate the 'stability' or 'duress' of the cluster based on thresholds across dimensions like resource usage, API errors, throttling, etc. (detailed proposal TBD). We would introduce a new field in the _cluster/health response called 'shard_status', which shows green/yellow/red based on shard availability (the existing definition). Since changing the current 'status' might be perceived as backward incompatible, there is also the option of keeping 'status' as is and introducing 'cluster_status' and 'shard_status' as new fields altogether. However, I would prefer changing the behavior of 'status' itself to solve the perception issue and deal with the backward incompatibility.
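To make the preferred option concrete, a rough sketch of what the response could look like, with 'status' repurposed for stability and the new 'shard_status' carrying the existing definition (values illustrative, nothing here is a committed design):

```json
{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "shard_status": "green",
  "number_of_nodes": 3,
  "active_shards": 20,
  "unassigned_shards": 0
}
```

Here the cluster is flagged yellow for stability (e.g. a node under memory duress) even though every shard copy is assigned.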

Related component

Cluster Manager

Describe alternatives you've considered

Another alternative is to introduce a new API, '_shards/health'.

While this gives a new status for actual shard availability and doesn't disturb the current _cluster/health, it may not clearly solve the perception issue with the old API.
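A rough sketch of what such an endpoint could return (the API and its fields are hypothetical at this point):

```json
GET _shards/health

{
  "status": "yellow",
  "active_primary_shards": 10,
  "active_shards": 19,
  "unassigned_shards": 1
}
```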

Additional context

No response

andrross commented 7 months ago

Thanks @muralikpbhat! I agree that the current cluster status is often interpreted as "is my cluster healthy?", but the actual definition of that field is much narrower, as you described. I'm on board with creating a field that actually tries to represent "is my cluster healthy?".

I've also thought about the durability aspect related to this. In the regular cluster model, a "yellow" shard status can represent a real risk to data durability, whereas with the new remote store-based architectures durability is never at risk due to node failures and "yellow" can only mean a risk to availability. Do you think a durability status field would be useful, or is that maybe too much in the weeds?

> I would prefer changing the behavior of 'status' itself to solve the perception issue and deal with the backward incompatibility.

The devil is in the details of how exactly we would deal with backward incompatibility. I suspect users have lots of tooling, procedures, and informal processes built around the existing definition of "status" and changing it would be quite impactful. I'd lean towards introducing a new field (probably two new fields, where "shard_status" is an alias for the existing "status" field). The next major version could then change the definition of the top level "status" field as a breaking change.
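To illustrate (field names are not settled), a 2.x response under this approach could keep 'status' and its alias 'shard_status' in lockstep while the new top-level health field diverges:

```json
{
  "status": "yellow",
  "shard_status": "yellow",
  "cluster_status": "green"
}
```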

andrross commented 7 months ago

[Triage - attendees 1 2 3] @muralikpbhat Thanks for filing this issue, looking forward to more discussion on this topic.

muralikpbhat commented 7 months ago

Thanks @andrross for the feedback.

> backward incompatibility

Agree with the concern here. While it is hard to automate any action based on the current ambiguous status, we can't rule out some existing automation. Let us change that only in a major version and add the new field for now.

> durability indicator

Great suggestion; however, durability is more nuanced than the data loss that happens when the last shard copy is lost. Also, a durability indicator makes sense only for a local cluster without remote store. So, I would use shard_status to indicate availability only, and we could even rename it to shard_availability_status to be explicit. However, it might be a good idea to have a different colour for 'some copies unavailable' vs 'only one copy available', to show the potential risk of full unavailability.
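One purely illustrative way to encode that extra distinction; the 'orange' level below is invented for this sketch and exists in no current API:

```json
{
  "green": "all shard copies assigned",
  "yellow": "some replica copies unassigned, two or more copies of every shard still available",
  "orange": "only one copy left for some shards",
  "red": "at least one shard has no available copy"
}
```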

> creating a field that actually tries to represent "is my cluster healthy?".

This is the main challenge. Any thoughts on what metrics/stats we should consider? I was thinking of stats like CPU, memory, disk, and errors, with the thresholds for yellow and red configurable by the user. We can also add more metrics as we learn more. In addition to the 'cluster_status', we might also need to show more details, to reason about which of the metrics caused the cluster to go red/yellow.
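For example, the thresholds could be exposed as dynamic cluster settings via the existing PUT _cluster/settings API; the setting names below are entirely hypothetical:

```json
PUT _cluster/settings

{
  "persistent": {
    "cluster.health.stability.cpu.yellow_threshold": "80%",
    "cluster.health.stability.cpu.red_threshold": "95%",
    "cluster.health.stability.memory.yellow_threshold": "85%",
    "cluster.health.stability.memory.red_threshold": "95%"
  }
}
```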

andrross commented 7 months ago

> Any thoughts on what metrics/stats we should consider?

Just thinking out loud, but should we tie this to the existing back pressure mechanisms, i.e. disk watermark thresholds, circuit breakers, search/indexing back pressure, etc.? If any one of these mechanisms is rejecting work, the new status should show "red". If any one is close (for some configurable definition of "close") to rejecting work, then the status should show "yellow".
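Sketching that idea, the response could also surface which mechanism is driving the status; field names here are hypothetical:

```json
{
  "status": "yellow",
  "shard_status": "green",
  "stability_indicators": [
    { "name": "disk_watermark", "status": "yellow", "reason": "disk usage 88%, high watermark 90%" },
    { "name": "parent_circuit_breaker", "status": "green" },
    { "name": "search_backpressure", "status": "green" }
  ]
}
```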

muralikpbhat commented 7 months ago

IMO, those existing mechanisms are "actions" taken when some "metrics or stats" have breached thresholds. The 'status change' is also such an 'action'. There will be overlap in which metrics are considered for which actions, but there could be additional metrics or different thresholds for some actions like 'status'. So, I would suggest we model the 'status' on metrics instead of on other actions. That said, the other actions might emit metrics/stats that can feed into the new status action. For example, circuit breakers and back pressure both result in rejections, and that 'rejections or errors' signal is an input for the 'status' action.

shwetathareja commented 7 months ago

Thanks @muralikpbhat for the proposal.

I like the idea of providing a more granular status which can differentiate between availability and stability; it would be useful to users and would also help them identify issues faster. Stability could also be viewed separately from the data node vs cluster manager node perspective. Similarly, write vs read availability could differ: shards going red for searchable snapshots impact only reads, compared to a hot index which is serving both reads and writes, and the impact also depends on how many copies are available.

Some users may configure alarms on these statuses, so the metrics feeding into these status calculations have to be crisp; otherwise it can confuse users when they are trying to debug which metric caused the status to change so that they can take appropriate action. A summarized view across all the metrics which contribute to the availability/stability status, instead of going through a list of _cat/_cluster APIs, would be useful.

Thinking more on the backward incompatibility of the status in the _cluster/health API: currently the status means the availability status of shards, directly or indirectly. A breaking change would be preferred in the next major version, i.e. 3.0. We could deprecate the existing _cluster/health API, but it is such a widely used API that the impact of deprecation would be huge. I am in favor of adding the more granular status in the existing API itself. In 2.x it can be gated behind a new query param, and in 3.x it can become the default response.
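For instance, with a hypothetical opt-in parameter in 2.x (the parameter name is illustrative only):

```json
GET _cluster/health?include_granular_status=true

{
  "status": "green",
  "shard_status": "green",
  "cluster_status": "yellow"
}
```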

> Any thoughts on what metrics/stats we should consider? I was thinking of stats like CPU, memory, disk, and errors

In addition to these: disk throughput, IOPS, queue build-ups, back pressure rejections, admission control rejections, too many shards, pending tasks, and skewness (this may be trickier) could be useful.