linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.73k stars 583 forks source link

Self-healing can forever "miss" broker failures that occur while it is disabled #1367

Open mgrubent opened 3 years ago

mgrubent commented 3 years ago

Background

cruise-control provides self-healing for a variety of anomalies.

When an anomaly is detected, and the corresponding self-healing boolean is enabled, cruise-control can generate a proposal on its own to correct the anomaly, and then execute that proposal.

When these anomalies happen while the corresponding self-healing boolean is disabled, cruise-control (as-expected) does not generate proposals to correct the anomaly.

However, when an anomaly (e.g. a broker failure) occurs while self-healing is disabled, that same anomaly event will not cause cruise-control to generate a proposal when self-healing is later enabled.

Problem

Anomalies that occur while self-healing are disabled can essentially be "lost" to cruise-control, and some other actor has to tell cruise-control affirmatively to act. Re-enabling self-healing is insufficient to cause cruise-control to heal the anomaly.

Proposal

When cruise-control's anomaly-detection is re-enabled, any outstanding (i.e. not stale) anomalies should be acted on.

efeg commented 3 years ago

The issue description is a little off. This is only relevant for a broker_failure anomaly, under the following conditions:

Ideally, CC broker failure self-healing should act on a broker failure that happened before self-healing was enabled.

mgrubent commented 3 years ago

Ah; will update the title