Currently, when a node is decommed, it can stay running and keep listening for Kafka requests. It still answers metadata requests and includes itself in the response, which effectively hangs clients that trust the metadata response.
Thoughts on fixing it:
The decommed node will keep sending vote requests to the rest of the cluster (which are rejected). We could send "you're not a member" responses that cause the decommed node to shut down.
Similarly, the decommed node will keep sending health monitor requests for updated state. We could send "you're not a member" responses there too.
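A minimal sketch of the "you're not a member" idea, covering both the vote request and health monitor request paths. Everything here is hypothetical and for illustration only: `Errc.NOT_A_MEMBER`, `Controller`, and `Node` are invented names, not existing Redpanda types or error codes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Errc(Enum):
    SUCCESS = 0
    NOT_A_MEMBER = 1  # hypothetical error code, not an existing one


@dataclass
class Controller:
    """Stands in for whichever node receives the request."""

    members: Set[int] = field(default_factory=set)

    def handle_request(self, sender_id: int) -> Errc:
        # Same check for vote requests and health monitor requests.
        if sender_id not in self.members:
            return Errc.NOT_A_MEMBER
        return Errc.SUCCESS


@dataclass
class Node:
    node_id: int
    running: bool = True

    def on_response(self, errc: Errc) -> None:
        # A decommed node stops itself instead of retrying forever.
        if errc is Errc.NOT_A_MEMBER:
            self.running = False
```

With members `{1, 2}`, a request from node 3 would get `NOT_A_MEMBER` back and node 3 would mark itself as no longer running, while node 1 carries on as before.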
We should refuse health monitor update requests from nodes which are decommed. That way the decommed node's health state will organically become stale, and the Kafka request handler can error out metadata requests if the state is too old. (Currently decommed nodes keep their state up to date because the controller leader accepts their requests.)
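The staleness check on the decommed node's side could look roughly like the sketch below. It assumes a hypothetical `max_age_s` threshold and a made-up response shape; `HealthState` and `handle_metadata_request` are illustrative names, not the actual request handler.

```python
from typing import Optional


class HealthState:
    """Local health state plus the time it was last refreshed."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s  # hypothetical staleness threshold
        self.last_update_s: Optional[float] = None

    def record_update(self, now_s: float) -> None:
        # Called only when the controller leader accepts our update request.
        self.last_update_s = now_s

    def is_fresh(self, now_s: float) -> bool:
        if self.last_update_s is None:
            return False
        return now_s - self.last_update_s <= self.max_age_s


def handle_metadata_request(state: HealthState, now_s: float) -> dict:
    # Error out rather than advertise brokers based on stale state.
    if not state.is_fresh(now_s):
        return {"error": "stale_health_state", "brokers": []}
    return {"error": None, "brokers": ["self"]}
```

Once the controller leader stops accepting updates, `record_update` is never called again, so the node starts failing metadata requests after `max_age_s` without any extra signalling.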
Decommed (or isolated) nodes could also have a fallback check for whether they have received a heartbeat on any raft group (including raft0). If a node hasn't seen any heartbeats for too long, it can reasonably assume it is isolated and/or decommed, and should not serve Kafka requests.
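The heartbeat fallback could be sketched as a small watchdog. This is an assumption-laden illustration: `HeartbeatWatchdog`, `timeout_s`, and the startup grace period via `start_s` are all hypothetical, and it only tracks heartbeats the node receives.

```python
from typing import Dict


class HeartbeatWatchdog:
    """Tracks the most recent heartbeat seen on any raft group."""

    def __init__(self, timeout_s: float, start_s: float) -> None:
        self.timeout_s = timeout_s  # hypothetical "too long" threshold
        self.start_s = start_s      # grace period is measured from startup
        self.last_seen_s: Dict[str, float] = {}

    def on_heartbeat(self, group: str, now_s: float) -> None:
        self.last_seen_s[group] = now_s

    def may_serve_kafka(self, now_s: float) -> bool:
        # One recent heartbeat on any group (raft0 included) is enough.
        latest = max(self.last_seen_s.values(), default=self.start_s)
        return now_s - latest <= self.timeout_s
```

The Kafka request handler would consult `may_serve_kafka` before answering; a node that has heard nothing on any group for longer than `timeout_s` would refuse to serve.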
JIRA Link: CORE-803