Simplify the recovery when an existing node crashes during an expansion

k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra

Apache License 2.0


Currently:

Only one node starts at a time across the whole datacenter, with round robin between racks

Conditions to allow fast lane startups:

Pod has the Ready-To-Start label

Conditions to disallow fast lane startups:

Node is planned for a replacement
Node has never fully bootstrapped
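
The conditions above can be sketched as a single predicate. This is only an illustration of the proposed rule, not cass-operator code: the `PodInfo` type, its fields, and `CanFastLaneStart` are hypothetical names, and the real operator would derive these values from the Pod labels and the CassandraDatacenter status.

```go
package main

// PodInfo is a hypothetical, simplified view of a Cassandra pod.
// It is not a type from cass-operator.
type PodInfo struct {
	Name               string
	ReadyToStart       bool // pod has the Ready-To-Start label
	PlannedReplacement bool // node is planned for a replacement
	Bootstrapped       bool // node has fully bootstrapped at least once
}

// CanFastLaneStart reports whether a pod may start without waiting for
// its turn in the one-node-at-a-time queue.
func CanFastLaneStart(p PodInfo) bool {
	// Allow condition: the pod must be labeled as ready to start.
	if !p.ReadyToStart {
		return false
	}
	// Disallow conditions: replacements and never-bootstrapped nodes
	// stay on the regular, serialized path.
	if p.PlannedReplacement || !p.Bootstrapped {
		return false
	}
	return true
}
```

With this shape, a crashed node that was already part of the ring returns `true`, while a node still being added by the expansion returns `false` and keeps using the serialized startup.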

Things to verify:

If a DC is fully stopped, bringing all of its nodes back up concurrently works without a hiccup

Technical aspects:

How do we ensure a node was part of the ring before?

The pod name appears in the cassdc .status.nodeStatuses struct with a host ID. cass-operator needs to ensure that an entry is added only for nodes that have successfully completed bootstrap (their state is UN). nodeStatuses has to be an exact representation of the DC topology, so any node removal must be reflected there after a scale-in/scale-down operation. Such a removal can be detected through the LEAVING/REMOVED states in the endpoint state.
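
A minimal sketch of that bookkeeping follows, under stated assumptions. `NodeStatus` and `EndpointState` are simplified stand-ins rather than the operator's real types, the state strings ("UN", "LEAVING", "REMOVED") are taken from the wording above rather than from an actual API, and `ReconcileNodeStatuses` and `WasPartOfRing` are hypothetical helper names.

```go
package main

// NodeStatus is a simplified stand-in for one entry of
// .status.nodeStatuses, keyed by pod name.
type NodeStatus struct {
	HostID string
}

// EndpointState is a hypothetical reduction of what the operator
// observes for a node; the string encoding of State is an assumption.
type EndpointState struct {
	HostID string
	State  string // e.g. "UN", "LEAVING", "REMOVED"
}

// ReconcileNodeStatuses updates statuses from the observed endpoint states:
// an entry is written only once a node is UN, and deleted when the node
// is LEAVING or REMOVED. Any other state leaves the map untouched.
func ReconcileNodeStatuses(statuses map[string]NodeStatus, observed map[string]EndpointState) {
	for pod, ep := range observed {
		switch ep.State {
		case "UN":
			statuses[pod] = NodeStatus{HostID: ep.HostID}
		case "LEAVING", "REMOVED":
			delete(statuses, pod)
		}
	}
}

// WasPartOfRing reports whether the pod has a recorded host ID, i.e.
// whether it completed bootstrap at some point and was not removed since.
func WasPartOfRing(statuses map[string]NodeStatus, pod string) bool {
	st, ok := statuses[pod]
	return ok && st.HostID != ""
}
```

Because a node that is still joining never gets an entry, a lookup in the map is enough to separate a crashed existing node from a node that the expansion has not finished adding.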


Simplify the recovery when an existing node crashes during an expansion #646