Augment the autopilots running on each storage node with a central controller (probably implemented as a Kubernetes operator) to enable automation of additional workflows.
Scenarios
error handling
[x] When a drive error is observed (e.g. in kernel log), unmount that drive. (See the detection sketch after this list.)
[ ] When it is a repairable error, perform the repair (e.g. xfs_repair).
[ ] The user can then cordon the drive, which will remove the swift-id assignment from the drive and (if available) assign a spare disk. Cordons shall be longer-lived than the "broken" flags and shall, in particular, survive node reboots.
[ ] A drive may also be cordoned automatically when hardware metrics indicate drive failure.
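As an illustration of the kernel-log scenario above, the Go sketch below scans a log file for I/O error lines and reports the affected devices. The log path, the error pattern, and the reporting are assumptions for illustration only, not the autopilot's actual implementation (which would go on to unmount and flag the drive).

```go
// Hypothetical sketch: scan the kernel log for I/O errors and report the
// affected devices. Log path and error pattern are assumptions.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Matches lines like "blk_update_request: I/O error, dev sdc, sector 1234".
var ioErrorPattern = regexp.MustCompile(`I/O error.*\bdev (\w+)`)

func scanKernelLog(path string) (brokenDevices []string, err error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	seen := map[string]bool{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := ioErrorPattern.FindStringSubmatch(scanner.Text()); m != nil {
			dev := "/dev/" + m[1]
			if !seen[dev] {
				seen[dev] = true
				brokenDevices = append(brokenDevices, dev)
			}
		}
	}
	return brokenDevices, scanner.Err()
}

func main() {
	devices, err := scanKernelLog("/var/log/kern.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot scan kernel log:", err)
		os.Exit(1)
	}
	for _, dev := range devices {
		// In the real workflow the autopilot would unmount the drive and
		// mark it as broken; here we only report it.
		fmt.Println("broken drive detected:", dev)
	}
}
```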
mount propagation
[ ] When the set of active mounts changes, restart Swift pods consuming these mounts automagically.
[ ] When the autopilot cannot unmount or unmap a drive because of a lingering mount in a different namespace, reboot the node to get rid of the offending mount.
ring propagation
[ ] The operator keeps the current rings and is able to propagate them to the Swift server processes, including a coordinated restart if needed.
ring building - optional
[ ] The controller tracks all drives, assigns weight to them, and builds the Swift rings.
[ ] The user can check the rebalance output and may issue a swift-ring-builder <ring> dispersion
[ ] The user must approve the ring before rollout
[ ] The user can change weight of the drives and build a new ring, which the controller then distributes to all consuming pods.
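As a sketch of how the controller might drive this workflow, the snippet below shells out to swift-ring-builder to change a device's weight and rebalance, then leaves the output for the user to review and approve. The builder file name, the search value, and the error handling are assumptions.

```go
// Hedged sketch of the optional ring-building workflow: adjust one
// drive's weight, rebalance, and return the output for user approval.
package main

import (
	"fmt"
	"os/exec"
)

// setWeightAndRebalance changes the weight of one device and rebalances
// the builder file, returning the combined swift-ring-builder output.
func setWeightAndRebalance(builderFile, searchValue string, weight float64) (string, error) {
	var out []byte
	cmds := [][]string{
		{"swift-ring-builder", builderFile, "set_weight", searchValue, fmt.Sprintf("%g", weight)},
		{"swift-ring-builder", builderFile, "rebalance"},
	}
	for _, args := range cmds {
		o, err := exec.Command(args[0], args[1:]...).CombinedOutput()
		out = append(out, o...)
		if err != nil {
			return string(out), fmt.Errorf("%v failed: %w", args, err)
		}
	}
	return string(out), nil
}

func main() {
	// Example: reduce the weight of device d12 in the object ring
	// (builder file and device ID are illustrative).
	output, err := setWeightAndRebalance("object.builder", "d12", 50)
	fmt.Print(output)
	if err != nil {
		fmt.Println("ring build failed:", err)
		return
	}
	// The user would now review this output (and possibly run
	// `swift-ring-builder object.builder dispersion`) before approving
	// the new ring for distribution to the consuming pods.
}
```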
Cross concerns
resilience
To avoid long replication times and/or data loss, there should not be too many operations (drive reassignments, rebalancing, rebooting) going on at once.
serviceability/discoverability
There should be a user-friendly way to enumerate storage nodes and drives, inspect their status (e.g. list broken drives) and trigger operations (cordon/uncordon, change weight, rebalance).
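Assuming the controller is implemented as a Kubernetes operator, one user-friendly shape for this would be a custom resource per drive, so that kubectl get/describe/edit become the enumeration and operation interface. The Go types below are purely illustrative; none of these names exist yet.

```go
// Hypothetical sketch of a Drive custom resource that would make drives
// discoverable via kubectl; all type and field names are assumptions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DriveSpec holds the user-controlled fields: cordoning a drive and
// changing its ring weight would be done by editing the spec.
type DriveSpec struct {
	Cordoned bool    `json:"cordoned"`
	Weight   float64 `json:"weight"`
}

// DriveStatus is filled in by the autopilot/controller and reflects the
// observed state of the drive.
type DriveStatus struct {
	Node    string `json:"node"`
	Device  string `json:"device"`
	SwiftID string `json:"swiftID,omitempty"`
	Broken  bool   `json:"broken"`
	Mounted bool   `json:"mounted"`
}

// Drive represents one physical drive on a storage node.
type Drive struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              DriveSpec   `json:"spec"`
	Status            DriveStatus `json:"status,omitempty"`
}
```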
Replace /run/swift-storage/state/flag-ready and /run/swift-storage/state/unmount-propagation with a new file /run/swift-storage/state/generation.
Presence and absence of that file have the same semantics as the current flag-ready.
The file contains an unsigned integer (initially set to 1 by the autopilot).
When the autopilot has changed the set of active mounts, it increments the generation counter.
In consuming pods, a sidecar observes the generation counter. When it increases, it terminates itself such that Kubernetes restarts the entire pod (and thus adopts the new set of mounts).
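A minimal sketch of such a sidecar, using the generation file described above (the polling interval and error handling are assumptions):

```go
// Generation-watching sidecar sketch: remember the generation at startup
// and exit as soon as the counter increases, so the pod gets restarted
// and adopts the new set of mounts.
package main

import (
	"os"
	"strconv"
	"strings"
	"time"
)

const generationFile = "/run/swift-storage/state/generation"

// readGeneration returns 0 when the file is absent (i.e. the node is not
// ready), mirroring the flag-ready semantics described above.
func readGeneration() uint64 {
	buf, err := os.ReadFile(generationFile)
	if err != nil {
		return 0
	}
	gen, err := strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
	if err != nil {
		return 0
	}
	return gen
}

func main() {
	startGen := readGeneration()
	for {
		time.Sleep(10 * time.Second)
		if readGeneration() > startGen {
			// Exiting causes Kubernetes to restart the pod (as configured),
			// which re-reads the now-current set of mounts.
			os.Exit(1)
		}
	}
}
```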
When the central controller is present, autopilots have to seek approval before rolling out a new generation.
When a new generation is ready, the autopilot sends a request to roll out this generation to the controller.
The controller confirms rollout only when no other nodes are currently rolling out a new generation.
After rollout of a new generation on one node, the controller waits for the pods on that node to come up healthily, e.g. by observing kubectl get pods and swift-recon.
Ring rollout happens via the same mechanism, but with the roles reversed.
The controller requests that the autopilot increase the generation counter to trigger a pod restart that adopts the new rings.
The controller waits for the appropriate healthchecks to come back positive before moving on to the next node.
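Taken together, the approval protocol amounts to a per-cluster lock held by whichever node is currently rolling out, which is also where the resilience concern above (not too many disruptive operations at once) is enforced. A minimal sketch with assumed names, not the actual controller API:

```go
// Hedged sketch of how the controller could serialize generation and
// ring rollouts: at most one node rolls out at a time, and the next
// approval is only granted once the previous node is healthy again.
package controller

import (
	"errors"
	"sync"
)

type rolloutCoordinator struct {
	mu          sync.Mutex
	rollingNode string // empty when no rollout is in flight
}

// RequestRollout is called by an autopilot that wants to roll out a new
// generation; it is approved only when no other node is rolling out.
func (c *rolloutCoordinator) RequestRollout(node string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.rollingNode != "" && c.rollingNode != node {
		return errors.New("another node is currently rolling out, try again later")
	}
	c.rollingNode = node
	return nil
}

// FinishRollout is called once the pods on the node are healthy again
// (e.g. after checking pod readiness and swift-recon), unblocking the
// next rollout.
func (c *rolloutCoordinator) FinishRollout(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.rollingNode == node {
		c.rollingNode = ""
	}
}
```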