piraeusdatastore / piraeus-ha-controller

High Availability Controller for stateful workloads using storage provisioned by Piraeus
Apache License 2.0

Bug: node is not reconciled if tainting operation failed #76

Open kvaps opened 3 weeks ago

kvaps commented 3 weeks ago
I1101 06:50:28.044916       1 agent.go:440] updating node taints
I1101 06:50:28.103291       1 agent.go:276] managing node taints failed: failed to update node taints: Operation cannot be fulfilled on nodes "srv1": the object has been modified; please apply your changes to the latest version and try again

This error is returned here:

https://github.com/piraeusdatastore/piraeus-ha-controller/blob/40d3ee8d115dc44f4b326a44e80fa4bd7acdf0ec/pkg/agent/reconcile_failover.go#L152-L154

WanzenBug commented 3 weeks ago

I guess this could be improved. Ideally, we would not need to retry at all, since a proper merge patch would avoid the conflict, but when I last tried that, it did not work for taints specifically.

Even better would be to move away from tainting directly. One idea would be a webhook that adds a general anti-affinity to either all workloads or all PVs, so that we only need to label the node, which should work without having to update the node's taints directly.