redhat-cop / keepalived-operator

An operator to manage VIPs backed by keepalived
Apache License 2.0

Fix reconciler daemonset upsert timing issue #102

Closed: cedricmckinnie closed this 1 year ago

cedricmckinnie commented 1 year ago

This fixes a reconciler error caused by a race condition: sequential daemonset updates were happening too fast. I added a simple one-second delay between daemonset resource upserts.

Error message was:

1.668550155483559e+09   ERROR   Reconciler error        {"controller": "keepalivedgroup", "controllerGroup": "redhatcop.redhat.io", "controllerKind": "KeepalivedGroup", "keepalivedGroup": {"name":"keepalivedgroup-router","namespace":"keepalived"}, "namespace": "keepalived", "name": "keepalivedgroup-router", "reconcileID": "4a801d28-def5-45a5-a5dd-b700e5513e4a", "error": "Operation cannot be fulfilled on keepalivedgroups.redhatcop.redhat.io \"keepalivedgroup-router\": the object has been modified; please apply your changes to the latest version and try again"}

raffaelespazzoli commented 1 year ago

This is not the right way to solve this problem. I have this race condition in all of my operators. If you could find out why it really happens, I'd be grateful. Otherwise, it's an innocuous error and can be ignored.
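For context, the error quoted above is the API server's optimistic-concurrency check: an update is rejected when it carries an older resourceVersion than the stored object. The sketch below is a stdlib-only simulation of that check and of the usual remedy, which is to re-read the object and reapply the change rather than to sleep. `Store`, `Object`, and `UpdateWithRetry` are hypothetical names made up for illustration; they are not the operator's code. In a real operator the same pattern is available as `retry.RetryOnConflict` in `k8s.io/client-go/util/retry`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict mimics the API server's conflict response (illustrative only).
var ErrConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// Object is a hypothetical stand-in for a Kubernetes resource.
type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store is a hypothetical in-memory stand-in for the API server.
type Store struct{ current Object }

// Get returns a copy of the stored object.
func (s *Store) Get() Object { return s.current }

// Update accepts the write only if the caller started from the latest version.
func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// UpdateWithRetry re-reads the object and reapplies mutate after each conflict,
// giving up after the given number of attempts.
func UpdateWithRetry(s *Store, attempts int, mutate func(*Object)) error {
	var err error
	for i := 0; i < attempts; i++ {
		obj := s.Get() // always start from the latest version
		mutate(&obj)
		if err = s.Update(obj); err != ErrConflict {
			return err
		}
	}
	return err
}

func main() {
	s := &Store{}

	stale := s.Get() // writer A reads version 0
	fresh := s.Get() // writer B reads version 0
	fresh.Replicas = 1
	_ = s.Update(fresh) // writer B wins; the stored version is now 1

	stale.Replicas = 2
	fmt.Println("stale write:", s.Update(stale)) // rejected with ErrConflict

	err := UpdateWithRetry(s, 3, func(o *Object) { o.Replicas = 2 })
	fmt.Println("retried write:", err, s.Get().Replicas)
}
```

A fixed delay between writes only makes the stale read less likely; re-reading before each write removes it. A reconcile that returns this error is also simply run again by the controller, which is consistent with the error being harmless.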

cedricmckinnie commented 1 year ago

Hmm, I'll take another look. Is it exactly the same issue with your other operators? It seemed to happen only when I added the keepalived-template as a ConfigMap. My theory is that when the operator starts up, it applies the daemonset etc. from the keepalived-template that's baked into the operator image, and then, once the ConfigMap template is mounted, it tries to apply the ConfigMap template too quickly.

Alternatively, a Stack Overflow answer says it happens when extra fields are applied. That could also be the cause here, given the two different manifest sources. https://stackoverflow.com/a/51659296

cedricmckinnie commented 1 year ago

Ok, now that I've dug deeper into this, I think this is just a conflict between running the operator locally and having another instance of it deployed on the same cluster/namespace. When I run the operator exclusively from my local machine, or exclusively within the cluster, it doesn't happen. Increasing the operator replica count also works fine, thanks to leader election.
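A rough illustration of why extra replicas are safe while a separate local run is not, assuming the local instance does not take part in the same election (the thread does not confirm this). `Lease`, `Instance`, and `ActiveWriters` are hypothetical names for this stdlib-only sketch, not the operator's or controller-runtime's API.

```go
package main

import (
	"fmt"
	"sync"
)

// Lease is a hypothetical in-memory stand-in for a leader-election lock.
type Lease struct {
	mu     sync.Mutex
	holder string
}

// TryAcquire grants the lease to id if it is free or already held by id.
func (l *Lease) TryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" || l.holder == id {
		l.holder = id
		return true
	}
	return false
}

// Instance is one running copy of the operator.
type Instance struct {
	Name  string
	Lease *Lease // nil means the instance does not take part in the election
}

// ActiveWriters returns the instances that would reconcile and write to the cluster.
func ActiveWriters(instances []Instance) []string {
	var active []string
	for _, in := range instances {
		if in.Lease == nil || in.Lease.TryAcquire(in.Name) {
			active = append(active, in.Name)
		}
	}
	return active
}

func main() {
	shared := &Lease{}

	// Two in-cluster replicas share one lease, so only one of them writes.
	fmt.Println(ActiveWriters([]Instance{{"pod-a", shared}, {"pod-b", shared}})) // [pod-a]

	// A local run outside the election writes alongside the leader,
	// so the two can overwrite each other's updates.
	fmt.Println(ActiveWriters([]Instance{{"pod-a", shared}, {"local", nil}})) // [pod-a local]
}
```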

Seems like a non-issue. Closing for now. Thanks!