Latest operator container image breaks all instances completely

mhutter commented 1 year ago

Observed behavior

All Keepalived pods created by the operator are in Init:CrashLoopBackOff, with the following error message in the log of the config-setup init container:

bash: /usr/local/bin/notify.sh: No such file or directory

As a result, all managed IPs (and hence all applications on the cluster) are unreachable.

Cause

The image currently published as quay.io/redhat-cop/keepalived-operator:latest is missing the notify scripts used in the init container as well as the config-reloader container of the generated keepalived DaemonSets.

This is a problem because:

The operator hardcodes the image of those containers to quay.io/redhat-cop/keepalived-operator:latest
The operator hardcodes the imagePullPolicy to Always
The containers can't come up without said scripts

Workaround

Scale the keepalived-operator-controller-manager deployment to zero
Manually patch all generated keepalived DaemonSets to use quay.io/redhat-cop/keepalived-operator:v1.4.2 instead

This is very brittle, as the deployment may be scaled up again at any time by OLM or some other controller.
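
For reference, the two workaround steps could be scripted roughly as below. This is only a sketch: the namespaces and DaemonSet names are placeholders that have to be filled in per cluster, and the `KUBECTL` override is a hypothetical convenience for dry runs. The container names config-setup and config-reloader are the ones mentioned above.

```shell
# Sketch of the manual workaround; nothing here is provided by the operator.
# KUBECTL is a hypothetical override, e.g. KUBECTL="echo kubectl" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
PINNED_IMAGE="quay.io/redhat-cop/keepalived-operator:v1.4.2"

# Step 1: stop the operator so it does not revert the patched DaemonSets.
# $1 = namespace the operator runs in (cluster-specific assumption).
stop_operator() {
  local ns="$1"
  $KUBECTL -n "$ns" scale deployment keepalived-operator-controller-manager --replicas=0
}

# Step 2: pin the init container and the config-reloader container of one
# generated DaemonSet to the known-good tag.
# $1 = namespace, $2 = DaemonSet name (both cluster-specific assumptions).
pin_daemonset() {
  local ns="$1" ds="$2"
  $KUBECTL -n "$ns" set image "daemonset/$ds" \
    "config-setup=$PINNED_IMAGE" "config-reloader=$PINNED_IMAGE"
}
```

Calling `stop_operator <operator-namespace>` once and then `pin_daemonset <namespace> <daemonset>` for each generated DaemonSet mirrors the steps above. Running with `KUBECTL="echo kubectl"` first prints the commands without touching the cluster.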

mhutter commented 1 year ago

Pointing my finger at #96 and pinging @raffaelespazzoli :)

raffaelespazzoli commented 1 year ago

Thanks for letting me know. It should be fixed now. And yes, it's brittle. Any suggestions, or willingness to contribute a solution?

ericb-summit commented 1 year ago

Grazie @raffaelespazzoli, I can confirm the script is there now.

I don't think any "fix" is needed, except maybe automated testing to be sure the script is there before pushing to quay.io. I'm not a GH Actions expert.

Thanks again

raffaelespazzoli commented 1 year ago

Phew, all's well that ends well. And yes, your point is noted... we should have better automated tests.

mhutter commented 1 year ago

Hmm, my initial idea was to use tagged images in those containers as well, but this would currently mean running a very outdated image.

But some automated testing before pushing would be nice!
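
A pre-push check along these lines might be enough. It is a minimal sketch, not something the repository currently has: `check_scripts` is a hypothetical helper, and the only path taken from this issue is /usr/local/bin/notify.sh. Any other notify scripts would need to be added to the argument list.

```shell
# check_scripts ROOT PATH...
# Reports every PATH that is missing or not executable under ROOT and
# returns non-zero if at least one is. ROOT may be "" to check the live
# filesystem, e.g. when the function runs inside the freshly built image.
check_scripts() {
  local root="$1"
  shift
  local missing=0 path
  for path in "$@"; do
    if [ ! -x "${root}${path}" ]; then
      echo "missing or not executable: ${path}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In a CI job this could run against an unpacked copy of the image filesystem, for example `check_scripts "$rootfs" /usr/local/bin/notify.sh`, and fail the build before the push step. How the filesystem gets unpacked depends on the build tooling and is left open here.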

raffaelespazzoli commented 1 year ago

At the moment we only test the deployment of the operator with Helm. It's the only thing we were able to automate across all of the operators we maintain. Contributions in this space, even operator-specific contributions, are welcome.
