operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0
1.72k stars 545 forks

OLM crashes etcd when update fails #2748

Open bo0ts opened 2 years ago

bo0ts commented 2 years ago

Bug Report

What did you do?

What did you expect to see?

I expected the installation attempts to back off exponentially and the cluster to remain stable.

What did you see instead? Under which circumstances?

The flood of installation attempts led to etcd timeouts and failures during leader election, which in turn caused multiple restarts of other operators and further failures. The default OpenShift API Priority and Fairness rules did not prevent this from happening.

Environment

exdx commented 2 years ago

When a CSV fails, there is a way to mark the error as unrecoverable rather than recoverable. There is a small list of unrecoverable failures, but most are recoverable. To solve this, the unrecoverable list should be updated to include cases where an upgrade attempts to change an immutable field. If OLM doesn't encounter an unrecoverable error when installing the CSV, it will keep trying to install it indefinitely.

Updating an operator that includes a change to an immutable field would require one to remove the existing version of the operator before attempting to install the newer version. Since OLM does patch updates, it cannot successfully install the newer version.

bo0ts commented 2 years ago

@exdx I'm not sure I agree. The immutable field error is a classic and is even covered in the troubleshooting documentation. Retrying here is perfectly fine with me, because it is an issue that has to be resolved manually during installation, and in most cases that is easy: just remove the offending object and let the operator installation recreate it, instead of removing the entire operator.

My problem is the way OLM actually retries, and that it does not back off after multiple failures.