operator-framework / operator-controller

A new and improved management framework for extending Kubernetes with Operators
https://operator-framework.github.io/operator-controller/
Apache License 2.0

[epic] Ensure no two ClusterExtensions manage the same underlying object when concurrent reconciles > 1 #1101

Open everettraven opened 1 month ago

everettraven commented 1 month ago

As mentioned in #736, Helm has support for ensuring that the same resources are not managed by multiple Helm Releases. This is sufficient when concurrent reconciliation is not possible, but we will need an alternative solution that prevents race conditions once concurrent reconciliation is allowed.

bentito commented 1 month ago

Can you give a concrete example of "when concurrent reconciliation is allowed", including why it would be allowed? It seems like we'd always want Helm's built-in support to ensure the same resources are not managed by multiple Helm Releases. If the possible concurrent manager of a resource is some operator, then maybe we need to surface Helm's locks as o-c's own and document, as a best practice, that operator authors should respect the locks?

joelanford commented 1 month ago

It is simple to implement an admission policy that catches this situation generally during Kubernetes admission, rather than relying on a client to do it (which is what happens now).

Helm's built-in support is problematic for three reasons:

  1. It relies on Helm to keep doing it, and to keep doing it in the same way.
  2. It suffers from race conditions because it is implemented in a client and not during Kubernetes admission.
  3. It is not a general solution that we could apply in the potential future where another lifecycling mechanism is supported by OLM.
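
As a rough illustration of the second point, the following Go sketch replays one problematic ordering of a client-side check followed by a write, and contrasts it with a check that is performed together with the write. It is not how Helm or the API server is implemented; the `store` type, the function names, and the `ext-a`/`ext-b` owners are all hypothetical and exist only to show the check-then-act gap.

```go
package main

import "fmt"

// store is a stand-in for the cluster's object storage: object name -> owner.
// All names in this sketch are hypothetical and exist only for illustration.
type store map[string]string

// clientCheck is the client-side "is this object unowned?" test that runs
// before the client writes anything.
func clientCheck(s store, object string) bool {
	_, owned := s[object]
	return !owned
}

// clientWrite applies the object without re-checking; the last writer wins.
func clientWrite(s store, object, owner string) {
	s[object] = owner
}

// admit models a check performed at admission: the ownership test and the
// write happen as one step, so a second, different owner is rejected.
func admit(s store, object, owner string) error {
	if current, owned := s[object]; owned && current != owner {
		return fmt.Errorf("%s is already managed by %s", object, current)
	}
	s[object] = owner
	return nil
}

// racyInterleaving replays one problematic ordering: both clients check
// before either writes. It reports whether each check passed and which
// owner ends up recorded on the object.
func racyInterleaving() (aOK, bOK bool, finalOwner string) {
	s := store{}
	aOK = clientCheck(s, "deploy/foo") // client A sees no owner
	bOK = clientCheck(s, "deploy/foo") // client B also sees no owner
	clientWrite(s, "deploy/foo", "ext-a")
	clientWrite(s, "deploy/foo", "ext-b") // silently replaces ext-a
	return aOK, bOK, s["deploy/foo"]
}
```

In `racyInterleaving`, both checks pass and the second write overwrites the first, so both clients believe they manage the object. With `admit`, the second caller gets an error instead, regardless of which client issued the request.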

We may need to increase the concurrency of our reconciler for a variety of reasons. Today, reconcile blocks to populate/update the catalog cache and to pull bundle images. In the future, we may need to support Helm charts that have hooks that block progression of install/upgrade/uninstall execution, which happens synchronously in the reconciler.

In order to scale to clusters with frequent ClusterExtension interactions, we will very likely need to handle ClusterExtension reconciles concurrently. As soon as we do that, Helm's guarantees disappear because we will be calling it concurrently.
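
To make the concurrent case concrete, here is a small Go sketch in which several reconciles run at once and contend for the same object. It assumes a single enforcement point (standing in for an admission-time check) that tests and records ownership under one lock; `ownerRegistry`, `claim`, and `reconcileAll` are hypothetical names, not part of operator-controller or Helm.

```go
package main

import "sync"

// ownerRegistry is a hypothetical stand-in for an ownership check that is
// enforced in exactly one place, rather than in each client.
type ownerRegistry struct {
	mu     sync.Mutex
	owners map[string]string // object name -> owning extension
}

func newOwnerRegistry() *ownerRegistry {
	return &ownerRegistry{owners: make(map[string]string)}
}

// claim tests and records ownership while holding the lock, so two callers
// can never both observe the object as unowned.
func (r *ownerRegistry) claim(object, extension string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if current, ok := r.owners[object]; ok && current != extension {
		return false
	}
	r.owners[object] = extension
	return true
}

// reconcileAll starts one goroutine per extension, all claiming the same
// object, and returns the extensions whose claim succeeded.
func reconcileAll(r *ownerRegistry, object string, extensions []string) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards winners
		winners []string
	)
	for _, ext := range extensions {
		wg.Add(1)
		go func(ext string) {
			defer wg.Done()
			if r.claim(object, ext) {
				mu.Lock()
				winners = append(winners, ext)
				mu.Unlock()
			}
		}(ext)
	}
	wg.Wait()
	return winners
}
```

Calling `reconcileAll` with several distinct extension names yields exactly one winner, although which one wins depends on scheduling. The point of the sketch is that the guarantee comes from where the check runs, not from how many reconciles run at a time.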

perdasilva commented 1 month ago

I'll take this over and introduce the VAP (ValidatingAdmissionPolicy). I'll see if I can find a way to test the race condition.