vmware-labs / reconciler-runtime

⚠️ Maintenance suspended. Please, migrate to the active fork reconciler.io/runtime. See https://github.com/reconcilerio/runtime/releases/tag/v0.20.0 for instructions. This repository will be archived eventually.
Other
81 stars 18 forks source link

Determinism of IdentifyChild is confusing #481

Closed squeedee closed 6 months ago

squeedee commented 7 months ago

Describe the bug

Because the ResourceManager lists and filters the resources via the client again, at the point of .Manage - it loads a possibly mutated resource and uses that for that last IdentifyChild

This leads to surprising results, because I take the guidance in the docs:

// Non-deterministic IDs will result in the rapid deletion and creation of child resources.

to mean deterministic for the data passed to this reconcile

We were able to find a more "stable" form of identity, however it took a long time to discover the issue because of our assumptions about the data passed into the reconcile.

mamachanko commented 7 months ago

We were able to find a more "stable" form of identity

Can you describe what identity you started out with and what you settled on eventually? I reckon it would help to exemplify the issue.

scothis commented 7 months ago

// Non-deterministic IDs will result in the rapid deletion and creation of child resources.

to mean deterministic for the data passed to this reconcile

Deterministic within the context of a single reconcile request, but it would be odd for the value to change between different reconcile requests.

IdentifyChild is called for every known and desired child during the reconcile. The result is used to determine if a desired child needs to be created because it doesn't exist, update an existing resource, or delete a no longer desired resource.

squeedee commented 7 months ago

Deterministic within the context of a single reconcile request, but it would be odd for the value to change between different reconcile requests.

During such time the "knownChild" and it's counterpart on cluster can diverge.

We see the known child, we replace it with an update, but the key for the "update" does not match because some status on cluster changed. I know its a pretty wild edge case, but we were keying on what we knew about the resource in KnownChild and kept it consistant for the duration of the reconcile, and yet, it changed. It required us to have an intimate awareness of the entire machinery for ChildSet.

squeedee commented 7 months ago

To try and add clarity:

scothis commented 7 months ago

During such time the "knownChild" and it's counterpart on cluster can diverge.

This is inherently true for all reconcilers. Either the change is detected and the reconciled resource is reenqueued normally, or the update fails and the request is retried with the updated child.

we were keying on what we knew about the resource in KnownChild and kept it consistant for the duration of the reconcile, and yet, it changed

Can you explain more about the scenario?

If another process is modifying the child resource, you need to be robust and handle it, or fail gracefully. If that process is modifying the identity that's harder to handle. Depending on the context it may result in that resource being orphaned or deleted as no longer desired.

If you need to defend against a hostile process an admission webhook for the child resource can prevent the identity from being removed.

scothis commented 7 months ago

Child.rm.Manage - lists children, filters, and it's a newer version than seen in "knownChildren" above.

ahh. Ok, this issue something that bugged me about the initial implementation as it's a bit inefficient, but didn't think it would cause issues. We can update the Manager to accept a list of known children so it doesn't re-list children. The update will fail if the underlying resource was mutated, as usual.

squeedee commented 7 months ago

Yah we're already conditioned to look for RV changes, but this was a "trying to create but you gave it a name and generate-name" because it was now being treated as a create. It took a lot of tracer logs for us to work this out, because we were working on the false assumption that the "data in" didn't change for the duration of the core reconciler methods, namely "DesiredChildren" and "IdentifyChild"