open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes
https://open-policy-agent.github.io/gatekeeper/
Apache License 2.0

Drift correction for gatekeeper mutations #809

Open mmirecki opened 3 years ago

mmirecki commented 3 years ago

The feature aims to add a drift correction mechanism to the gatekeeper mutating admission logic.

This feature would be a follow up to Gatekeeper mutations design

The feature assumes that mutating admission logic is added to Gatekeeper. The mutating admission logic ensures that all resources created conform to a given MutationTemplate. In some cases it would, however, be possible for a resource to drift away from the current MutationTemplates. This would be especially true for resources created during mutating webhook downtime (the default failure policy for the Gatekeeper webhook is "failurePolicy: Ignore", which we assume would also be true for the mutating webhook). Other possibilities would be resources created before a MutationTemplate has been applied, or changes to a MutationTemplate during the lifecycle of a resource.

The drift correction mechanism would be activated in the cases described above: resources created during webhook downtime, resources created before a MutationTemplate was applied, and resources whose MutationTemplates changed after creation.

Design draft: https://docs.google.com/document/d/14lVMIZK1bKjBg4QEnjvAmcfK5a0yOBHk2Y0yxDA19_Y/edit#
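
For concreteness, here is a minimal Go sketch of the fail-open webhook registration that makes the downtime scenario possible, using client-go types. The webhook and service names are assumptions, not Gatekeeper's actual manifest; the `Ignore` failure policy is the relevant part.

```go
// Sketch: registering a mutating webhook with failurePolicy: Ignore means
// that if the webhook is unreachable, resources are admitted unmutated --
// exactly the drift window described above.
package sketch

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func mutatingWebhookConfig() *admissionregistrationv1.MutatingWebhookConfiguration {
	failOpen := admissionregistrationv1.Ignore // unreachable webhook => request admitted unmutated
	sideEffects := admissionregistrationv1.SideEffectClassNone
	path := "/v1/mutate" // hypothetical service path

	return &admissionregistrationv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "gatekeeper-mutating-webhook"}, // assumed name
		Webhooks: []admissionregistrationv1.MutatingWebhook{{
			Name:          "mutation.gatekeeper.sh", // assumed name
			FailurePolicy: &failOpen,
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "gatekeeper-system",
					Name:      "gatekeeper-webhook-service",
					Path:      &path,
				},
			},
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
		}},
	}
}
```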

maxsmythe commented 3 years ago

Drift detection and mutation is a good idea. This is similar to #388 in that it is actively trying to correct errors in the cluster. It is also similar to #388 in that it has the potential to be dangerous, and so should be opt-in at least at the start.

Some unanswered design questions I can see around this idea:

If we can somehow build convergence into the design of our mutation semantics, this will be simple to implement and look almost exactly like audit. Otherwise, we may need to figure out how to do something like iteration counting so we can alert on endlessly mutating objects.
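
For illustration, a rough sketch of what iteration counting could look like, carrying a per-object pass count in an annotation. The annotation key and the cap are assumptions, not anything Gatekeeper ships:

```go
// Sketch: count drift-correction passes per object and alert instead of
// mutating again once a cap is exceeded.
package sketch

import (
	"fmt"
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	iterationAnnotation = "mutations.gatekeeper.sh/iterations" // hypothetical key
	maxIterations       = 5                                    // arbitrary cap
)

// recordIteration bumps the per-object pass counter and errors when the
// object still needs mutation after maxIterations passes, i.e. the rules
// look non-convergent for this object.
func recordIteration(obj metav1.Object) error {
	anns := obj.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	n, _ := strconv.Atoi(anns[iterationAnnotation]) // zero if absent or garbled
	n++
	if n > maxIterations {
		return fmt.Errorf("object mutated %d times; rules are likely non-convergent", n)
	}
	anns[iterationAnnotation] = strconv.Itoa(n)
	obj.SetAnnotations(anns)
	return nil
}
```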

mmirecki commented 3 years ago

Could we maybe annotate the mutated resource with the resource version (and UID?) of the rules applied to it? This might spare us the iteration counting. If a resource with a current resource version needs to be modified, that would indicate an issue with the rule. Also, would we not have just one mutating webhook for all the rules, and leave it up to the rule engine to handle them?
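
A rough sketch of what that annotation could look like; the key and the "uid@resourceVersion" encoding below are assumptions:

```go
// Sketch: stamp each mutated resource with the exact rule versions applied
// to it. If the stamp already matches the current rules yet the resource
// still needs patching, the rule set is likely non-idempotent.
package sketch

import (
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const appliedAnnotation = "mutations.gatekeeper.sh/applied" // hypothetical key

// ruleStamp encodes the rule versions as "uid@resourceVersion" pairs.
func ruleStamp(rules []metav1.Object) string {
	parts := make([]string, 0, len(rules))
	for _, r := range rules {
		parts = append(parts, string(r.GetUID())+"@"+r.GetResourceVersion())
	}
	return strings.Join(parts, ",")
}

// stampApplied records the current rule versions on the mutated resource.
func stampApplied(obj metav1.Object, rules []metav1.Object) {
	anns := obj.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns[appliedAnnotation] = ruleStamp(rules)
	obj.SetAnnotations(anns)
}

// upToDate reports whether the resource was already mutated by the current
// rule versions; needing another patch in that state points at a bad rule.
func upToDate(obj metav1.Object, rules []metav1.Object) bool {
	return obj.GetAnnotations()[appliedAnnotation] == ruleStamp(rules)
}
```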

maxsmythe commented 3 years ago

The UID-tracking idea seems like a good one. It would definitely detect not-yet-converged resources. Some questions:

  1. Once we do detect non-convergent mutations, what do we do? It may be hard to identify the specific set(s) of mutations causing the problem. "Just don't modify the resource" is probably insufficient, as the resource will then be modified every time the resource version is incremented.
  2. How might this interact with something like a GitOps pipeline, which may apply those mutations prior to submission to K8s?

I think the gold standard is to have either a system where divergence is impossible, or one that uses an algorithm that rejects non-convergent rule/resource combinations.
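
A sketch of what such a rejection algorithm could look like: re-apply the full mutator chain until a fixpoint, bailing out after a bounded number of rounds. The `Mutator` interface here is a toy stand-in, not Gatekeeper's API:

```go
// Sketch: reject non-convergent rule/resource combinations by requiring the
// mutator chain to reach a fixpoint within a bounded number of passes.
package sketch

import (
	"fmt"
	"reflect"
)

// Mutator is a toy stand-in for a single mutation rule.
type Mutator interface {
	// Mutate returns a (possibly modified) copy of the unstructured object.
	Mutate(obj map[string]interface{}) map[string]interface{}
}

// converge applies all mutators repeatedly and returns the fixpoint, or an
// error if the chain still produces changes after maxRounds full passes.
func converge(obj map[string]interface{}, mutators []Mutator, maxRounds int) (map[string]interface{}, error) {
	for round := 0; round < maxRounds; round++ {
		next := obj
		for _, m := range mutators {
			next = m.Mutate(next)
		}
		if reflect.DeepEqual(next, obj) {
			return next, nil // a full pass changed nothing: converged
		}
		obj = next
	}
	return nil, fmt.Errorf("no fixpoint after %d rounds; rejecting rule/resource combination", maxRounds)
}
```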

re: one mutating webhook...

I was presuming we'd only have one mutating webhook. The problem is that even a minimal piece of mutation logic is at risk of being non-convergent. For example, "add one entry to the list" will grow a list to infinity over infinite iterations.

K8s sidesteps this by requiring webhooks to be idempotent; we can do the same here (though I don't know if we can enforce idempotency at a code level).
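
A toy example of the difference, in Go:

```go
// appendEntry is the non-convergent "add one entry to the list" case: every
// webhook pass adds another copy, so the list never stops growing.
// ensureEntry is the idempotent variant: applying it twice equals once.
package sketch

func appendEntry(list []string, entry string) []string {
	return append(list, entry) // NON-convergent: grows on every pass
}

func ensureEntry(list []string, entry string) []string {
	for _, e := range list {
		if e == entry {
			return list // already present; no change, fixpoint reached
		}
	}
	return append(list, entry)
}
```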

If users have multiple Mutation primitives (the same way we have multiple Constraint primitives now), then each Mutation primitive will essentially be its own mini webhook.

mmirecki commented 3 years ago

> The UID-tracking idea seems like a good one. It would definitely detect not-yet-converged resources. Some questions:
>
>   1. Once we do detect non-convergent mutations, what do we do? It may be hard to identify the specific set(s) of mutations causing the problem. "Just don't modify the resource" is probably insufficient, as the resource will then be modified every time the resource version is incremented.

Setting status to indicate failed validation would be the easy answer. For changing the resource: the mutation webhook would fail if the constraint set is not idempotent. As for changing the constraints: could we verify a constraint (run it on all affected resources) before allowing a constraint to be added or modified? This would require another validating webhook, and would require testing all constraints for a given resource for idempotency (after our constraint is applied). After being applied, the changed constraint would have to be checked to yield no additional patches.
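
A sketch of that verification step, under the assumption that `applyAll` runs the whole candidate rule set once over a resource (all names here are hypothetical):

```go
// Sketch: before admitting a new or changed rule, dry-run the full rule set
// against the affected resources and require that a second pass yields no
// additional patches.
package sketch

import (
	"fmt"
	"reflect"
)

// validateRuleChange admits the candidate rule set only if re-applying it
// to an already-mutated resource produces no further changes.
func validateRuleChange(
	resources []map[string]interface{},
	applyAll func(map[string]interface{}) map[string]interface{},
) error {
	for _, r := range resources {
		once := applyAll(r)
		if twice := applyAll(once); !reflect.DeepEqual(once, twice) {
			return fmt.Errorf("rule set is not idempotent for a matching resource; rejecting change")
		}
	}
	return nil
}
```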

maxsmythe commented 3 years ago

We talked about this some in the meeting this week.

Rejecting the request because the mutations for it would not be idempotent is plausible. It misdirects the warning a bit, as it puts the pain on the user making the request rather than on the admin maintaining the ill-behaved mutation. There is also some danger that such rejections could start causing important requests, such as leader election, to be denied.

Running mutations against all extant resources as a validation step for changing mutations is likely not viable. K8s defaults to a 30 second timeout for API requests. In large enough clusters, it would be impossible to finish that validation within 30 seconds. We've seen this in audit, which has similar logic and can take longer than 30 seconds to complete a run.

ritazh commented 2 years ago

@mmirecki are you still looking into this?

mmirecki commented 1 year ago

@ritazh someday yes, but I'm overloaded with other items atm