kubewarden / rfc

Kubewarden's RFCs
https://github.com/kubewarden/

RFC: PolicyServer should report the status of a Policy #43

Open fabriziosestito opened 6 months ago

fabriziosestito commented 6 months ago

Is your feature request related to a problem?

This change is needed to fix an issue with the Policy Server. Right now, if one of the policies can't be loaded (for example because it can't be downloaded from the registry or its settings are invalid), the Policy Server crashes. In Kubernetes, the pod running the Policy Server keeps restarting and eventually ends up in a CrashLoopBackOff. The only way to fix this is for someone to inspect the error message from the Policy Server pod and correct the problem.

This is risky: when a policy is updated or added, the new Policy Server pods might crash, while the old pods, which still run the previous working configuration, can be lost if something happens to the node hosting them. As a result, the cluster can break, since all incoming admission requests are rejected when no working PolicyServer instance is left.

Solution you'd like

Hot-reload

A hot-reload mechanism would allow the Policy Server to reload the policies without restarting the process.
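As a rough sketch of how such a reload could be triggered, assuming the policies configuration is mounted as a file (for example from a ConfigMap) and watched with fsnotify; the `reloadPolicies` callback is hypothetical:

```go
package hotreload

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// WatchConfig invokes reloadPolicies whenever the mounted policies file
// changes, so the running process can rebuild its policy evaluators
// without restarting. reloadPolicies is a hypothetical callback.
func WatchConfig(path string, reloadPolicies func() error) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add(path); err != nil {
		return err
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// ConfigMap mounts are updated via symlink swaps, so react to
			// writes, creations and removals alike.
			if event.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove) != 0 {
				if err := reloadPolicies(); err != nil {
					log.Printf("policy reload failed: %v", err)
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}
```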

Policy CRD status update

The Policy Server should be able to update the status of the Policy CRD, to report if the Policy is in a valid state or not.
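A minimal sketch of such a status update using a controller-runtime client; the group/version/kind and the `status.policyStatus` field name are assumptions of this example, not the definitive Kubewarden schema:

```go
package status

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SetPolicyStatus records the observed state of a policy (e.g. "active",
// "error", "degraded") in the status subresource of its CRD.
// The GVK and the "policyStatus" field name are assumptions made for this
// sketch; the real Kubewarden API types may differ.
func SetPolicyStatus(ctx context.Context, c client.Client, name, state string) error {
	policy := &unstructured.Unstructured{}
	policy.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "policies.kubewarden.io",
		Version: "v1",
		Kind:    "ClusterAdmissionPolicy",
	})
	if err := c.Get(ctx, client.ObjectKey{Name: name}, policy); err != nil {
		return err
	}
	if err := unstructured.SetNestedField(policy.Object, state, "status", "policyStatus"); err != nil {
		return err
	}
	// Only the status subresource is written, so the spec stays untouched.
	return c.Status().Update(ctx, policy)
}
```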

Proposal 1

Prerequisites: The PolicyServer should be able to update the status of the Policy CRD. Nice to have: Hot-reloading capabilities.

Instead of crashing when there’s an error, the Policy Server should start normally, but Kubewarden should notify the user about the Policy error. The validate endpoint for any policy with errors should reject all requests.
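A sketch (in Go, for illustration only) of what a reject-all answer from the validate endpoint could look like for a policy that failed to load; the error message wording is an assumption:

```go
package server

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rejectAll answers every admission review targeting a policy that failed to
// load with a denial, instead of letting the whole process crash.
func rejectAll(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if review.Request == nil {
		http.Error(w, "missing admission request", http.StatusBadRequest)
		return
	}

	response := admissionv1.AdmissionReview{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "admission.k8s.io/v1",
			Kind:       "AdmissionReview",
		},
		Response: &admissionv1.AdmissionResponse{
			UID:     review.Request.UID,
			Allowed: false,
			Result: &metav1.Status{
				Message: "policy could not be loaded; rejecting all requests",
			},
		},
	}

	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(response)
}
```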

If there are multiple replicas of the Policy Server and all of them report an error, the Policy status should be set to Error (meaning the policy settings are invalid). If only some of the replicas report an error (for example, a single replica hit by a network issue), the Policy status should be marked as Degraded.
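A small helper illustrating this aggregation rule; the per-replica report map is an assumption of the sketch:

```go
package status

// AggregateState derives the policy-level status from per-replica reports,
// following the proposal: Error when every replica failed to load the policy,
// Degraded when only some did, Active otherwise. The replicaErrored map
// (replica name -> failed to load?) is an assumption of this sketch.
func AggregateState(replicaErrored map[string]bool) string {
	failed := 0
	for _, errored := range replicaErrored {
		if errored {
			failed++
		}
	}
	switch {
	case len(replicaErrored) == 0:
		return "Unknown"
	case failed == len(replicaErrored):
		return "Error"
	case failed > 0:
		return "Degraded"
	default:
		return "Active"
	}
}
```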

The PolicyServer with the error should keep trying to reload the policy using a backoff mechanism (retrying at increasing intervals). If the PolicyServer pod is removed (for example, when the deployment is scaled down), the controller should reset the Policy status to Active.
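A sketch of such a backoff loop; the `reload` callback standing in for the actual policy re-download and re-initialization is hypothetical:

```go
package reload

import (
	"context"
	"time"
)

// RetryWithBackoff keeps calling reload until it succeeds or the context is
// cancelled, doubling the wait between attempts up to a cap.
func RetryWithBackoff(ctx context.Context, reload func() error) error {
	wait := time.Second
	const maxWait = 5 * time.Minute
	for {
		if err := reload(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		if wait < maxWait {
			wait *= 2
			if wait > maxWait {
				wait = maxWait
			}
		}
	}
}
```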

Proposal 2

Prerequisites: The PolicyServer should be able to update the status of the Policy CRD. Nice to have: Hot-reloading capabilities.

We can add a pre-validation step to check if a Policy is valid before sending it to the PolicyServers. When a new Policy is created or updated, the controller will start the pre-validation by calling a new endpoint on the PolicyServer.

If there are multiple replicas of the PolicyServer, only one of them will handle the pre-validation, since the controller contacts the endpoint through the PolicyServer's Kubernetes Service. After the Policy is validated, the PolicyServer should update the Policy CRD to report the status to the controller.
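A sketch of how the controller could call such an endpoint; both the `/pre-validate` path and the payload are assumptions, since the proposal only mentions "a new endpoint on the PolicyServer":

```go
package prevalidation

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// PreValidate asks a PolicyServer to dry-load a policy before it is added to
// the ConfigMap. The /pre-validate path and the request payload are
// hypothetical.
func PreValidate(ctx context.Context, serviceURL string, policy any) error {
	body, err := json.Marshal(policy)
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, serviceURL+"/pre-validate", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pre-validation failed: %s", resp.Status)
	}
	return nil
}
```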

If there’s an error, the controller won't update the PolicyServer’s ConfigMap, so the Policy won’t be applied. If there’s a network error, the PolicyServer should keep retrying validation using a backoff mechanism (increasing wait time between retries).

The controller should save the last valid Policy in a ControllerRevision resource. This way, if a new version of a Policy fails validation, the last valid Policy can still be used to build the ConfigMap.
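A sketch of storing the last valid Policy in a ControllerRevision; the naming scheme and label are assumptions of this example:

```go
package revision

import (
	"context"
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SaveLastValidPolicy stores the last known-good policy definition in a
// ControllerRevision so the controller can rebuild the ConfigMap from it if a
// later revision fails pre-validation. Name and label are assumptions.
func SaveLastValidPolicy(ctx context.Context, c client.Client, namespace, policyName string, policySpec any, revisionNumber int64) error {
	raw, err := json.Marshal(policySpec)
	if err != nil {
		return err
	}

	revision := &appsv1.ControllerRevision{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%d", policyName, revisionNumber),
			Namespace: namespace,
			Labels:    map[string]string{"kubewarden.io/policy": policyName},
		},
		Data:     runtime.RawExtension{Raw: raw},
		Revision: revisionNumber,
	}
	return c.Create(ctx, revision)
}
```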

Open questions: how to create a rollback mechanism if a CRD is updated with an invalid Policy? Should we automatically roll back to the last valid Policy if the pre-validation fails?

Alternatives you've considered

No response

Anything else?

No response

ferhatguneri commented 4 months ago

@flavio This is an important spike that I believe needs priority to make the Kubewarden Policy Server more resilient and a better product. We are wondering when this can be picked up so we can see some progress on it?

flavio commented 3 weeks ago

@ferhatguneri we've done some investigation and updated the issue with two proposals. My favorite is the second one.

It would be great if you could provide your feedback.

flavio commented 2 weeks ago

Let's write this RFC as part of 1.18, then work on this feature as part of 1.19

viccuad commented 2 weeks ago

I favour Proposal 1. In addition, users could set the policies' spec.failurePolicy so that errored policies may be ignored, or we could add a new controller setting that ignores policies in a degraded state. While this is of course a security relaxation, it may temporarily ease recovering cluster nodes or performing migrations, without the need to change all deployed policies. This ties in with https://github.com/kubewarden/kubewarden-controller/issues/496.

On Proposal 2, I dislike the indirection between the instantiated policies and the actually validated policies that would end up in the ControllerRevision and the PolicyServer's ConfigMap. Users may need to do extra work to know which policies are actually deployed and which ones got rolled back. In the case of a node crash where we lose the old versions of the policies that would be in the ControllerRevision, we may end up with the same problem as in Proposal 1, having only added indirection.