kubewarden / audit-scanner

Reports evaluation of existing Kubernetes resources with your already deployed Kubewarden policies.
https://kubewarden.io
Apache License 2.0

Audit Scanner fails completely with typo in policy #307

Open Martin-Weiss opened 5 days ago

Martin-Weiss commented 5 days ago

Is there an existing issue for this?

Current Behavior

Using this policy (example from https://artifacthub.io/packages/kubewarden/verify-image-signatures/verify-image-signatures)

apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: verify-image-signatures
spec:
  module: ghcr.io/kubewarden/policies/verify-image-signatures:v0.1.7
  rules:
  - apiGroups: ["", "apps", "batch"]
    apiVersions: ["v1"]
    resources: ["pods", "deployments", "statefulsets", "replicationcontrollers", "jobs", "cronjobs"]
    operations:
    - CREATE
    - UPDATE
  mutating: true
  settings:
    signatures:
    - image: "*"
      githubActions:
        owner: "kubewarden"
        repo: "app-example" 

with kubewarden-controller-2.1.0, kubewarden-crds-1.5.1 and kubewarden-defaults-2.0.3 ends up with the audit-scanner failing fatally with the following:

kubectl -n kubewarden logs audit-scanner-28655400-49zwv 
{"level":"info","RunUID":"b87b30fc-5633-42c8-ab21-bf585d6826a6","time":"2024-06-25T14:00:01Z","message":"clusterwide resources scan started"}
{"level":"fatal","error":"no matches for apps/v1, Resource=pods","time":"2024-06-25T14:00:02Z","message":"Error on cmd.Execute()"}

Expected Behavior

The audit scanner must not fail completely if a policy is not perfectly configured.

Steps To Reproduce

See above - more details here: https://github.com/Martin-Weiss/rancher-fleet/tree/bb68ca7288b42dcc7ddea27d5ffabde7fe241d22/kubewarden

Environment

rke2 1.28.10 x86_64 on linux

Anything else?

No response

viccuad commented 3 days ago

Hi, there are several problems here.

The reproducer is misconfigured:

At first glance, one could think that the reproducer was crafted from the README.md example, but the rules don't match it.

The easiest solution is to use kwctl, which takes the policy metadata into account to produce a correct ClusterAdmissionPolicy that you can then adapt. This is normally shown by the Artifact Hub "Install" button (but not for this policy, as its latest release happened a while ago).

kwctl pull ghcr.io/kubewarden/policies/verify-image-signatures:v0.2.8
kwctl scaffold manifest -t ClusterAdmissionPolicy registry://ghcr.io/kubewarden/policies/verify-image-signatures:v0.2.8

Solutions

Policy targeting nonexistent resources

The reproducer policy has wrong and ambiguous spec.rules. Because apiGroups, apiVersions and resources are matched as a cross product, the policy would apply, for example, to apiGroup apps, apiVersion v1, resource pods, which doesn't exist (the apiGroup for Pods and replicationcontrollers is the empty core group; deployments and statefulsets live in apps, jobs and cronjobs in batch), so the rule should be split per API group.

A request for this specific GroupVersionKind would never come from the K8s API Server, but can come from the audit-scanner, as we don't use the spec.rules until policy execution.
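
For illustration only, here is a standalone Go sketch (not the actual audit-scanner code) that resolves the same group/version/resource combination through a discovery-backed RESTMapper from client-go; for the apps/v1 + pods combination produced by the reproducer's rules, the lookup fails with an error like the one in the log above:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/discovery"
    "k8s.io/client-go/restmapper"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client config from the default kubeconfig location.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // Build a RESTMapper from the API groups and resources the cluster actually serves.
    groupResources, err := restmapper.GetAPIGroupResources(dc)
    if err != nil {
        panic(err)
    }
    mapper := restmapper.NewDiscoveryRESTMapper(groupResources)

    // One of the combinations produced by the reproducer's spec.rules.
    gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "pods"}
    if _, err := mapper.KindFor(gvr); err != nil {
        // Fails with an error like: no matches for apps/v1, Resource=pods
        fmt.Println("lookup failed:", err)
        return
    }
    fmt.Println("resource exists in the cluster")
}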

It would be good to validate the spec.rules ahead of time. Yet the policy spec.rules can't be validated on policy instantiation, as the GroupVersionKind resources available in the cluster can vary over time (with K8s upgrades, or CRDs for example). I will see if anything can be done in the audit-scanner.
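
As a rough sketch of what such a check could look like (assuming the policy rules are plain admissionregistration.k8s.io/v1 RuleWithOperations entries, and ignoring wildcards and subresources), the audit-scanner could expand each rule through the discovery API and skip the combinations the cluster doesn't serve:

// Hypothetical helper, not the current audit-scanner code.
package rules

import (
    admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/discovery"
)

// resolvableGVRs expands one policy rule into only the group/version/resource
// combinations the cluster actually serves, so unknown ones (like apps/v1 pods)
// can be skipped instead of aborting the whole scan.
func resolvableGVRs(dc discovery.DiscoveryInterface, rule admissionregistrationv1.RuleWithOperations) []schema.GroupVersionResource {
    var out []schema.GroupVersionResource
    for _, group := range rule.APIGroups {
        for _, version := range rule.APIVersions {
            gv := schema.GroupVersion{Group: group, Version: version}
            // Unknown group/version pairs simply return an error; skip them.
            list, err := dc.ServerResourcesForGroupVersion(gv.String())
            if err != nil {
                continue
            }
            for _, wanted := range rule.Resources {
                for _, res := range list.APIResources {
                    if res.Name == wanted {
                        out = append(out, gv.WithResource(res.Name))
                    }
                }
            }
        }
    }
    return out
}

Rules that expand to an empty list could then be surfaced in the scan results instead of turning into a fatal error.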

Audit-scanner erroring and stopping when a specific policy has an error

If the audit-scanner sends a request to a policy that errors on execution (e.g. the reproducer, but with tag v0.2.8), we currently abort the audit-scan run. We do this for security reasons: this way, misconfigured policies are not silently ignored. We could improve this UX somehow.

Martin-Weiss commented 3 days ago

Thanks a lot for the analysis and the feedback. Basically, I believe that an error in the audit scanner leading to a full stop of the complete cluster scan is not good practice. Admins might not even realize that the audit-scanner pod fails on its scheduled runs after they adjust any of the policies.

To address this, I believe we need some sort of policy validation in the policy server itself, and the policy server should then fail to start. Also keep in mind that the audit scanner is optional functionality that might not even be enabled.

In addition, the audit-scanner logs do not give any good indication of which policy is failing and why. From a usability point of view this should be changed: the audit scan should continue scanning everything else, but the reports it creates should show where something is not OK.

viccuad commented 3 days ago

Indeed.

some sort of policy validation in the policy server itself, and the policy server should then fail to start

This is already the case. It happens for the reproducer provided.

On this issue, there are two reasons for errors in the audit scan run. Nevertheless, the audit scan run should continue gracefully. I will work on both:

  1. An error in the audit-scanner code. This is the case here: a policy with incorrect spec.rules. The k8s client used in the audit-scanner fails, and we bubble the error up into a fatal error. We should instead log the error and continue the audit scan (see the sketch after this list).
  2. An error in the policy evaluation. We should log the error and continue the audit scan (it seems we do that already).
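
As a minimal sketch of the control-flow change for point 1 (Policy and scanPolicy are hypothetical stand-ins for the real audit-scanner types, and the logging only mimics the zerolog-style output shown earlier), per-policy errors would be logged and counted instead of bubbling up into a fatal log:

package scan

import (
    "context"

    "github.com/rs/zerolog/log"
)

// Policy and scanPolicy are placeholders for the real audit-scanner types.
type Policy struct{ Name string }

func scanPolicy(ctx context.Context, p Policy) error { return nil }

// runScan logs per-policy failures and keeps scanning instead of aborting the whole run.
func runScan(ctx context.Context, policies []Policy) {
    failed := 0
    for _, policy := range policies {
        if err := scanPolicy(ctx, policy); err != nil {
            log.Error().Err(err).Str("policy", policy.Name).Msg("policy scan failed, continuing with the remaining policies")
            failed++
        }
    }
    if failed > 0 {
        log.Warn().Int("failedPolicies", failed).Msg("audit scan finished with errors")
    }
}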