sigstore / policy-controller

Sigstore Policy Controller - an admission controller that can be used to enforce policy on a Kubernetes cluster based on verifiable supply-chain metadata from cosign

Liveness and Readiness Probes Consistently Failing #824

Open throwanexception opened 1 year ago

throwanexception commented 1 year ago

Description

We're testing out the policy-controller, and the readiness and liveness probes for the cosign-policy-controller-webhook begin to fail after an extended period of time (~18-24 hours). Up until then the deployment appears to work correctly.

44m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Readiness probe failed: Get "https://x:8443/readyz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
7s          Warning   BackOff            pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Back-off restarting failed container
14m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Liveness probe failed: Get "https://x:8443/healthz": read tcp x:56308->x:8443: read: connection reset by peer
40m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-m8phs   Liveness probe failed: Get "https://x:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

After this, the Deployment will continually crash every few minutes.

We've also noticed errors about the image digest: admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image.

Upon retry, it will (usually) resolve the image to a digest correctly.
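For clarity on what the webhook is checking: it needs the container image to be pinned by digest once resolution has run. A made-up example of the two forms (the registry, repository, tag, and digest below are all placeholders):

# Tag reference, as written in our manifests (placeholder values):
image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:v1.2.3
# Digest-pinned reference the webhook resolves it to when resolution succeeds:
image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

The error above seems to fire when that tag-to-digest resolution doesn't happen.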

Our setup uses IRSA to attach the WebIdentityToken to the pod. This is natively supported by go-containerregistry, so it seems to work correctly here, but we're unsure whether it's related. The images we're pulling are from ECR, so the IRSA WebIdentityToken provides the permissions to access them.
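For context, a minimal sketch of the IRSA wiring (the service account name and role ARN below are placeholders, not our real values):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: policy-controller-webhook            # placeholder name
  namespace: cosign-system
  annotations:
    # The EKS pod identity webhook injects AWS_ROLE_ARN and
    # AWS_WEB_IDENTITY_TOKEN_FILE into pods using this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ecr-image-pull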

The image policy we're using is a single ECDSA P-256 public key to verify our images, so it seems unlikely to be related.
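For completeness, a trimmed sketch of the shape of that policy (the name, glob, and key below are placeholders; we're on the v1beta1 ClusterImagePolicy API):

apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: ecr-images-signed                    # placeholder name
spec:
  images:
    - glob: "123456789012.dkr.ecr.*.amazonaws.com/**"   # placeholder glob
  authorities:
    - key:
        data: |
          -----BEGIN PUBLIC KEY-----
          ...
          -----END PUBLIC KEY-----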

Our clusters are quite active, especially with the constant synthetic health checking we have going, so images are being pulled frequently for end-to-end testing. I enabled knative debug logging by changing the ConfigMaps for the services, but the debug output has not been helpful so far.
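Roughly what I changed for the logging (a sketch, assuming the knative-style config-logging ConfigMap; the actual ConfigMap name in the chart may differ, so check kubectl get cm -n cosign-system first):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging                 # name assumed; may differ in your install
  namespace: cosign-system
data:
  zap-logger-config: |
    {
      "level": "debug",
      "encoding": "json",
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoderConfig": {
        "timeKey": "ts",
        "levelKey": "level",
        "messageKey": "msg",
        "levelEncoder": "lowercase",
        "timeEncoder": "iso8601"
      }
    }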

Any guidance or help would be appreciated!

Version

v0.7.0 of the policy-controller

hectorj2f commented 1 year ago

@throwanexception I'd recommend using our latest version; we've simplified the deployment to use a single webhook. Please verify whether you're still experiencing the crash there.

throwanexception commented 1 year ago

@throwanexception I'd recommend using our latest version; we've simplified the deployment to use a single webhook. Please verify whether you're still experiencing the crash there.

After about a week of constant usage with the v0.8.0 release on our clusters, we're seeing the same issue I reported for v0.7.0. The policy-controller begins to time out its readiness/liveness probes and is restarted by the kubelet. We also see the same errors around the image digest when this occurs. From what I can observe, memory usage is growing unbounded (possibly a leak?):

$ k top pod
NAME                                 CPU(cores)   MEMORY(bytes)
policy-controller-555465fd55-g67kc   1332m        1730Mi
policy-controller-555465fd55-mmtw4   1166m        1282Mi
policy-controller-555465fd55-qpcrt   948m         1511Mi

hectorj2f commented 1 year ago

admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image

This error is expected whenever the image reference cannot be resolved to a digest.

Regarding the growing memory usage, I'd watch the logs to identify what is going on in the controller. We're using the policy-controller in our cluster and we haven't experienced this memory growth.
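If the logs don't show anything, a heap profile would help. A sketch of one way to get one, assuming the webhook is built on knative's sharedmain and honors the standard config-observability ConfigMap (I haven't verified that this knob is wired up in the chart, so treat it as a guess):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability           # knative-style name assumed; may not exist in every install
  namespace: cosign-system
data:
  profiling.enable: "true"             # should expose /debug/pprof on port 8008 if supported

With that in place you could port-forward port 8008 and pull the heap profile from /debug/pprof/heap with go tool pprof to see what's accumulating.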

The policy-controller begins to time out its readiness/liveness probes and is restarted by the kubelet

This is odd. I'd try raising the liveness/readiness probe timeouts to see whether it's an issue related to the growing memory or CPU consumption.
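Something like the following on the webhook Deployment (a sketch only; the container name and thresholds are assumptions, and the paths/port come from your events above):

spec:
  template:
    spec:
      containers:
        - name: policy-webhook                 # container name assumed; check your deployment
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8443
              scheme: HTTPS
            timeoutSeconds: 10                 # up from the Kubernetes default of 1s
            failureThreshold: 6
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8443
              scheme: HTTPS
            timeoutSeconds: 10
            failureThreshold: 6

That at least separates "the probe is too aggressive" from "the process is genuinely wedged".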

Did you see this memory growth with v0.7.0 too?

austinorth commented 4 months ago

FWIW, I'm also seeing a memory leak on v0.9.0 and v0.8.4.

austinorth commented 4 months ago

Yesterday, I discovered someone had set the --policy-resync-period flag to 1m. My working theory is that the in-memory cache can't handle that frequency, as the default is every 10h. 🤔 I'm testing a revert to the default frequency today to see if that makes a difference.
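For anyone else checking for this, the flag shows up in the webhook container's args; roughly what I'm reverting looks like this (the container name and arg layout are from my memory of the deployment, so adjust for yours):

spec:
  template:
    spec:
      containers:
        - name: policy-webhook                       # container name assumed
          args:
            - --policy-resync-period=10h             # back to the stated default instead of 1m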