throwanexception opened this issue 1 year ago (Open)
@throwanexception I'd recommend trying our latest version; we've simplified the deployment to use a single webhook. Could you verify whether you still hit the crash there?
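If the controller was installed from the sigstore Helm repo, the upgrade is roughly the sketch below; the release name (cosign) and namespace are guesses based on the resource names in this thread, so adjust to your install.

# Assumes a Helm-based install; release name "cosign" and namespace "cosign-system" are assumptions from this thread.
helm repo add sigstore https://sigstore.github.io/helm-charts
helm repo update
helm upgrade cosign sigstore/policy-controller -n cosign-system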
Took about a week of constant usage with the 0.8.0 release on our clusters, and we're seeing a similar issue to the one I reported for v0.7.0. The policy-controller begins to time out its readiness / liveness probes and is restarted by the kubelet. We also see the same exceptions around the image digest when this occurs. From what I can observe, the memory usage is growing unbounded (possibly a leak?):
(cosign-system)$ k top pod
NAME                                 CPU(cores)   MEMORY(bytes)
policy-controller-555465fd55-g67kc   1332m        1730Mi
policy-controller-555465fd55-mmtw4   1166m        1282Mi
policy-controller-555465fd55-qpcrt   948m         1511Mi
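To confirm the growth really is unbounded rather than plateauing, a crude sampling loop like this can be left running; it only assumes kubectl access, metrics-server, and the cosign-system namespace, and the log file name is arbitrary:

# Sample each policy-controller pod's memory once a minute with a UTC timestamp.
while true; do
  kubectl top pod -n cosign-system --no-headers \
    | awk -v ts="$(date -u +%FT%TZ)" '{print ts, $1, $3}' >> policy-controller-mem.log
  sleep 60
done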
admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image
This error is supposed to happen whenever the image reference cannot be resolved to a digest.
Regarding the growing memory usage, I'd watch the logs to identify what is going on in the controller. We're using the policy-controller in our cluster and we haven't experienced this growing memory behaviour.
The policy-controller begins to time out its readiness / liveness probes and is restarted by the kubelet
This is weird. I'd try changing the values of the liveness / readiness probes to see whether the restarts are tied to the growing memory or CPU consumption (e.g. the patch sketched after this comment).
Did you see this growing memory behaviour with v0.7.0 too?
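A concrete sketch of the probe change mentioned above: the deployment name (policy-controller) is taken from the pod names earlier in the thread, the container index and the probe fields already existing in the manifest are assumptions, and the numbers are purely illustrative.

# Loosen probe timing so slow responses don't immediately trigger restarts.
kubectl -n cosign-system patch deployment policy-controller --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5}
]'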
Fwiw, I'm also seeing a memory leak on v0.9.0 and v0.8.4.
Yesterday, I discovered someone had set the --policy-resync-period flag to 1m. My working theory is that the in-memory cache can't handle that frequency, as the default is every 10h. 🤔 Testing reverting to the default frequency today to see if that makes a difference.
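A quick way to check whether that override is still in place (the deployment name here is a guess, so adjust to your install):

# Print the webhook container's args and look for --policy-resync-period.
kubectl -n cosign-system get deploy policy-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
# If it shows --policy-resync-period=1m, drop the flag (or move it back toward the 10h default)
# in the manifest or Helm values, then watch whether memory still climbs.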
Description
We're testing out the policy-controller, and the Readiness and Liveness probes for the cosign-policy-controller-webhook begin to fail after an extended amount of time (~18-24 hours). Up until then the deployment appears to work correctly. After this, the Deployment will continually crash every few minutes.
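For what it's worth, the restart reason (probe failure vs. OOM kill) can be checked with something like the commands below; the label selector is a guess and <webhook-pod> is a placeholder:

# Look at the last termination state and recent events for the webhook pods.
kubectl -n cosign-system get pods -l app.kubernetes.io/name=policy-controller
kubectl -n cosign-system describe pod <webhook-pod> | grep -A5 'Last State'
kubectl -n cosign-system get events --sort-by=.lastTimestamp | grep -iE 'unhealthy|probe|killed'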
We've also noticed that we'll get errors about the image digest:
admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image
Upon retry, it will (usually) resolve the image to a digest correctly.
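As a sanity check outside the webhook path, the same tag can be resolved manually with crane (which ships with go-containerregistry); the image reference below is only a placeholder, and this assumes the same ECR credentials are available locally:

# Resolve a tag directly; failures here would point at registry/auth flakiness
# rather than the policy-controller itself.
crane digest 123456789012.dkr.ecr.us-east-1.amazonaws.com/example-app:latest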
Our setup uses IRSA to attach the WebIdentityToken to the pod. This is natively supported by go-containerregistry, so it seems to work correctly here, but I'm unsure whether it might be related. The images we're pulling are from ECR, so the IRSA WebIdentityToken is what provides the permissions to access them. The image policy we're using is a single ECDSA P-256 public key to verify our images, so it seems unlikely to be related.
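For reference, the policy is roughly of this shape; the name, glob, and key material below are placeholders, and the v1beta1 schema should be checked against the CRD that is actually installed:

# Minimal single-key ClusterImagePolicy sketch (all values are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: ecr-images-signed
spec:
  images:
    - glob: "123456789012.dkr.ecr.*.amazonaws.com/**"
  authorities:
    - key:
        data: |
          -----BEGIN PUBLIC KEY-----
          <ecdsa-p256 public key here>
          -----END PUBLIC KEY-----
EOF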
Our clusters are quite active, especially with the constant synthetic health checking we run, so images are being pulled frequently for end-to-end testing. I enabled knative debug logging by changing the ConfigMaps for the services, but the debug output has not been helpful so far.
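Since the controller is built on knative.dev/pkg, it may also expose the standard knative profiling endpoint; if that is wired up in this release, a heap profile would say more about the growth than debug logs. A rough sketch, assuming the config-observability ConfigMap name and the default 8008 profiling port:

# Enable profiling (ConfigMap name may differ per install), then pull a heap profile.
kubectl -n cosign-system patch configmap config-observability --type merge \
  -p '{"data":{"profiling.enable":"true"}}'
kubectl -n cosign-system port-forward deploy/policy-controller 8008:8008 &
go tool pprof -top http://localhost:8008/debug/pprof/heap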
Any guidance or help would be appreciated!
Version
v0.7.0 of the policy-controller