sse-secure-systems / connaisseur

An admission controller that integrates Container Image Signature Verification into a Kubernetes cluster
https://sse-secure-systems.github.io/connaisseur/
Apache License 2.0

Cosign validation works for about 6 hours and then we start getting validation errors with Connaisseur application version 3.6.1 and chart version 2.6.1 #1765

Open edison-vflow opened 5 hours ago

edison-vflow commented 5 hours ago

Describe the bug

When using Connaisseur application version 3.6.1 and chart version 2.6.1 on EKS v1.30, with cosign validators that use auth.secretName and ECR as the image registry, Connaisseur initially validates images correctly; after about 6 hours, validation starts failing with

{
  "level": "error",
  "msg": "error validating Deployment ABCD: error during cosign validation of image AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/APPLICATION: error validating image: [GET https://AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/v2/APPLICATION/manifests/*****: DENIED: Your authorization token has expired. Reauthenticate and try again.]",
  "time": "2024-09-20T16:45:00Z"
}

The validator section is defined as

application:
  validators:
  - name: awsvalidator
    type: cosign
    auth:
      secretName: 'ecr-credentials'
    trustRoots:
    - name: ecr-cosign
      key: ${container_verification_kms_arn}
  - name: allow
    type: static
    approve: true
  - name: deny
    type: static
    approve: false
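
For reference, the ecr-credentials secret referenced by auth.secretName is a standard dockerconfigjson pull secret, roughly like the following (the namespace is an assumption for illustration, and the placeholder stands for the actual base64-encoded Docker config):

apiVersion: v1
kind: Secret
metadata:
  name: ecr-credentials
  namespace: connaisseur   # assumed: the namespace Connaisseur runs in
type: kubernetes.io/dockerconfigjson
data:
  # base64-encoded Docker config containing the short-lived ECR token
  .dockerconfigjson: <base64-encoded docker config with the ECR token>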

The issue happens for the awsvalidator, which needs the ECR credentials provided via the ecr-credentials secret. On an initial run, validation works for about 6 hours, though the exact time varies.

After those 6 or so hours, we start getting the error highlighted above.

From the moment validation starts failing, various operations in the cluster are blocked, such as deployment rollouts.

What we notice is that if we restart Connaisseur, validation starts working again until the next expiration.

We have a cronjob that runs every 6 hours, to cater for the fact that the ECR token expires after 12 hours. This cronjob refreshes the ecr-credentials secret that the Connaisseur validator is using; for the refresh we use https://github.com/nabsul/k8s-ecr-login-renew. The refresh itself seems to be working, as a restart of Connaisseur always fixes the problem and the restarted instance uses the refreshed token.
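
For completeness, the refresh job looks roughly like the following sketch (schedule and namespace are from our setup; the environment variable names are as we understand them from the k8s-ecr-login-renew README, so treat them as illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-credentials-renew
  namespace: connaisseur
spec:
  schedule: "0 */6 * * *"   # every 6 hours, half the 12-hour ECR token lifetime
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ecr-credentials-renew   # needs RBAC to get/update the secret
          restartPolicy: Never
          containers:
          - name: renew
            image: nabsul/k8s-ecr-login-renew
            env:
            - name: DOCKER_SECRET_NAME    # the secret to (re)write
              value: ecr-credentials
            - name: TARGET_NAMESPACE      # the namespace to write it into
              value: connaisseur
            - name: AWS_REGION
              value: us-east-1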

It looks like the Connaisseur validator that uses the auth.secretName mechanism reads the token once at startup, but has no way of re-reading it when it is refreshed in the ecr-credentials secret, the same secret it reads from at start-up.

Would this explain why a restart of Connaisseur always seems to fix the issue?

Another test we did was to explicitly run the token renewal job at the moment Connaisseur validation fails, to force a token refresh. The credentials are renewed, but they are not picked up by the running Connaisseur instance.
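
As a stopgap, we are considering forcing a rollout restart right after each refresh, since a restart reliably picks up the new token. A minimal sketch of such a job, assuming the Deployment is named connaisseur-deployment and a service account with permission to patch it exists (both are assumptions for illustration):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: connaisseur-restart
  namespace: connaisseur
spec:
  schedule: "5 */6 * * *"   # a few minutes after the token refresh job
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: connaisseur-restarter   # assumed SA with rights to patch the deployment
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl
            # trigger a rolling restart so new pods read the refreshed secret
            args: ["rollout", "restart", "deployment/connaisseur-deployment", "-n", "connaisseur"]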

Could you give guidance on how best to solve this issue, or perhaps on what other clients that use auth.secretName for cosign are doing to always have Connaisseur use the latest token?

Expected behavior

  • When Connaisseur cosign validation works initially, it should keep working even after the initially read token expires, with the ability to refresh the token, rather than requiring a manual restart of Connaisseur for the new token to be picked up.

Optional: To reproduce

To reproduce, install Connaisseur application version 3.6.1 and chart version 2.6.1 on AWS EKS v1.30 and configure your validators section as shown above. This is a setup where the trust roots are taken from KMS and cosign uses auth.secretName with a secret following the dockerconfigjson mechanism (https://sse-secure-systems.github.io/connaisseur/v3.6.1/validators/sigstore_cosign/#dockerconfigjson).

Optional: Versions (please complete the following information as relevant):

  • OS: Amazon Linux
  • Kubernetes Cluster: EKS 1.30
  • Notary Server: -
  • Container registry: containerd
  • Connaisseur: chart 2.6.1, application 3.6.1
  • Other: -

Optional: Additional context

  • The same deployment also shows redis errors in the redis pod. Not sure if this affects the validation somehow, but it is worth mentioning. The redis issue was raised separately as its own issue: https://github.com/sse-secure-systems/connaisseur/issues/1764
edison-vflow commented 5 hours ago

cc @phbelitz @chrysogonus