sse-secure-systems / connaisseur

An admission controller that integrates Container Image Signature Verification into a Kubernetes cluster
https://sse-secure-systems.github.io/connaisseur/
Apache License 2.0
442 stars 62 forks source link

Using cosign validation works for about 6 hours and then we start getting validation errors for Connaisseur application version 3.6.1 and chart version 2.6.1 #1765

Open edison-vflow opened 2 months ago

edison-vflow commented 2 months ago

Describe the bug When using Connaisseur application version 3.6.1 and chart version 2.6.1 on EKS v1.30, using cosign validators where auth.secretName is used and ECR is the image registry, Connaisseur can validate images correctly and after about 6 hrs, validation starts failing with

{
  "level": "error",
  "msg": "error validating Deployment ABCD: error during cosign validation of image AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/APPLICATION: error validating image: [GET https://AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/v2/APPLICATION/manifests/*****: DENIED: Your authorization token has expired. Reauthenticate and try again.]",
  "time": "2024-09-20T16:45:00Z"
}

The validator section is defined as

application:
  validators:
  - name: awsvalidator
    type: cosign
    auth:
      secretName: 'ecr-credentials'
    trustRoots:
    - name: ecr-cosign
      key: ${container_verification_kms_arn}
  - name: allow
    type: static
    approve: true
  - name: deny
    type: static
    approve: false

The issue happens for the awsvalidator that needs ECR credentials provided via the secret ecr-credentials On initial run , validation works for about 6 hrs.This time sometimes varies.

After the 6 or so hours, we start getting the error highlighted above.

At the moment that validation starts failing, various operations in the cluster are blocked, like rollout of deployments.

What we notice is that if we restart Connaisseur, then validation starts working again, until the next expiration.

We have a cronjob that runs every 6 hrs, this is to cater for the fact that the ECR token expires after 12 hrs. This refresh ecr cronjob refreshes the ecr-credentials secret that Connaisseur validator is using. For refreshing the token every 6 hrs before expiration, we are using https://github.com/nabsul/k8s-ecr-login-renew The refreshing seems to be working, as the restart of Connaisseur always works and the restart will be using this refreshed token.

Its looking like the Connaisseur validator that uses the auth.secret mechanisms reads the token in once at startup but does not have a way of reading the token when it is refreshed in the ecr-credentials secret, the same secret it is reading from at start-up.

Would this explain why a restart of Connaisseur seems to always fix the issue ?

Another test we did is to explicitly run the token renewal job at the time Connaisseur validation fails to force token refresh. The credentials are renewed but they are not picked up by a running instance of Connaisseur

Could you give guidance on how best to solve this issue or perhaps what other clients that are using the auth.secret for cosign are doing to always have Connaisseur use the latest token

Expected behavior

  • When Connaisseur cosign validation works initially, it should continue working even after token expiration of the initial token it read in, with an ability to refresh the token and not that Connaisseur should be manually restarted for the new token to be picked up

Optional: To reproduce

To reproduce, install Connaisseur application version 3.6.1 and chart version 2.6.1 on AWS EKS v1.30 Configure your validators section as shown above. This is a setup where the trust roots are taken from KMS and the cosign is using auth.secret where the secret is using the [dockerconfigjson](https://sse-secure-systems.github.io/connaisseur/v3.6.1/validators/sigstore_cosign/#dockerconfigjson) mechanism **Optional: Versions (please complete the following information as relevant):** - OS: Amazon Linux - Kubernetes Cluster: EKS 1.30 - Notary Server: - Container registry: containerd - Connaisseur: chart 2.6.1 application 3.6.1 - Other: **Optional: Additional context** * The same deployment also shows redis erros in the redis pod. Not sure if this affects the validation somehow but just worth mentioning The redis issue was raised separately as its own issue https://github.com/sse-secure-systems/connaisseur/issues/1764
edison-vflow commented 2 months ago

cc @phbelitz @chrysogonus

edison-vflow commented 2 months ago

@phbelitz @chrysogonus , another way this could be fixed with less effort is to allow the Connaisseur deployment to add custom annotations from the values.yaml file.

Currently the annotations that can be added on a deployment are restricted

If we can allow custom annotations here

https://github.com/sse-secure-systems/connaisseur/blob/c040f4d4a31394de5c53408fc770978c080b5876/charts/connaisseur/templates/deployment.yaml#L8

Then we can use something like https://github.com/stakater/Reloader where we add the annotation

    reloader.stakater.com/auto: "true"

I see that the ecr-credentials secret ends up being included in the deployment as a volume mount, which makes it a perfect candidate for Reloader to restart the deployment

volumes:
        - name: certs
          secret:
            secretName: connaisseur-tls
        - name: redis-certs
          secret:
            secretName: connaisseur-redis-tls
        - name: app-config
          configMap:
            name: connaisseur-app-config
        - name: connaisseur-alert-config
          configMap:
            name: connaisseur-alert-config

        - name: ecr-credentials-volume
          secret:
            secretName: ecr-credentials

I have played around with this idea further by manually editing the Connaisseur deployment to add Reloader annotations

There were 2 outcomes.

  1. Using https://github.com/nabsul/k8s-ecr-login-renew will not work in its current state.The utility creates a brand new secret called ecr-credentials.Reloader on the other hand can actually do a rollout to a deployment when either a configmap or a secret its using are changed.This does not include when the secret is destroyed and recreated
  2. I did a second test whereby instead of refreshing the ecr-credentials secret the way https://github.com/nabsul/k8s-ecr-login-renew does it, we actually modify the value of the existing secret.This worked perfectly and the Connaisseur deployment was restarted, able to refresh the token from the modified secret

So this is to just share ideas on one way token refresh could be achieved when using the auth.secretName validators mechanism

phbelitz commented 2 months ago

@edison-vflow yikes, thank you for that very thorough digging into the problem. if i understand correctly this whole ordeal needn't to be done when using the useKeychain feature (which as already correctly reported here https://github.com/sse-secure-systems/connaisseur/issues/1766, is wrongly documented), am i right?

edison-vflow commented 2 months ago

Hi @phbelitz This issue is independent of https://github.com/sse-secure-systems/connaisseur/issues/1766

Connaisseur supports multiple mechanisms to login to your repositories when it is doing cosign validation

We tried 2 of them:

  1. Using k8s_keychain https://sse-secure-systems.github.io/connaisseur/v3.6.1/validators/sigstore_cosign/#k8s_keychain
  2. Using dockerconfigjson https://sse-secure-systems.github.io/connaisseur/v3.6.1/validators/sigstore_cosign/#dockerconfigjson

The issue raised here is for 2. Using dockerconfigjson

When using the dockerconfigjson mechanism where you specify auth.secretName as your authentication mechanism to the docker image repositories, we picked up that Connaisseur reads correctly the authentication details from the secret that you give it via auth.secretName

Connaisseur then keeps the authentication details it read from the secret in memory for as long as its pods are running. However, the credentials of most docker repositories expire after 6 to 12 hrs. When the credentials expire, Connaisseur starts failing validation because it has the stale credentials. Connaisseur has no mechanism to auto refresh the credentials loaded in the secret it is referencing, thats why we start having validation failures after token expiration.

The interesting thing is that, on our side, we have a mechanism where we fetch new credentials before they expire, so that the secret is repopulated. So when Connaisseur starts failing validation, due to expired credentials, if we manually restart the Connaisseur deployment, it is able to start working again because it gets the updated values from the secret.

In Kubernetes clusters, this is a common issue that a deployment may need to restart when its associated configMaps or secrets are updated. The issue that Connaisseur is not able to pick up the refreshed tokens could be resolved if Connaisseur Kubernetes deployment allowed modifications to its annotations, so that tools like reloader can restart Connaisseur when its associated secrets are changed. Connaisseur Kubernetes annotations are not modifiable according this code connaisseur/charts/connaisseur/templates/deployment.yaml We should be able to allow custom annotations coming in from values.yaml, see my comments in first messages

-- Issue 1. Using k8s_keychain https://sse-secure-systems.github.io/connaisseur/v3.6.1/validators/sigstore_cosign/#k8s_keychain is a separate issue whereby when you are using the keychain authentication mechanism, the code implementation is different from the documentation and we need the docs to be updated because that capability is to use k8s_keychain is now broken, if you follow the public documentation guidelines

phbelitz commented 2 months ago

@edison-vflow ok, I see three options:

  1. implement the acquisition of renewed tokens inside Connaisseur (which in parts is already implemented by useKeychain), which takes a lot of codeing and time. don't like it.
  2. let Connaisseur read k8s secrets, instead of mounting them. this way Connaisseur will always pick up the newest version of a secret and not remount it after a restart. but this would also mean Connaisseur needs more permissions (ability to read secrets in Connaisseur namespace through API) and this doesn't feel like the way kubernetes intended things to work. so ideally i wanna pass on that too.
  3. your annotation option. i have zero issues with adding customizable annotation to the connaisseur deployment. we could have done that earlier, the use case for that just never popped up (until now).

so i guess i'll prepare a PR with customizable annotations.

edison-vflow commented 2 months ago

@edison-vflow ok, I see three options:

  1. implement the acquisition of renewed tokens inside Connaisseur (which in parts is already implemented by useKeychain), which takes a lot of codeing and time. don't like it.
  2. let Connaisseur read k8s secrets, instead of mounting them. this way Connaisseur will always pick up the newest version of a secret and not remount it after a restart. but this would also mean Connaisseur needs more permissions (ability to read secrets in Connaisseur namespace through API) and this doesn't feel like the way kubernetes intended things to work. so ideally i wanna pass on that too.
  3. your annotation option. i have zero issues with adding customizable annotation to the connaisseur deployment. we could have done that earlier, the use case for that just never popped up (until now).

so i guess i'll prepare a PR with customizable annotations.

@phbelitz , first prize is really to have Connaisseur be self contained. My thinking was that we can expose the customizable annotations as a temporary fix.

We explored this route of customizable annotations when auth.keychain was broken. So we thought to ourselves that, at least auth.secretName is partially working. How can we retrofit the partially working solution to work fully, given that we have minimal access to the codebase. Thats when the idea of using tools like reloader that read a deployment's annotation to restart a deployment when its secrets or configmaps change came in. It was as a workaround to give us full functionality given that we were now stuck, if we wanted to use the latest version as all auth mechanisms we had evaluated were having issues.

I propose that we can enable customizable annotations as a temporary solution to help people who go the auth.secretName route. Otherwise we should go all the way and allow auth.secretName mechanism to refresh tokens by itself. If not, it is better if we do not offer it as a authentication alternative because the complexities in troubleshooting the issues are very high. We will end up with people silently quitting and not using the product if we offer an option that requires other intricacies for it to fully work