sse-secure-systems / connaisseur

An admission controller that integrates Container Image Signature Verification into a Kubernetes cluster
https://sse-secure-systems.github.io/connaisseur/
Apache License 2.0

Better support for gitops based approaches #1220

Closed · dbbhat closed this issue 7 months ago

dbbhat commented 1 year ago

Describe the bug

There seems to be a limitation with Connaisseur when working with GitOps-based approaches. Connaisseur mutates the image: field of the Pod spec to <image-name>:<tag>@sha256:<digest>. Given how widely GitOps is used to deploy to Kubernetes clusters, this results in an infinite mutation <-> verification cycle, because the cluster state is always out of sync with the remote repo.
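
For illustration, the mutation is the kind of JSONPatch any mutating admission webhook returns in its AdmissionReview response; this is a generic, hypothetical example (image name and digest are placeholders, not Connaisseur's actual code):

import base64
import json

# Placeholder reference; the digest is whatever was verified by the webhook.
patch = [
    {
        "op": "replace",
        "path": "/spec/containers/0/image",
        "value": "example.com/app:1.2.3@sha256:<digest>",
    }
]

# Fragment of the AdmissionReview response carrying the mutation.
response_fragment = {
    "allowed": True,
    "patchType": "JSONPatch",
    "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
}
print(response_fragment)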

Expected behavior

Expected behavior is that Connaisseur supports a validation-only mode where the image reference in the pod spec is not mutated, OR that Connaisseur somehow handles GitOps-based approaches better.

Optional: Versions (please complete the following information as relevant):

Optional: Additional context

peterthomassen commented 1 year ago

Taking your suggestion further, let's say image:latest is validated by Connaisseur and the annotation is added accordingly. Just after, the :latest tag is updated to point to different image content than before, and then the container fires up, as the annotation makes it look like things are OK.

To continue in the proposed direction, this vulnerability would have to be addressed. How do you suggest doing that without mutating the image reference?

girishmg commented 1 year ago

@peterthomassen I am hoping that deployers are exercising caution and are not using image:latest in production environments. They are shooting themselves in the foot if they are doing so. In addition to not being able to roll back to previous versions, they will suffer from drift, with their workload running different code in different clusters.

The only option we see to get around this issue is to author Helm charts and the various pod-related specs with image: set to <name>:<tag>@sha256:.... That way Connaisseur will mutate the spec to the same value, which also matches what is present in the Git repo, and therefore Flux and Argo CD will be happy and not end up in the infinite cycle that @dbbhat captures in the issue above.

peterthomassen commented 1 year ago

@peterthomassen I am hoping that deployers are exercising caution and are not using image:latest in Production environments.

Sure; latest is just a placeholder. The race condition I described above can occur with any tag.

The only option we see to get around this issue is to author Helm charts and the various pod-related specs with image: set to <name>:<tag>@sha256:.... That way Connaisseur will mutate the spec to the same value, which also matches what is present in the Git repo, and therefore Flux and Argo CD will be happy and not end up in the infinite cycle that @dbbhat captures in the issue above.

In this case, you practically won't be able to use any of the advantages of tags, because you'd be pinning the digest; any time the tag is updated, even if only for a minor version upgrade, you would have to update your charts and specs. This does not seem to be a solution to @dbbhat's problem. (And in any case, if it were the solution, it would just be a matter of configuration and not a Connaisseur feature request.)

The question on how it could be addressed still stands.

phbelitz commented 1 year ago

Yeah, this is a tricky situation. Of course we could add a validation-only mode, but that would get rid of a major part of the security Connaisseur provides. With that you essentially only verify that there is some signature for the image present, not that the image you are actually deploying has been signed.

The annotation is problematic for the reason @peterthomassen mentioned. You open up a gateway for bypassing Connaisseur altogether, making Connaisseur somewhat useless at that point.

For an alternative approach, I have to get a bit more technical: when we receive the update request from the reconciliation, the request includes a new and an old object. The new one will have an image:tag reference as the Git repo defines it. The old one will have an image:tag@sha256:... reference, since this is the mutated reference that resulted from a Connaisseur validation and is what the resource currently looks like inside the cluster. What we could do is check whether the tags of the old and new one match and, if they do, simply skip validation and append the digest of the old one to the new one. This way no more requests are sent to the registry by Connaisseur and the validation process should be quite quick. Reconciliation will still happen every 5 min since the references are not in sync, but if I understood @dbbhat in #1203 correctly, this will happen anyway even if they are in sync. The big problem here is that this expects an image tag to always correspond to a single digest, somewhat defeating the purpose of the tag.

The fix for that alternative would be for Connaisseur to translate the image tag to a digest beforehand and check whether the new digest and the old digest match. If they do, skip validation; if they don't, validate. But that also means Connaisseur will again make requests to the registry every 5 min when the reconciliation happens.
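
A minimal sketch of what that digest comparison could look like, assuming the webhook sees both the old and the new object of the UPDATE request; the function names are illustrative, not Connaisseur's actual code:

def split_reference(image):
    # naive parsing of "name:tag@sha256:..." (registries with ports are ignored)
    base, _, digest = image.partition("@")
    name, _, tag = base.partition(":")
    return name, tag or "latest", digest or None

def shortcut_reference(new_image, old_image, resolve_digest):
    # resolve_digest maps "name:tag" to the digest the registry currently
    # serves for that tag (e.g. via a cheap HEAD request, see below)
    new_name, new_tag, new_digest = split_reference(new_image)
    _, _, old_digest = split_reference(old_image)

    if old_digest is None or new_digest is not None:
        return None  # nothing to compare, run full signature validation

    if resolve_digest(f"{new_name}:{new_tag}") == old_digest:
        # the tag still points at the digest that was validated when the old
        # object was admitted: skip validation and reuse that digest
        return f"{new_name}:{new_tag}@{old_digest}"
    return None  # the digest moved, full validation is required

# example: the tag still resolves to the already-validated digest
print(shortcut_reference(
    "redis:alpine",
    "redis:alpine@sha256:abc",
    lambda ref: "sha256:abc",
))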

In German we would say: "One has to die one death", which is equivalent to "to be between a rock and a hard place". I don't see a solution where all problems are solved, but maybe one of the mentioned ones suits you best?

dbbhat commented 1 year ago

@phbelitz as you mentioned, there's a huge problem if Connaisseur makes the assumption that there will be a 1:1 mapping of tag -> digest and always appends the same digest even if the image contents behind the tag have changed. And of course, for the other alternative proposed, we would have to make registry requests on every reconcile, which is not so desirable.

@girishmg @phbelitz some interesting findings here - https://github.com/stefanprodan/gitops-kyverno#mutating-deployments-with-kyverno and https://github.com/stefanprodan/gitops-kyverno#flux-vs-argo-drift-detection

According to the above,

Flux uses Kubernetes server-side apply dry-run to trigger the mutation webhooks, before it runs the drift detection algorithm. With Flux, you don't have to create special rules for each mutation made by admission controllers.

So it looks like Flux should be able to handle this situation and not detect a drift in case of image reference mutation by Connaisseur, since it is supposed to use server-side apply dry-run to trigger the mutation webhooks before it runs the drift detection.

From what I have observed, Connaisseur mutates resources in the case of a server-side dry-run, but not in a client-side dry-run, right? So it is supposed to work as-is, unless there is some issue?

phbelitz commented 1 year ago

Yes. It seems to be that way, but that also means that the Connaisseur validation will still occur every 5 min with Flux, as the server-side apply dry-run still triggers Connaisseur.

Maybe we can add a shortcut 🤔 When we get an update request with a new image without a digest and an old image with a digest, we try as quickly as possible to resolve the digest of the new image. I think the Docker registry API has a header field on one of its responses that returns the digest for an image. If the digest is the same as the old image's, validation is skipped, as the old digest is still up to date.

dbbhat commented 1 year ago

@phbelitz @girishmg yes, we verified that Flux does a dry-run on every reconcile and that invokes the Connaisseur webhook which in turn performs the mutation of the image reference.

Yes, the solution you suggest does sound like it could work. Alternatively, could we implement some kind of in-memory cache within Connaisseur that maintains the mapping of tag -> digest and returns the digest without verification, but only in the case of a dry-run UPDATE request? This would mean we are assuming that the mapping of tag -> digest never changes, but if we think about this in a system where the Connaisseur webhook does not exist:

So this behavior is what Connaisseur could mimic? i.e.

dryRunDigestPolicy: Always, Never, IfNotPresent

Since the cache is in-memory, Never and IfNotPresent are the same. Always becomes the current implementation. Downstream, for our use case, we would use the IfNotPresent option. This way, Connaisseur users who don't want the performance optimization can continue to use the default settings. How does this sound?
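
A hypothetical sketch of how such a dryRunDigestPolicy could behave; the option name and helpers come from the proposal above and do not exist in Connaisseur today:

# tag reference -> digest, filled only by successful validations
_digest_cache = {}

def dry_run_digest(image, policy, validate):
    # validate performs full signature validation and returns the verified digest
    if policy == "Always":
        return validate(image)  # current behaviour: validate on every dry-run
    if policy in ("IfNotPresent", "Never"):
        # with a purely in-memory cache both options behave the same, as noted above
        if image in _digest_cache:
            return _digest_cache[image]
        digest = validate(image)
        _digest_cache[image] = digest
        return digest
    raise ValueError(f"unknown dryRunDigestPolicy: {policy}")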

phbelitz commented 1 year ago

Hm. I have one problem with that: you can have multiple pods of Connaisseur running, each of which would have its own cache, and depending on which pod gets triggered, you might get different results.

Obviously we could introduce extra external storage (e.g. Redis), but that needs to be sufficiently protected, since the security functionality could be bypassed if it were controlled by an attacker. That's why I'd prefer a stateless solution.

alaypatel07 commented 1 year ago

Hm. I have one problem with that: You can have multiple pods of Connaisseur running, each one of them would have their own cache and depending on the pod that gets triggered, you might get different results.

@phbelitz this is an interesting point, but can we handle the drift between the different pods' caches via monitoring/alerting or documentation?

The feature suggested by @dbbhat is specifically a performance optimization for dry-run requests. In my opinion, for users to benefit from this performance optimization, they need to:

  1. make a trade-off with the tag 1:n feature, i.e. their environments should have a 1:1 tag:digest mapping (performance optimization without this assumption is a different monster of an issue altogether and should be discussed separately)
  2. in emergency cases where deployed production images need to be updated with a different digest on the same tag, perform a manual action (for example, a rolling restart of Connaisseur) to make sure the caches are reset.

For all other use cases Connaisseur will work as-is, out of the box. If and when Connaisseur does indeed make a performance optimization where all requests (including dry-run) can be handled with no significant performance overhead, this policy can be deprecated.

However, we believe there are several technical challenges that make the performance optimization for all requests hard. Until then, I wonder if we can consider implementing this policy-based opt-in approach. We can volunteer to help with the implementation of this feature if that is acceptable.

alaypatel07 commented 1 year ago

Maybe we can add a shortcut 🤔 When we get an update request with a new image without digest and an old image with digest, we try as quick as possible to resolve the digest of the new image.

@phbelitz there is one problem with this workflow: @dbbhat and I verified with API server audit logs that when the update request comes in from the CI/CD system, it does not have the resolved digest, since the digest is added by the mutating webhook; the request only contains the image tag.

I think the Docker API has a header field on one of its requests that returns the digest for an image. If the digest is the same to the old image, validation is skipped, as the old digest is still up-to-date.

If the digest is not present in the incoming request, then how does Connaisseur know whether the digest returned by Docker is older or newer, in order to skip validation?

phbelitz commented 1 year ago

@alaypatel07 you can do a HEAD request to the registry. Here is a sample Python script (it has to be this complicated because of authentication ☹️):

import datetime
import os
from urllib.parse import quote, urlencode

import jwt
import requests
from requests.auth import HTTPBasicAuth

class Registry:
    def __init__(self, host, auth=None) -> None:
        self.host = host
        self.auth = auth
        self.tokens = {}

    def token(self, repo):
        # reuse the cached bearer token for this repository while it is valid
        if token := self.tokens.get(repo):
            exp = jwt.decode(token, options={"verify_signature": False})["exp"]
            if datetime.datetime.fromtimestamp(exp) > datetime.datetime.now():
                return token

        # otherwise fetch a fresh pull-scoped token from the auth service
        params = {"service": "registry.docker.io", "scope": f"repository:{repo}:pull"}
        url = f'{quote("https://auth.docker.io/token", safe="/:")}?{urlencode(params, safe="/:")}'

        r = requests.get(url, auth=HTTPBasicAuth(*self.auth), timeout=30)
        r.raise_for_status()

        token = r.json()["token"]
        self.tokens[repo] = token  # cache it so subsequent requests can reuse it
        return token

    def request(self, method, repo, ref, content_type):
        kwargs = {
            "method": method,
            "url": f"https://{self.host}/v2/{repo}/{content_type}/{ref}",
            "headers": {
                "Authorization": f"Bearer {self.token(repo)}",
                "Accept": "application/vnd.docker.distribution.manifest.list.v2+json",
            },
            "timeout": 20,
        }

        print(f"Requesting with method {method} url: {kwargs['url']}")

        r = requests.request(**kwargs)
        r.raise_for_status()

        return r

    def request_manifest(self, method, repo, ref):
        return self.request(method, repo, ref, "manifests")

    def get_digest_from_tag(self, repo: str, ref: str):
        # a HEAD request is enough: the registry returns the manifest digest
        # in the Docker-Content-Digest response header
        return self.request_manifest("head", repo, ref).headers["docker-content-digest"]

if __name__ == "__main__":
    host = "index.docker.io"
    auth = (os.environ["DOCKER_USER"], os.environ["DOCKER_PASSWORD"])

    registry = Registry(host, auth)
    digest = registry.get_digest_from_tag("library/redis", "alpine")
    print(digest)

That's still a request to the registry, but one that creates the least amount of load, at least compared to doing full signature validation.

alaypatel07 commented 1 year ago

@phbelitz thanks for sharing the sample script, but that still leaves the question open from my previous comment.

Assuming you get the digest from the registry via the HEAD request, how does Connaisseur find out whether the received digest has been validated previously or not? Can you please summarize in a few steps how you propose to use the logic from the sample script to modify the webhook workflow?

phbelitz commented 1 year ago

@alaypatel07 Ok, Flux/ArgoCD will try to reconcile the Git repo and the deployed k8s resource. No matter whether this is done as a dry-run (which will be the case for Flux) or not, a kubectl apply will be done (either by Flux or ArgoCD). Internally that means Connaisseur will get an AdmissionReview from the kube API stating that a resource will be updated. In this AdmissionReview a new and an old object are specified, each of which has an image reference in it. The new one corresponds to the image in the Git repo, the old one to the deployed k8s resource and thus to an already mutated reference (mutated by Connaisseur).

If we take the new image reference with a tag, which most likely will not have a digest (since the Git repo didn't specify one), we can use the HEAD request to quickly translate the tag to a digest. This will be the most up-to-date digest for the given image and tag. If this digest is the same as the digest of the old image, the image did not change in the meantime. Since the old image corresponds to a resource that is already deployed, it must have gone through image validation and thus through Connaisseur, so we can skip validation and mutate the new image with the digest.

With Flux in place, no drift should be detected afterwards.
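
A rough sketch of how that flow could be wired up, reusing the Registry class from the script above; the Pod object layout is what an AdmissionReview provides, and the helper names (and the simplified image parsing) are illustrative assumptions, not Connaisseur's actual code:

def container_image(obj):
    # first container image of a Pod-like object from the AdmissionReview
    return obj["spec"]["containers"][0]["image"]

def handle_update(request, registry, validate):
    new_image = container_image(request["object"])
    old_image = container_image(request["oldObject"])
    _, _, old_digest = old_image.partition("@")

    if old_digest and "@" not in new_image:
        # simplified parsing, e.g. "library/redis:alpine"; registries with
        # ports would need proper reference parsing
        repo, _, tag = new_image.partition(":")
        current_digest = registry.get_digest_from_tag(repo, tag or "latest")
        if current_digest == old_digest:
            # the tag still resolves to the digest validated on admission:
            # skip signature validation and reuse the old digest
            return f"{new_image}@{old_digest}"

    # otherwise run the full signature validation, which produces the
    # mutated (digest-pinned) reference
    return validate(new_image)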

alaypatel07 commented 1 year ago

@phbelitz thanks for the clarification, I can see how this will work for the given use case.

phbelitz commented 1 year ago

@dbbhat @alaypatel07 Hoy, there were some more discussions on this matter, but this is the way we will handle things going forward:

  1. There will be a validation mode flag for the policy rules that controls whether images are only validated or also mutated. The default will be mutation, since validation-only has serious security implications.
  2. There will be a gitops feature option inside the values.yaml. This option will support 2 modes:
    1. resolve: this will be the behavior I described above, where the tag-to-digest translation happens before any signature validation, to hopefully skip the process should it not be necessary.
    2. cache: we will introduce a redis store alongside Connaisseur, which can store tag-to-digest translations. Before signature validation, the cache is consulted to maybe skip validation altogether, should there be a hit. The cache entries will have expiry times, which can be configured via a ttl field (a rough sketch follows below).
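
A hypothetical sketch of what the cache mode might look like; the key layout, Redis service name and TTL handling are assumptions, not the final implementation:

import redis

class TagDigestCache:
    # external, shared tag -> digest cache so that all Connaisseur pods see
    # the same state (unlike a per-pod in-memory cache)
    def __init__(self, url="redis://connaisseur-redis:6379/0", ttl=3600):
        self.client = redis.Redis.from_url(url)
        self.ttl = ttl  # would correspond to the proposed ttl field

    def get(self, image_ref):
        value = self.client.get(f"digest:{image_ref}")
        return value.decode() if value else None

    def put(self, image_ref, digest):
        # expiring entries bound how long a stale tag -> digest mapping survives
        self.client.set(f"digest:{image_ref}", digest, ex=self.ttl)

def digest_with_cache(image_ref, cache, validate):
    # consult the cache first; only fall back to full signature validation on a miss
    if digest := cache.get(image_ref):
        return digest
    digest = validate(image_ref)
    cache.put(image_ref, digest)
    return digest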

That's the theory; in practice this will take some time as our resources are quite low at the moment :(

phbelitz commented 1 year ago

@dbbhat does disabling automaticChildApproval actually solve your use case ...? If set to false, Connaisseur will only trigger on Pods and ignore any Deployments, ReplicaSets etc. Your Git repositories surely only define the higher-level resources (and not the Pods themselves), which then will never be mutated by Connaisseur, and thus no diff between the repo and the remote resources arises in the first place. The Pods themselves still get mutated and use the verified digests, so no security is lost.

I'd be curious if this covers all your problems, or whether you still do need some verification on Deployments etc.

Starkteetje commented 7 months ago

Hi @dbbhat @alaypatel07, as @phbelitz foretold a while ago, we've now introduced a resourceValidationMode with the latest release 3.4.0. If set to podsOnly, it will only apply mutation on the Pod level, so if you have a Deployment as your initial definition, that won't be mutated, while its Pods are still validated and mutated, which should solve the infinite cycle.

If that doesn't work (e.g., because you use Pods instead of Deployments/ReplicaSets), we've included an escape hatch:

The policy rules now support a with.mode option that can be set to mutate or insecureValidateOnly, allowing the mutation of the image reference to be toggled on and off (the default is mutate, meaning references will be mutated; the alternative is considered insecure since it implies that while a trusted image is available, its use is not guaranteed 🤷).

Note the insecure prefix and the comment. If we only validate the image signature, that means there exists a signed image. Since Connaisseur never interacts with the actually pulled images, it cannot guarantee that the pulled image is the signed one. A malicious registry or machine-in-the-middle (breaking your transport security) would be able to serve an unsigned image. So use this only if you have to, and with appropriate consideration.

If that doesn't solve your issue, feel free to reopen this issue. Kind regards