Taking your suggestion further, let's say `image:latest` is validated by Connaisseur and the annotation is added accordingly. Just after, the `:latest` tag is updated to point to different image content than before, and then the container fires up, because the annotation makes it look like things are OK.
To continue in the proposed direction, this vulnerability would have to be addressed. How do you suggest doing that without mutating the image reference?
@peterthomassen I am hoping that deployers are exercising caution and are not using `image:latest` in production environments. They are shooting themselves in the foot if they are doing so. In addition to not being able to roll back to previous versions, they are going to suffer from drift, with their workloads running different code in different clusters.
The only option we see to get around this issue is to author Helm charts and the various pod-related specs with `image:` set to `<name>:<tag>@sha256:...`. That way Connaisseur will mutate the spec to the same value, which also matches what is present in the git repo, so Flux and Argo CD will be happy and not end up in the infinite cycle that @dbbhat captures in the issue above.
> @peterthomassen I am hoping that deployers are exercising caution and are not using `image:latest` in production environments.

Sure; `latest` is just a placeholder. The race condition I described above can occur with any tag.
> The only option we see to get around this issue is to author Helm charts and the various pod-related specs with `image:` set to `<name>:<tag>@sha256:...`. That way Connaisseur will mutate the spec to the same value, which also matches what is present in the git repo, so Flux and Argo CD will be happy and not end up in the infinite cycle that @dbbhat captures in the issue above.

In this case, you practically won't be able to use any of the advantages of tags, because you'd be pinning the sha; any time the tag is updated, even if only for a minor version upgrade, you would have to update your charts and specs. This does not seem to be a solution to @dbbhat's problem. (And in any case, if it were the solution, it would just be a matter of configuration and not a Connaisseur feature request.)
The question of how this could be addressed still stands.
Yeah, this is a tricky situation. Of course we could add a validation-only mode, but that would get rid of a major part of the security Connaisseur provides. With that, you essentially only verify that there is some signature for the image, not that the image you are actually deploying has been signed.
The annotation is problematic for the reason @peterthomassen mentioned. You open up a gateway for bypassing Connaisseur altogether, making Connaisseur somewhat useless at that point.
For an alternative I have to get a bit more technical: when we receive the update request from the reconciliation, the request includes a new and an old object. The new one will have an `image:tag` reference as the git repo defines it. The old one will have an `image:tag@sha256:...` reference, since this is the mutated reference that resulted from a Connaisseur validation and is what the resource currently has inside the cluster. What we could do is check whether the tags of the old and new object match and, if they do, simply skip validation and append the digest of the old one to the new one. This way no more requests are sent to the registry by Connaisseur and the validation process should be quite quick. Reconciliation will still happen every 5 min since the references are not in sync, but if I understood @dbbhat in #1203 correctly, this will happen anyway even if they are in sync. The big problem here is that this expects an image tag to always correspond to a single digest, somewhat defeating the purpose of the tag.
The fix for that alternative would be for Connaisseur to translate the image tag to a digest beforehand and check whether the new and old digests match. If they do, skip validation; if they don't, validate. But that also means Connaisseur will again make requests to the registry every 5 min when the reconciliation happens.
In German we would say "one has to die one death", which is equivalent to "to be between a rock and a hard place". I don't see a solution where all problems are solved, but maybe one of the mentioned ones suits you best?
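For illustration, here is a minimal sketch of that first alternative, the tag-matching shortcut, assuming a hypothetical helper layout (`parse_image_ref`, `validate_and_mutate`); this is not Connaisseur's actual code, just the idea in Python:

```python
# Sketch of the proposed tag-matching shortcut for UPDATE requests.
# All names here are hypothetical placeholders, not Connaisseur internals.

def parse_image_ref(ref: str):
    """Split 'name:tag@sha256:...' into (name, tag, digest); simplified parsing."""
    name_tag, _, digest = ref.partition("@")
    name, sep, tag = name_tag.rpartition(":")
    if not sep or "/" in tag:  # no tag given, or the colon belonged to a registry port
        name, tag = name_tag, "latest"
    return name, tag, digest or None


def validate_and_mutate(image: str) -> str:
    # Placeholder for the regular signature validation + digest mutation path.
    raise NotImplementedError("full Connaisseur validation would run here")


def handle_update(new_image: str, old_image: str) -> str:
    new_name, new_tag, new_digest = parse_image_ref(new_image)
    old_name, old_tag, old_digest = parse_image_ref(old_image)

    # The old object is already deployed, so its digest has passed validation.
    # If name and tag are unchanged, reuse that digest and skip validation.
    if (new_name, new_tag) == (old_name, old_tag) and old_digest and not new_digest:
        return f"{new_name}:{new_tag}@{old_digest}"

    # Otherwise fall back to full validation.
    return validate_and_mutate(new_image)
```

The obvious caveat is the one mentioned above: as long as the tag matches, a changed digest behind that tag would go unnoticed.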
@phbelitz as you mentioned, there's a huge problem if Connaisseur assumes a 1:1 mapping of tag -> digest and always appends the same digest even if the image contents behind the tag have changed. And of course, for the other proposed alternative, we would have to make requests to the registry on every reconcile, which is not so desirable.
@girishmg @phbelitz some interesting findings here: https://github.com/stefanprodan/gitops-kyverno#mutating-deployments-with-kyverno and https://github.com/stefanprodan/gitops-kyverno#flux-vs-argo-drift-detection
According to the above:

> Flux uses Kubernetes server-side apply dry-run to trigger the mutation webhooks, before it runs the drift detection algorithm. With Flux, you don't have to create special rules for each mutation made by admission controllers.
So it looks like Flux should be able to handle this situation and not detect a drift in case of image reference mutation by Connaisseur, since it is supposed to use a server-side apply dry-run to trigger the mutation webhooks before it runs the drift detection.
From what I have observed, Connaisseur mutates resources in case of a server-side dry-run, but not in a client-side dry-run, right? So it is supposed to work as-is, unless there is some issue?
Yes, it seems to be that way, but that also means that the Connaisseur validation will still occur every 5 min with Flux, as the server-side apply dry-run still triggers Connaisseur.
Maybe we can add a shortcut 🤔 When we get an update request with a new image without a digest and an old image with a digest, we try to resolve the digest of the new image as quickly as possible. I think the Docker registry API returns the digest for an image in a header field on one of its requests. If the digest is the same as the old image's, validation is skipped, as the old digest is still up to date.
@phbelitz @girishmg yes, we verified that Flux does a dry-run on every reconcile and that invokes the Connaisseur webhook which in turn performs the mutation of the image reference.
Yes, the solution you suggest does sound like it could work. Alternatively, could we implement some kind of in-memory cache within Connaisseur that would maintain the tag -> digest mapping and return the digest without verification only in case of a dry-run UPDATE request? This would mean we are assuming that the mapping of tag -> digest never changes, but think about how a system without the Connaisseur webhook behaves: the kubelet does not pull the image again if it is already present on the node and the pull policy is `IfNotPresent`. It always pulls if the policy is `Always`, and never pulls if the policy is `Never`.
So this behavior is what Connaisseur could mimic, i.e. `dryRunDigestPolicy: Always | Never | IfNotPresent`. Since the cache is in-memory, `Never` and `IfNotPresent` are effectively the same, and `Always` becomes the current implementation. Downstream, for our use-case, we would use the `IfNotPresent` option. This way, Connaisseur users who don't want the performance optimization can continue to use the default settings. How does this sound?
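A rough sketch of what such an in-memory cache could look like; `dryRunDigestPolicy` is the option proposed above (it does not exist in Connaisseur today) and `resolve_digest_from_registry` stands in for the existing digest resolution:

```python
# Hypothetical in-memory tag -> digest cache honoring the proposed
# dryRunDigestPolicy values (Always / Never / IfNotPresent).

from typing import Callable


class DryRunDigestCache:
    def __init__(self, policy: str = "Always"):
        self.policy = policy
        self._cache: dict[str, str] = {}  # "name:tag" -> "sha256:..."

    def digest_for(self, image: str, resolve_digest_from_registry: Callable[[str], str]) -> str:
        # "Always": current behavior, resolve (and validate) on every request.
        if self.policy == "Always":
            return resolve_digest_from_registry(image)

        # "Never" / "IfNotPresent": reuse a cached digest for dry-run UPDATEs.
        # With an in-memory cache both behave the same, since a fresh pod
        # starts with an empty cache and has to resolve at least once.
        if image in self._cache:
            return self._cache[image]

        digest = resolve_digest_from_registry(image)
        self._cache[image] = digest
        return digest
```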
Hm. I have one problem with that: You can have multiple pods of Connaisseur running, each one of them would have their own cache and depending on the pod that gets triggered, you might get different results.
Obviously we could introduce an extra external storage (e.g. a redis), but that needs to be sufficiently protected, since security functionality could be bypassed if controlled by an attacker. That's why I'd prefer a stateless solution.
> Hm. I have one problem with that: You can have multiple pods of Connaisseur running, each one of them would have their own cache and depending on the pod that gets triggered, you might get different results.

@phbelitz this is an interesting point, but can we handle the drift between the different pods' caches via monitoring/alerting or documentation?
The feature suggested by @dbbhat is specifically a performance optimization for dry-run requests. In my opinion, for users to benefit from this performance optimization, they need to explicitly opt in.
For all other use cases Connaisseur will work as-is out of the box. If and when Connaisseur does indeed implement a performance optimization where all (including dry-run) requests can be handled with no significant overhead, this policy can be deprecated.
However, we believe there are several technical challenges which make the performance optimization for all requests hard. Hence, until then, I wonder if we can consider implementing this policy-based opt-in approach. We can volunteer to help with the implementation of this feature if that is acceptable.
> Maybe we can add a shortcut 🤔 When we get an update request with a new image without a digest and an old image with a digest, we try to resolve the digest of the new image as quickly as possible.

@phbelitz there is one problem with this workflow: @dbbhat and I verified with apiserver audit logs that when the update request comes in from the CI/CD system, it does not have the resolved digest sha; since the digest is added by the mutating webhook, the request only contains the image tag.

> I think the Docker registry API returns the digest for an image in a header field on one of its requests. If the digest is the same as the old image's, validation is skipped, as the old digest is still up to date.

If the digest is not present in the incoming request, then how does Connaisseur know whether the digest returned by Docker is older or newer, in order to skip validation?
@alaypatel07 you can do a HEAD request to the registry. Here's a sample Python script (it has to be this complicated because of authentication ☹️):
```python
import datetime
import os
from urllib.parse import quote, urlencode

import jwt
import requests
from requests.auth import HTTPBasicAuth


class Registry:
    def __init__(self, host, auth=None) -> None:
        self.host = host
        self.auth = auth
        self.tokens = {}  # per-repository token cache

    def token(self, repo):
        # Reuse a cached token as long as it has not expired yet.
        if token := self.tokens.get(repo):
            exp = jwt.decode(token, options={"verify_signature": False})["exp"]
            if datetime.datetime.fromtimestamp(exp) > datetime.datetime.now():
                return token
        params = {"service": "registry.docker.io", "scope": f"repository:{repo}:pull"}
        url = f'{quote("https://auth.docker.io/token", safe="/:")}?{urlencode(params, safe="/:")}'
        r = requests.get(url, auth=HTTPBasicAuth(*self.auth), timeout=30)
        r.raise_for_status()
        token = r.json()["token"]
        self.tokens[repo] = token
        return token

    def request(self, method, repo, ref, content_type):
        kwargs = {
            "method": method,
            "url": f"https://{self.host}/v2/{repo}/{content_type}/{ref}",
            "headers": {
                "Authorization": f"Bearer {self.token(repo)}",
                "Accept": "application/vnd.docker.distribution.manifest.list.v2+json",
            },
            "timeout": 20,
        }
        print(f"Requesting with method {method} url: {kwargs['url']}")
        r = requests.request(**kwargs)
        r.raise_for_status()
        return r

    def request_manifest(self, method, repo, ref):
        return self.request(method, repo, ref, "manifests")

    def get_digest_from_tag(self, repo: str, ref: str):
        # A HEAD request on the manifest endpoint returns the digest in the
        # Docker-Content-Digest header without downloading the manifest itself.
        return self.request_manifest("head", repo, ref).headers["docker-content-digest"]


if __name__ == "__main__":
    host = "index.docker.io"
    auth = (os.environ["DOCKER_USER"], os.environ["DOCKER_PASSWORD"])
    registry = Registry(host, auth)

    digest = registry.get_digest_from_tag("library/redis", "alpine")
    print(digest)
```
That's still a request to the registry, but one that creates the least amount of load, at least compared to doing full signature validation.
@phbelitz thanks for sharing the sample script, but that still leaves open the question from my previous comment.
Assuming you get the digest from the registry via the HEAD request, how does Connaisseur find out whether the received digest was previously validated or not? Can you please summarize in a few steps how you propose to use the logic from the sample script to modify the webhook workflow?
@alaypatel07 Ok, Flux/ArgoCD will try to reconcile the git repo and the deployed k8s resource. No matter whether this is done as a dry-run (this will be the case for Flux) or not, a `kubectl apply` will be done (either by Flux or ArgoCD). Internally that means Connaisseur will get an AdmissionReview from the kubeAPI stating that a resource will be updated. In this AdmissionReview a new and an old object are specified, each of them containing an image reference. The new one corresponds to the image in the git repo, the old one to the deployed k8s resource and thus an already mutated one (mutated by Connaisseur). If we take the new image reference with a tag, which most likely will not have a digest (since the git repo didn't specify one), we can use the HEAD request to quickly translate the tag to a digest. This will be the most up-to-date digest for the given image and tag. If this digest is the same as the digest of the old image, the image did not change in the meantime. Since the old image corresponds to a resource that is already deployed, it must have gone through image validation and thus through Connaisseur, so we can skip validation and mutate the new image with the digest.
With Flux in place, no drift should be detected afterwards.
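To connect this back to the sample script above, here is a rough sketch of how the shortcut could sit in the webhook's UPDATE handling. The AdmissionReview field access is simplified (it assumes a single-container Pod spec), and `run_full_validation` is a hypothetical placeholder, not Connaisseur's actual API:

```python
# Sketch of the proposed shortcut, reusing the Registry class from the
# sample script above. Field access and helper names are illustrative only.

def run_full_validation(image: str) -> str:
    # Placeholder for Connaisseur's regular validation + mutation path.
    raise NotImplementedError


def admit_update(admission_review: dict, registry: "Registry") -> str:
    request = admission_review["request"]
    new_image = request["object"]["spec"]["containers"][0]["image"]
    old_image = request["oldObject"]["spec"]["containers"][0]["image"]

    # The old object is already deployed, so its digest has passed validation.
    old_digest = old_image.partition("@")[2] or None

    if old_digest and "@" not in new_image:
        repo, _, tag = new_image.partition(":")  # simplified reference parsing
        # One cheap HEAD request: the current digest behind the new tag.
        current_digest = registry.get_digest_from_tag(repo, tag or "latest")
        if current_digest == old_digest:
            # The tag still points at the already-validated digest, so skip
            # validation and just mutate the new reference to match the old one.
            return f"{new_image}@{old_digest}"

    # Digest changed (or could not be compared): run full signature validation.
    return run_full_validation(new_image)
```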
@phbelitz thanks for clarification, I can see how this will work for the given use-case.
@dbbhat @alaypatel07 Hoy, there were some more discussions on this matter, but this is the way we will handle things going forward: we'll introduce a `gitops` feature option inside the `values.yaml`. This option will support 2 modes:

- `resolve`: this will be the behavior I described above, where the tag -> digest translation happens before any signature validation, hopefully allowing the process to be skipped should it not be necessary.
- `cache`: we will introduce a Redis store alongside Connaisseur, which can store tag -> digest translations. Before signature validation, the cache is consulted to maybe skip validation altogether, should there be a hit. The cache entries will have expiry times, which can be configured via a `ttl` field.

That's the theory; in practice this will take some time, as our resources are quite low at the moment :(
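As a rough illustration of the `cache` mode, here is a sketch of a Redis-backed tag -> digest cache with a TTL, using the `redis` Python client; the key layout and class name are assumptions, not the actual Connaisseur design:

```python
# Hypothetical sketch of the proposed "cache" mode: a Redis store for
# tag -> digest translations with a configurable TTL.
import redis


class DigestCache:
    def __init__(self, host: str = "localhost", port: int = 6379, ttl: int = 300):
        self.client = redis.Redis(host=host, port=port)
        self.ttl = ttl  # seconds until a cached translation expires

    def get(self, image_with_tag: str) -> "str | None":
        value = self.client.get(f"digest:{image_with_tag}")
        return value.decode() if value else None

    def put(self, image_with_tag: str, digest: str) -> None:
        # setex stores the value with an expiry, so stale translations age out
        # and a re-tagged image is re-validated after the TTL.
        self.client.setex(f"digest:{image_with_tag}", self.ttl, digest)
```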
@dbbhat does disabling `automaticChildApproval` actually solve your use-case ...? If set to `false`, Connaisseur will only trigger on Pods and ignore any Deployments, ReplicaSets etc. Your git repositories surely only define the overlaying resources (and not the Pods themselves), which will then never be mutated by Connaisseur, and thus no diff between the repo and the remote resources arises in the first place. The Pods themselves still get mutated and use the verified digests, so no security is lost.
I'd be curious whether this covers all your problems, or whether you still need some verification on Deployments etc.
Hi @dbbhat @alaypatel07
as @phbelitz foretold a while ago, we've now introduced a `resourceValidationMode` with the latest release 3.4.0. If set to `podsOnly`, it will only apply mutation on the Pod level, so if you have a Deployment as your initial definition, that won't be mutated, while its Pods are still validated and mutated, which should solve the infinite cycle.
If that doesn't work (e.g., because you use Pods instead of Deployments/ReplicaSets), we've included an escape hatch:
> The policy rules now support a `with.mode` option that can be set to `mutate` or `insecureValidateOnly`, allowing the mutation of the image reference to be toggled on and off (the default is `mutate`, meaning references will be mutated; the alternative is considered insecure since it implies that while a trusted image is available, its use is not guaranteed 🤷).

Note the insecure prefix and the comment: if we only validate the image signature, that means there exists a signed image. Since Connaisseur never interacts with the actually pulled images, it cannot guarantee that the pulled image is the signed one. A malicious registry or a machine-in-the-middle (breaking your transport security) would be able to serve an unsigned image. So use this only if you have to, and with appropriate consideration.
If that doesn't solve your issue, feel free to reopen this issue. Kind regards
Describe the bug
There seems to be a limitation with Connaisseur when working with GitOps-based approaches. Connaisseur mutates the `image:` field of the Pod spec to `<image-name>:<tag>@sha256:<digest>`, which is problematic given how widely GitOps is used to deploy to Kubernetes clusters. This results in an infinite mutation <-> verification cycle because the cluster state is always out of sync with the remote repo.
Expected behavior
The expected behavior is that Connaisseur supports a validation-only mode where the image reference in the pod spec is not mutated, OR that Connaisseur somehow handles GitOps-based approaches better.