Hi @qx121, I see you're still running version 3.0.0. We know the exception and have since released fixes that should prevent this issue from popping up (or at least significantly reduce its occurrence), so upgrading should resolve it. (Also, we're working on a Golang rewrite that should bring additional performance/stability improvements, which we hope to ship in the not-too-distant future.)
As for timeout errors: The above error is due to a bug in how we used the asyncio event loop. The exception fails the validation, which should show up as an "unknown error" from the admission request (as there's no dedicated handling for this error). However, the exception should be handled in that generic sense by the server, so it should a) be a generic error, not a timeout error and b) not affect other requests. The only way I could see it actually causing timeouts would be if the exception happened within the server as opposed to our application (thus bypassing its generic error handling), but neither Flask nor cheroot use asyncio, so that shouldn't be the case either.
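For illustration, a minimal, self-contained sketch (hypothetical names, not Connaisseur's actual code) of the kind of generic handling described above, where an unexpected exception fails only that one admission request with a generic message instead of taking down the server:

```python
import asyncio

class ValidationError(Exception):
    """Errors with dedicated handling and a specific message."""

async def validate(request: dict) -> dict:
    # stand-in for the validation raising an unexpected exception
    raise RuntimeError("Event loop is closed")

async def handle_admission(request: dict) -> dict:
    try:
        return await validate(request)
    except ValidationError as err:
        # dedicated handling: the admission response carries a specific reason
        return {"allowed": False, "message": str(err)}
    except Exception:
        # no dedicated handling: the request fails with a generic "unknown error";
        # other requests and the server itself are unaffected
        return {"allowed": False, "message": "unknown error"}

print(asyncio.run(handle_admission({"kind": "Pod", "name": "test"})))
```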
I'd suggest you upgrade to Connaisseur's latest version and see whether that fixes your problem with timeouts as well. If not, it'd be interesting to know whether the Pods have restarts when the timeouts happen (indicating that the server is affected) or whether the timeouts coincide with another (or the same) error.
@Starkteetje Thanks for the detailed clarification! We would like to fix the timeout issue, which directly affects service performance (e.g. when it times out, pod creation is affected). It is good to know that the event loop issue doesn't cause the timeouts.
Connaisseur itself looked good during the incident. CPU usage was fine (it happened around 23:00 in the following graph; we deleted the connaisseur-webhook shortly after receiving the alert), and no Connaisseur pod restarts were observed.
Is it possible that, for some reason, Connaisseur is not reachable over the network? Also wondering whether Connaisseur logs any errors when a validation request never reaches it or is denied.
On default log level, Connaisseur should log

{"timestamp": "2024-01-30 08:23:38.334711", "message": "successful verification of image \"docker.io/library/redis:latest\"", "taskName": "Task-24", "admission_review": {"user": "kubernetes-admin", "operation": "CREATE", "kind": "Pod", "name": "test", "namespace": "default"}, "image": "docker.io/library/redis:latest"}

or

{"timestamp": "2024-01-30 08:08:13.223096", "message": "{'message': 'Unable to get root trust data from dockerhub-basics due to faulty authentication.', 'context': {'notary_name': 'dockerhub-basics', 'detection_mode': False, 'admission_review': {'user': 'kubernetes-admin', 'operation': 'CREATE', 'kind': 'Pod', 'name': 'test', 'namespace': 'default'}, 'image': 'docker.io/securesystemsengineering/some-private-image:latest'}}", "taskName": "Task-1"}

or

{"timestamp": "2024-01-30 08:29:34.847554", "message": "Traceback (most recent call last):\n File \"/app/connaisseur/flask_application.py\", line 123, in __async_mutate\n response = await __admit(admission_request, session)\n ... raise TimeoutError from exc_val\nTimeoutError\n", "taskName": "Task-1"}

The last one gives you the traceback and thus at least some information on where the validation got stuck; that case will show up in the k8s API feedback as

Error from server: admission webhook "connaisseur-svc.connaisseur.svc" denied the request: couldn't retrieve the necessary trust data for verification within 30s. most likely there was a network failure. check connectivity to external servers or retry

In addition, every request to the /mutate endpoint is logged (e.g.

{"timestamp": "2024-01-30 08:08:13.224168", "taskName": null, "client_ip": "10.244.0.1", "method": "POST", "path": "/mutate", "query": "timeout=30s", "protocol": "HTTP/1.1", "status_code": "200"}

) irrespective of the time it took, so that entry appears even if the validation were to exceed the 30s and thus lead to a timeout error at the k8s endpoint. From these alone you should be able to discern whether Connaisseur received the admission request. Otherwise, you can also switch to debug log level, which will additionally log the incoming admission request; that would help to isolate the images that lead to timeouts for you.
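For example, a small helper along these lines (not part of Connaisseur; the kubectl invocation in the comment is just an assumption about your setup) can sift the JSON log lines for validation results and tracebacks:

```python
# Usage sketch: kubectl logs deploy/connaisseur-deployment -n connaisseur | python3 sift_logs.py
import json
import sys

for line in sys.stdin:
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip anything that isn't a JSON log line
    message = str(entry.get("message", ""))
    if "successful verification" in message:
        print("OK   ", entry.get("timestamp"), entry.get("image"))
    elif "Traceback" in message or "TimeoutError" in message:
        print("ERROR", entry.get("timestamp"))
        print(message)
```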
However, Connaisseur should have been reachable. If the server had died, the readiness and health probes would have failed and you would have seen Pod restarts. Thus, I'd presume the servers were alive. For connectivity to break, the options I see are that either the webhook, the certificate, the service or the network would have needed a change or disturbance. Connaisseur doesn't change its resources on the fly, so unless there was some manual change, neither the webhook nor the service should have changed. The TLS certificate could have expired and thus failed, but that would be the explicit reason stated by the k8s API on admission failure (and also wouldn't lead to a timeout, but to a failure). As for the network inside the cluster, I'd guess that if that had failed, you'd have seen far more things failing than just Connaisseur, so that seems unlikely to me.
Some more things that might be interesting for getting to the root of this issue:
@Starkteetje we are not seeing timeout/denied-request entries in the Connaisseur logs. The images are internal ones, and no deployment occurred at the time of the incident.
I think with the above info we can narrow the issue down to connectivity to Connaisseur rather than the Connaisseur pod itself? I checked the certs; they are generated with 100 years of expiration, so they should be fine. Wondering whether the api-server or any network component would log this type of connectivity problem?
I retrieved logs from multiple clusters. No additional error messages were found. The "event loop" error during the incident is the only Connaisseur error I have seen.
Hmm, that's not really much to go on. I'd recommend that you perform the Connaisseur upgrade and check whether this happens again, since the event loop failures are the only data we have so far, even though I can't see them being connected. As for additional logs, I'm not deeply into the administration of k8s, but I'd say your best bet is the api server or kubelet logs.
@Starkteetje Thank you! We can plan an upgrade to see if it eases the situation. I think v3.3.0 is the suggested new version for upgrade?
Actually v3.3.3 is the newest release, so I'd recommend you upgrade to that
got it. will pick up v3.3.3
Oh, I think I found the error that directly corresponds to the timeout issue:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/subprocess.py", line 198, in communicate
    stdin, stdout, stderr = await tasks.gather(stdin, stdout, stderr)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/subprocess.py", line 178, in _read_stream
    output = await stream.read()
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 683, in read
    block = await self.read(self._limit)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 690, in read
    await self._wait_for_data('read')
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 522, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/connaisseur/flask_application.py", line 110, in __async_mutate
    response = await __admit(admission_request, session)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/flask_application.py", line 142, in __admit
    await patches
  File "/app/connaisseur/flask_application.py", line 211, in __validate_image
    trusted_digest = await validator.validate(image, **validator_arguments)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 78, in validate
    return CosignValidator.__apply_policy(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 384, in __apply_policy
    raise list(vals.values())[0]["error"]
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 134, in __validation_task
    values["digest"] = await self.__get_cosign_validated_digests(image, values)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 147, in __get_cosign_validated_digests
    returncode, stdout, stderr = await self.__validate_using_trust_root(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 287, in __validate_using_trust_root
    return await self.__invoke_cosign(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 343, in __invoke_cosign
    stdout, stderr = await asyncio.wait_for(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError
Ok, that indeed looks like a regular timeout. Since the k8s API allows admission controllers like Connaisseur a maximum of 30s to respond during admission, we explicitly time out after ~29s so that we can still deliver this response.
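For reference, a minimal, self-contained illustration (not Connaisseur's code) of the pattern visible in the traceback above: asyncio.wait_for cancels the awaited work when the deadline expires, and the resulting CancelledError becomes the direct cause of the TimeoutError.

```python
import asyncio

async def slow_operation():
    # stand-in for a Cosign call that doesn't return in time
    await asyncio.sleep(60)

async def main():
    try:
        # Connaisseur uses ~29s so the answer still fits into the 30s admission window;
        # shortened here so the example finishes quickly
        await asyncio.wait_for(slow_operation(), timeout=1)
    except asyncio.TimeoutError:
        print("validation timed out, deny the request with a retry hint")

asyncio.run(main())
```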
From the traceback we see that the cancellation happened while the Cosign command was running. Internally we call the Cosign binary as SIGSTORE_NO_CACHE=1 cosign verify --output text ${IMAGE} plus a bunch of other options depending on configuration. Since that part is outside our code base, we have limited insight. If Connaisseur has log level debug set, we pass the --verbose flag to the Cosign binary as well, which should help with debugging (though it might be that these logs are lost in case of a timeout, as the handling function is cancelled 😅). What you can do otherwise is run the Cosign binary as a standalone program against your image to benchmark it with debugging enabled. If the issue is persistent for the image, this should shed some light on the root cause.
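As a starting point for such a standalone benchmark, a rough sketch along these lines could help. It assumes the cosign binary is on the PATH; the image reference and key path are placeholders, and it only mirrors the general shape of the invocation, not Connaisseur's exact arguments:

```python
import asyncio
import os
import time

IMAGE = "registry.example.com/team/app:1.2.3"  # placeholder image reference
KEY = "cosign.pub"                             # placeholder public key file

async def main():
    env = dict(os.environ, SIGSTORE_NO_CACHE="1")
    start = time.monotonic()
    # spawn cosign and bound its runtime, similar in spirit to Connaisseur's timeout
    proc = await asyncio.create_subprocess_exec(
        "cosign", "verify", "--verbose", "--output", "text", "--key", KEY, IMAGE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        env=env,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=29)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        print(f"cosign did not finish within 29s ({time.monotonic() - start:.1f}s elapsed)")
        return
    print(f"exit code {proc.returncode} after {time.monotonic() - start:.1f}s")
    print(stdout.decode())
    print(stderr.decode())  # inspect both streams for the verbose output

asyncio.run(main())
```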
I'm wondering whether Connaisseur relies on a network connection (e.g. connecting to an external registry)?
There is a LOG_LEVEL env in Connaisseur; may I set this to debug so that both Connaisseur and Cosign enter debug mode?
Another piece of info: the public key used by Cosign was re-created shortly before the incident. Not sure what Connaisseur's symptom of a missing/wrong Cosign public key would be.
I'm wondering whether Connaisseur relies on a network connection (e.g. connecting to an external registry)?
Yes. Insofar as the underlying signature framework (in this case Cosign) relies on network connection, so does Connaisseur. For Cosign, that's the registry and the transparency log (unless configured otherwise)
There is a LOG_LEVEL env in Connaisseur; may I set this to debug so that both Connaisseur and Cosign enter debug mode?
Yes, that's precisely what it will do :+1:
Another piece of info: the public key used by Cosign was re-created shortly before the incident. Not sure what Connaisseur's symptom of a missing/wrong Cosign public key would be.
That's interesting. I have a hunch what this could point to. I'm going to check whether it relates to SCTs 🤔
My hunch turned out to be false. I think I mixed up the SCT of the Fulcio CT logs with Rekor's inclusion proofs and thought there might be a timing mismatch. But apart from SCTs being a thing for Fulcio instead of Rekor, any error there should lead to a Cosign error and thus not lead to the observed timeouts.
Maybe the debug output gives us a few more ideas. Have you tried calling the cosign binary with your image to see whether it is a permanent or a transient issue?
I think this issue is transient. Connaisseur runs very well in our production environment most of the time; the timeout occurs only occasionally (less than once per week).
I thought Cosign uses the local public key for verification, so I'm wondering how it interacts with the container registry. We have specified verifyInTransparencyLog: false in the config. Does this mean it will not interact with the transparency log?
Well, the verification itself is done locally, but the signature is not present locally at verification time and needs to be fetched. The different signature frameworks have different ways of storing this trust data; Cosign chose to store it inside the registry as an artifact. As such, Cosign fetches some metadata about the image to pinpoint where a signature might be located inside the registry and then fetches the signature itself. Only then does it use the key that Connaisseur provides locally to verify the signature.
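To make that lookup concrete: with Cosign's default tag-based storage (the location that cosign triangulate prints), the signature is expected in the same repository as the image, under a tag derived from the image digest. A tiny sketch with made-up values:

```python
# Illustrative only; repository and digest are placeholders.
image_repo = "registry.example.com/team/app"
image_digest = "sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

# default convention: <repo>:sha256-<digest hex>.sig
signature_tag = image_digest.replace(":", "-") + ".sig"
signature_ref = f"{image_repo}:{signature_tag}"
print(signature_ref)
```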
Cosign additionally has its transparency log Rekor to publish signatures for transparency. Specifying verifyInTransparencyLog: false in Connaisseur does pass the --insecure-ignore-tlog flag to the Cosign binary, which should indeed skip the calls to Rekor.
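For illustration, a hypothetical sketch (not Connaisseur's actual code) of how such a config switch could translate into the Cosign command line:

```python
def build_cosign_args(image: str, key_path: str, verify_in_transparency_log: bool = True) -> list:
    """Assemble an illustrative cosign invocation; the flags are real, the helper is made up."""
    args = ["cosign", "verify", "--output", "text", "--key", key_path]
    if not verify_in_transparency_log:
        # skip the calls to Rekor during verification
        args.append("--insecure-ignore-tlog")
    args.append(image)
    return args

# e.g. verifyInTransparencyLog: false in the validator config
print(" ".join(build_cosign_args("registry.example.com/team/app:1.2.3", "cosign.pub",
                                 verify_in_transparency_log=False)))
```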
Got it. I thought the signature was attached to the image binary, but it looks like the signature is stored separately inside the image registry. We have a signing service and a private registry, so the signature should be created by the signing service and stored in the registry.
If the registry is unreachable over the network, will Cosign throw a timeout error like the above log shows, or will Connaisseur log the network issue?
Also wondering whether Rekor usually lives in the registry or in a third location.
Usually, it should error out like GET "$DOMAIN/v2/": dial tcp $DOMAIN:443: connect: connection refused. However, if some router on the way doesn't reject the traffic but swallows it like a black hole, then I'd assume a timeout could happen.
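A small, self-contained illustration of that difference (the hostname is a placeholder): a closed or rejected port fails fast with a "connection refused" error, while silently dropped traffic only surfaces once the timeout expires.

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> None:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"{host}:{port} reachable")
    except ConnectionRefusedError:
        print(f"{host}:{port} refused the connection (fails fast and loud)")
    except socket.timeout:
        print(f"{host}:{port} timed out after {timeout}s (traffic silently dropped?)")
    except OSError as err:
        print(f"{host}:{port} failed: {err}")

# Placeholder host; point this at your registry endpoint.
probe("registry.example.com", 443)
```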
Rekor is a separate service, usually at rekor.sigstore.dev (while we allow reconfiguring the URL, we haven't yet made the knobs available to really make it work on a non-standard host without manually mounting the Rekor certificate into Connaisseur). Fundamentally, image signatures are static artifacts and can thus be stored inside the registry, whereas transparency data is not static: to prove that your (static) log entry is part of the current tree, the inclusion proof changes with every tree change. The data a verifier needs from a Rekor instance therefore cannot simply be stored in a registry, which is why the interface of a Rekor instance differs from that of a registry.
Got it. We may need to contact the registry team the next time the timeout issue happens.
We have already enabled debug-level logging in one cluster. It generates a lot of logs (mainly mutate logs, since some objects are very big and get fully logged). Hopefully it will log the details when a Cosign timeout happens again.
Since the Cosign timeout is kind of out of Connaisseur's control, I'll close this issue for now. Additionally, with version 3.4.0 we switched to Go as the underlying programming language, so the event loop error specific to Python cannot happen again.
If you find any indication that the Cosign timeouts originate not from the connection but from Connaisseur, feel free to reopen this issue or open another one.
Describe the bug: We deploy Connaisseur in a k8s cluster as a webhook to validate container images. We occasionally run into a connaisseur-webhook timeout problem and find the following exception around the same time the timeout is observed.
Is this error causing Connaisseur to become unresponsive?
Optional: Versions (please complete the following information as relevant):