sse-secure-systems / connaisseur

An admission controller that integrates Container Image Signature Verification into a Kubernetes cluster
https://sse-secure-systems.github.io/connaisseur/
Apache License 2.0

"RuntimeError: Event loop is closed" error found in connaissuer #1477

Closed. qx121 closed this issue 6 months ago

qx121 commented 9 months ago

Describe the bug
We deploy connaisseur in a k8s cluster as a webhook to validate container images. We occasionally run into a connaisseur-webhook timeout problem and find the following exception around the same time the timeout is observed.

Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f224ca5a3e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/usr/local/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/usr/local/lib/python3.11/asyncio/unix_events.py", line 765, in close
    self.write_eof()
  File "/usr/local/lib/python3.11/asyncio/unix_events.py", line 751, in write_eof
    self._loop.call_soon(self._call_connection_lost, None)
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
    self._check_closed()
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')

Is this error causing connaisseur to become unresponsive?

Optional: Versions (please complete the following information as relevant):

Starkteetje commented 9 months ago

Hi @qx121, I see you're still running version 3.0.0. We know about this exception and have since released fixes that should prevent the issue from popping up (or at least significantly reduce its occurrence), so upgrading should resolve it. (Also, we're working on a Golang rewrite, which should bring additional performance/stability improvements and which we hope to bring live in the not too distant future.)

As for the timeout errors: the above error is due to a bug in how we used the asyncio event loop. The exception fails the validation, which should show up as an "unknown error" in the admission response (as there's no dedicated handling for this error). However, the exception should be handled in that generic sense by the server, so it should a) be a generic error, not a timeout error, and b) not affect other requests. The only way I could see it actually causing timeouts would be if the exception happened within the server itself rather than in our application code (thus bypassing the generic error handling), but neither Flask nor cheroot uses asyncio, so that shouldn't be the case either.
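
(For context, this kind of "Event loop is closed" message typically appears when an asyncio subprocess transport is only garbage-collected after the event loop that created it has already been closed; the transport's __del__ then tries to schedule its cleanup on the dead loop, which is exactly the call chain in the traceback above. A minimal, hypothetical sketch of that failure mode, not Connaisseur's actual code, could look like this:)

import asyncio
import gc

processes = []

async def spawn():
    # Start a subprocess and keep a reference, so the subprocess transport
    # is still alive when the loop is shut down.
    proc = await asyncio.create_subprocess_exec(
        "sleep", "1", stdout=asyncio.subprocess.PIPE
    )
    processes.append(proc)

loop = asyncio.new_event_loop()
loop.run_until_complete(spawn())
loop.close()       # the loop is closed while the subprocess transport still exists
processes.clear()  # dropping the last reference triggers BaseSubprocessTransport.__del__,
gc.collect()       # which may now print "Exception ignored ... Event loop is closed"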

I'd suggest you upgrade to Connaisseur's latest version and see whether that fixes your problem with the timeouts as well. If not, it'd be interesting to know whether the Pods show restarts when the timeouts happen (indicating that the server is affected) or whether the timeouts coincide with another (or the same) error.

qx121 commented 9 months ago

@Starkteetje Thanks for the detailed clarification! We would like to fix the timeout issue, which directly affects service performance (e.g. when it times out, pod creation is affected). It is good to know the event loop issue doesn't cause the timeouts.

Connaisseur itself looks good during the incident. CPU usage is fine (the incident happened around 23:00 in the following graph; we deleted the connaisseur-webhook shortly after receiving the alert), and no connaisseur pod restarts were observed.

Is it possible that connaisseur is for some reason not reachable via the network? Also wondering whether connaisseur logs any errors when a validation request never arrives or is denied.

[Screenshot 2024-01-29 at 5:31:41 PM: CPU usage graph]

Starkteetje commented 9 months ago

On the default log level, Connaisseur should log

From these alone you should be able to discern whether Connaisseur received the admission request. Otherwise, you can switch to the debug log level, which will additionally log the incoming admission request; that would be helpful for isolating the images that lead to timeouts for you.

However, Connaisseur should have been reachable. If the server had died, the readiness and health probes would have failed and you would have seen Pod restarts. Thus, I'd presume the servers were alive. For connectivity to break, the options I see are that either the webhook, the certificate, the service, or the network would have needed a change or disturbance. Connaisseur doesn't change its resources on the fly, so unless there was some manual change, neither the webhook nor the service should have changed. The TLS certificate could have expired and thus failed, but that would be stated as the explicit reason by the k8s API on admission failure (and also wouldn't lead to a timeout, but to a failure). As for the network inside the cluster, I'd guess that if that had failed, you'd have seen far more things failing than just Connaisseur, so it seems unlikely to me.

Some more things that might be interesting for getting to the root of this issue:

qx121 commented 9 months ago

@Starkteetje we are not seeing timeout/denied-request entries in the connaisseur log. These are internal images, and no deployment occurred at the time of the incident.

I think with the above info we can narrow the issue down to connectivity to connaisseur rather than the connaisseur pod itself? I checked the certs; they were generated with 100 years of expiration, so that should be fine. Wondering whether the api-server or any network component would log this type of connectivity problem?

qx121 commented 9 months ago

I retrieved logs from multiple clusters. No additional error messages were found. The "event loop" error during the incident is the only connaisseur error I have seen.

Starkteetje commented 9 months ago

Hmm, that's not really much to go on. I'd recommend that you perform the Connaisseur upgrade and check whether this happens again, since, even though I can't see them being connected, the event loop failures are the only data we have so far. As for additional logs, I'm not deeply into k8s administration, but I'd say your best bet is the api-server or kubelet logs.

qx121 commented 9 months ago

@Starkteetje Thank you! We can plan an upgrade to see if it eases the situation. I think v3.3.0 is the suggested version to upgrade to?

Starkteetje commented 9 months ago

Actually, v3.3.3 is the newest release, so I'd recommend you upgrade to that.

qx121 commented 9 months ago

got it. will pick up v3.3.3

qx121 commented 9 months ago

Oh, I think I found the error that directly corresponds to the timeout issue:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/subprocess.py", line 198, in communicate
    stdin, stdout, stderr = await tasks.gather(stdin, stdout, stderr)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/subprocess.py", line 178, in _read_stream
    output = await stream.read()
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 683, in read
    block = await self.read(self._limit)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 690, in read
    await self._wait_for_data('read')
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 522, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/connaisseur/flask_application.py", line 110, in __async_mutate
    response = await __admit(admission_request, session)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/flask_application.py", line 142, in __admit
    await patches
  File "/app/connaisseur/flask_application.py", line 211, in __validate_image
    trusted_digest = await validator.validate(image, **validator_arguments)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 78, in validate
    return CosignValidator.__apply_policy(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 384, in __apply_policy
    raise list(vals.values())[0]["error"]
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 134, in __validation_task
    values["digest"] = await self.__get_cosign_validated_digests(image, values)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 147, in __get_cosign_validated_digests
    returncode, stdout, stderr = await self.__validate_using_trust_root(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 287, in __validate_using_trust_root
    return await self.__invoke_cosign(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/connaisseur/validators/cosign/cosign_validator.py", line 343, in __invoke_cosign
    stdout, stderr = await asyncio.wait_for(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError

Starkteetje commented 9 months ago

Ok, that indeed looks like a regular timeout. Since the k8s API allows admission controllers like Connaisseur a maximum of 30s to respond during admission, we explicitly time out after ~29s so that we can still deliver a response.

From the traceback we see that the cancellation happened while the Cosign command was running. We internally call the Cosign binary as SIGSTORE_NO_CACHE=1 cosign verify --output text ${IMAGE} plus a bunch of other options depending on configuration. Since that part is outside our code base, we have limited insight. If Connaisseur has the debug log level set, we pass the --verbose flag to the Cosign binary as well, which should help with debugging (though these logs might be lost in case of a timeout, as the handling function is cancelled 😅). What you can do otherwise is run the Cosign binary as a standalone program against your image to benchmark it with debugging enabled. If the issue is persistent for the image, this should shed some light on the root cause.
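
To make that concrete, here is a rough standalone sketch of the same idea: call the Cosign binary via asyncio and give up after a deadline, similar in spirit to what the traceback shows Connaisseur doing. The image reference, key file name, and the 25s deadline below are made-up placeholders, and the exact flags depend on your Connaisseur configuration:

import asyncio
import os
import time

async def invoke_cosign(image, pubkey, timeout=25.0):
    # Invocation in the style described above: signature cache disabled via
    # SIGSTORE_NO_CACHE, --verbose roughly corresponding to debug log level.
    process = await asyncio.create_subprocess_exec(
        "cosign", "verify", "--key", pubkey, "--output", "text", "--verbose", image,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        env={**os.environ, "SIGSTORE_NO_CACHE": "1"},
    )
    try:
        # Cancel well before the 30s admission deadline would be exceeded.
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout)
    except asyncio.TimeoutError:
        process.kill()
        await process.wait()
        raise
    return process.returncode, stdout.decode(), stderr.decode()

async def main():
    start = time.monotonic()
    code, out, err = await invoke_cosign("registry.example.com/team/app:1.0", "cosign.pub")
    print(f"cosign exited {code} after {time.monotonic() - start:.1f}s")
    print(err)

asyncio.run(main())

Running something like this repeatedly against the affected image (or simply timing the cosign verify command itself) should make it visible whether the slowness sits in the registry round-trips.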

qx121 commented 9 months ago

I'm wondering whether connaisseur relies on a network connection (e.g. connecting to an external registry)?

There is a LOG_LEVEL env var in connaisseur; may I set this to debug so that both connaisseur and cosign enter debug mode?

Another piece of info: the pub key used by cosign was re-created shortly before the incident. Not sure what connaisseur's symptom of a missing/wrong cosign pub key would be.

Starkteetje commented 9 months ago

I'm wondering whether connaisseur relies on a network connection (e.g. connecting to an external registry)?

Yes. Insofar as the underlying signature framework (in this case Cosign) relies on a network connection, so does Connaisseur. For Cosign, that means the registry and the transparency log (unless configured otherwise).

There is a LOG_LEVEL env var in connaisseur; may I set this to debug so that both connaisseur and cosign enter debug mode?

Yes, that's precisely what it will do :+1:

Another piece of info: the pub key used by cosign was re-created shortly before the incident. Not sure what connaisseur's symptom of a missing/wrong cosign pub key would be.

That's interesting. I have a hunch what this could point to. I'm going to check whether it relates to SCTs 🤔

Starkteetje commented 9 months ago

My hunch turned out to be false. I think I mixed up the SCT of the Fulcio CT logs with Rekor's inclusion proofs and thought there might be a timing mismatch. But apart from SCTs being a thing for Fulcio instead of Rekor, any error there should lead to a Cosign error and thus not lead to the observed timeouts.

Maybe the debug output gives us a few more ideas. Have you tried calling the cosign binary with your image to see whether it is a permanent or a transient issue?

qx121 commented 9 months ago

I think this issue is transient. Connaisseur runs very well in our production environment most of the time; the timeout occurs only occasionally (less than once per week).

I thought cosign uses the local pub key for verification; I'm wondering how it interacts with the container registry.

We have specified this in the config: verifyInTransparencyLog: false. Does this mean it will not interact with the transparency log?

Starkteetje commented 9 months ago

Well, the verification itself is done locally, but the signature is not present locally at verification time and needs to be fetched. The different signature frameworks have different ways of storing this trust data; Cosign chose to store it inside the registry as an artifact. As such, Cosign will fetch some metadata about the image to pinpoint the location where a signature might be located inside the registry and then fetch it. Only then does it use the key that Connaisseur provides locally to verify the signature.

Cosign additionally has its transparency log, Rekor, to publish signatures for transparency. Specifying verifyInTransparencyLog: false in Connaisseur passes the --insecure-ignore-tlog flag to the Cosign binary, which should indeed skip the calls to Rekor.
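
As a side note on where those registry round-trips go: by default, Cosign uses tag-based discovery, storing the signature as an OCI artifact in the same repository under a tag derived from the image digest, which is why the registry has to be reachable even though the key check itself is local. A small illustrative sketch of that naming convention (the helper, repository, and digest are made up):

# Cosign's default tag-based discovery: for an image resolved to a digest,
# the signature artifact is expected at <repository>:<algo>-<hex>.sig in the
# same registry, so verification needs registry access to resolve the digest
# and fetch the signature before the local public key is ever used.
def signature_tag(repository: str, digest: str) -> str:
    algo, hex_digest = digest.split(":", 1)
    return f"{repository}:{algo}-{hex_digest}.sig"

print(signature_tag(
    "registry.example.com/team/app",
    "sha256:" + "ab" * 32,   # made-up digest
))
# -> registry.example.com/team/app:sha256-abab...abab.sig

The actual lookup and fetch are performed by the Cosign binary; the helper above only illustrates the naming convention and the extra registry dependency it implies.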

qx121 commented 9 months ago

Got it. I thought the signature was attached to the image binary, but it looks like the signature is stored separately inside the image registry. We have a signing service and a private registry, so the signature should be created by the signing service and stored in the registry.

If the registry is unreachable over the network, will cosign throw a timeout error as in the log above, or will connaisseur log this network issue?

Also wondering whether Rekor usually lives in the registry or at a third location.

Starkteetje commented 9 months ago

Usually, it should error out with something like GET "$DOMAIN/v2/": dial tcp $DOMAIN:443: connect: connection refused. However, if some router on the way doesn't reject the traffic but swallows it like a black hole, then I'd assume a timeout could happen.

Rekor is a separate service, usually at rekor.sigstore.dev (while we allow reconfiguring the URL, we haven't yet made the knobs available to really make it work on a non-standard host without manually mounting the Rekor certificate into Connaisseur). But fundamentally, while image signatures are static artifacts and can thus be stored inside the registry, transparency data is non-static in a sense: to prove that your (static) log entry is part of the current tree, the inclusion proof changes with each tree change. The data a verifier needs from the Rekor instance therefore cannot simply be stored in a registry, which is why the interface of a Rekor instance differs from that of a registry.

qx121 commented 9 months ago

Got it. We may need to contact the registry team the next time the timeout issue happens.

We have already enabled debug-level logging in one cluster. It looks like lots of logs are generated (mainly mutate logs, since some objects are very big and get fully logged). Hopefully it will log details when the cosign timeout happens again.

Starkteetje commented 6 months ago

Since the Cosign timeout is kind of out of Connaisseur's control, I'll close this issue for now. Additionally, with version 3.4.0 we switched to Go as the underlying programming language, so the event loop error specific to Python cannot happen again.

If you find any indication that the Cosign timeouts originate not from the connection but from Connaisseur, feel free to reopen this or open another issue.