sigstore / cosign

Code signing and transparency for containers and binaries
Apache License 2.0

Intermittent connection timeouts when trying to sign container images #2660

Closed: nbusseneau closed this issue 4 months ago

nbusseneau commented 1 year ago

Hi,

We've recently started using cosign at Cilium to sign our images during the build process.
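
For context, the failing step is a keyless sign of the image digest, roughly like the sketch below (illustrative only, not the exact workflow step; the COSIGN_EXPERIMENTAL=1 gate for keyless signing and the IMAGE_DIGEST_REF variable are assumptions for cosign 1.x):

    # Illustrative sketch only (not the exact Cilium workflow step).
    # Assumes cosign 1.x, where keyless signing via Fulcio/Rekor is
    # enabled with COSIGN_EXPERIMENTAL=1; IMAGE_DIGEST_REF is a placeholder.
    export COSIGN_EXPERIMENTAL=1
    cosign sign "${IMAGE_DIGEST_REF}"  # e.g. quay.io/cilium/docker-plugin-ci@sha256:...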

Today, over a few hours (roughly 8:00 to 14:00 GMT), we saw intermittent connection timeouts when trying to sign the images, which looked like this:

Generating ephemeral keys...
Retrieving signed certificate...

        Note that there may be personally identifiable information associated with this signed artifact.
        This may include the email address associated with the account with which you authenticate.
        This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.
Error: signing [quay.io/cilium/docker-plugin-ci@sha256:e5200a8946b2c8eb5ae0bfc6c604f6f820a31ffffe12b20f5f11e9b7c6f0022d]: getting signer: getting key from Fulcio: retrieving cert: client: Post "https://fulcio.sigstore.dev/api/v1/signingCert": dial tcp 104.155.154.165:443: i/o timeout
main.go:62: error during command execution: signing [quay.io/cilium/docker-plugin-ci@sha256:e5200a8946b2c8eb5ae0bfc6c604f6f820a31ffffe12b20f5f11e9b7c6f0022d]: getting signer: getting key from Fulcio: retrieving cert: client: Post "https://fulcio.sigstore.dev/api/v1/signingCert": dial tcp 104.155.154.165:443: i/o timeout

Seemed to happen with both Fulcio (above) and Rekor (below):

Generating ephemeral keys...
Retrieving signed certificate...

        Note that there may be personally identifiable information associated with this signed artifact.
        This may include the email address associated with the account with which you authenticate.
        This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.
Successfully verified SCT...
Error: signing [quay.io/cilium/operator-alibabacloud-ci@sha256:624ffffcff2f02030068250dcd6b5b917233ebad1b6af6cd924205305ff105f8]: signing digest: Post "https://rekor.sigstore.dev/api/v1/log/entries": dial tcp 104.155.154.165:443: i/o timeout
main.go:62: error during command execution: signing [quay.io/cilium/operator-alibabacloud-ci@sha256:624ffffcff2f02030068250dcd6b5b917233ebad1b6af6cd924205305ff105f8]: signing digest: Post "https://rekor.sigstore.dev/api/v1/log/entries": dial tcp 104.155.154.165:443: i/o timeout

Some failed workflow runs with full logs:

This could be due to instability on the GitHub runners' side, but it might also be instability on Fulcio's / Rekor's side. Is there any known instability at the moment? I've checked https://status.sigstore.dev/, but it reports everything as OK.
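
If it happens again, one way to narrow down runner-side vs. service-side issues would be to probe both endpoints directly from the failing runner, e.g. (just a sketch; the curl invocation and the 10s timeout are arbitrary choices, and any HTTP response at all would rule out the dial timeout seen above):

    # Sketch of a connectivity probe to run from the affected CI runner.
    # The endpoints match the errors above; curl and the --max-time value
    # are arbitrary choices. Any HTTP status code means the TCP dial worked.
    for url in https://fulcio.sigstore.dev https://rekor.sigstore.dev; do
      echo -n "${url}: "
      curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "${url}" || echo "unreachable"
    done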

I've seen #2198, which is likely related, but I couldn't be sure since that issue doesn't include more detailed logs.

Thanks for your work 😄

haydentherapper commented 1 year ago

I've also seen intermittent failures when querying Rekor.

cc @pwelch, did you see any alerts last night?

pwelch commented 1 year ago

I did wake up to some failed probers overnight, which I attributed to the Azure network issues.

I'm leaning towards these 3 builds also having network issues.

nbusseneau commented 1 year ago

Yeah, at first I was thinking that as well, but since I'd seen failures even after the Azure network issues were resolved, I thought it might not be related after all: https://github.com/cilium/cilium/actions/runs/4005699796/jobs/6876372005

But it might also just be that the issue was marked resolved before all affected runners were cycled out, so... Feel free to close if you're confident the situation is stable; I can always re-open if I see something strange happening again.