image-promotion hits 429 quota limits

chrischdi commented 6 months ago

What happened:

The image promotion job ran.
Image promotion failed due to unexpected status code 429 Too Many Requests.

See https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1776261613632884736

What you expected to happen:

Image promotion to succeed

How to reproduce it (as minimally and precisely as possible):

Run image promotion, probably several times in a row. In this case we (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere) cut 3 patch releases.

Anything else we need to know?:

This issue has occurred before and was mistakenly reported at

https://github.com/kubernetes/k8s.io/issues/6431

Ben pointed out that:

The image promoter makes a really high amount of API calls because of the approach to image signatures. We have not changed the quotas in the infrastructure projects.

So there may be potential to optimise promo-tools so that it needs fewer API calls and stays within the limit.

Environment:

See the prowjob :-)

Cloud provider or hardware configuration:
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Others:

chrischdi commented 5 months ago

I did try to look through the code a bit:

- kpromo normally uses a rate-limiter when using the crane library
- when using sigs.k8s.io/release-sdk/sign, e.g. to signAndReplicate (here), kpromo does not set the transport to add the rate-limiter, because release-sdk does not allow us to.
  - release-sdk runs the SignImageInternal function:
    - https://github.com/kubernetes-sigs/release-sdk/blob/main/sign/impl.go#L113-L114
  - which runs github.com/sigstore/cosign/v2/cmd/cosign/cli/sign.SignCmd(...)
  - SignCmd would allow passing a Transport (and with it a RateLimiter) through via signOpts.Registry.RegistryClientOpts
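
To illustrate the idea of attaching a rate limiter through the transport, here is a minimal sketch of an `http.RoundTripper` wrapper that spaces out requests. It is not the limiter kpromo uses and it does not call release-sdk or cosign; `rateLimitedTransport`, `stubTransport`, `runDemo`, and the `registry.invalid` URL are hypothetical names made up for this example, and the stub stands in for a real registry so the sketch runs offline.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// rateLimitedTransport is a hypothetical wrapper (not taken from promo-tools)
// that spaces out requests so that at most one starts per interval.
type rateLimitedTransport struct {
	base     http.RoundTripper
	interval time.Duration

	mu   sync.Mutex
	next time.Time // earliest time the next request may start
}

// RoundTrip reserves the next free slot, waits for it, then delegates to base.
func (t *rateLimitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	start := time.Now()
	if t.next.After(start) {
		start = t.next
	}
	t.next = start.Add(t.interval)
	t.mu.Unlock()

	// Wait for the reserved slot, but give up if the request is cancelled.
	timer := time.NewTimer(time.Until(start))
	defer timer.Stop()
	select {
	case <-timer.C:
	case <-req.Context().Done():
		return nil, req.Context().Err()
	}
	return t.base.RoundTrip(req)
}

// stubTransport answers every request with an empty 200 so the sketch runs
// without touching a real registry.
type stubTransport struct{}

func (stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// runDemo sends n requests through the limiter and returns how many
// succeeded and the elapsed wall-clock time in milliseconds.
func runDemo(n int, intervalMillis int) (int, int64) {
	client := &http.Client{Transport: &rateLimitedTransport{
		base:     stubTransport{},
		interval: time.Duration(intervalMillis) * time.Millisecond,
	}}
	began := time.Now()
	ok := 0
	for i := 0; i < n; i++ {
		resp, err := client.Get("http://registry.invalid/v2/")
		if err != nil {
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, time.Since(began).Milliseconds()
}

func main() {
	ok, ms := runDemo(3, 20)
	fmt.Printf("%d requests succeeded in %d ms\n", ok, ms)
}
```

Because the limiting lives in the transport, any client that accepts an `http.RoundTripper` would pick it up without further changes, which is why being able to pass a transport through to SignCmd matters here.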

chrischdi commented 5 months ago

Instead of adding rate-limiting, the other possibility would be to look into release-sdk and/or cosign to reduce the number of API calls made.

xmudrii commented 5 months ago

This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

sbueringer commented 5 months ago

> This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

xmudrii commented 5 months ago

> What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

If promotion fails with an error such as:

run `cip run`: promote images: signing images: replicating signatures: copying signature ...

It's generally safe to ignore it. If it fails with any other error, the job should be restarted. You can ping Release Managers in the #release-management Slack channel to restart the job for you.

It shouldn't affect the ability to consume images, but signatures might not work properly or at all if this error happens. Unfortunately, there's not much we can do at this point, but we hope we'll be able to kick off the promo-tools refactor efforts soon.
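
As a rough illustration of the triage rule above, the following sketch checks a failure message for the signature-replication stage and reports whether the job should be restarted. `needsRestart` and `ignorableStage` are hypothetical names for this example, not part of promo-tools, and the log lines are abbreviated from the kind of messages seen in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// ignorableStage is the error prefix quoted in the thread as generally safe
// to ignore.
const ignorableStage = "signing images: replicating signatures: copying signature"

// needsRestart is a hypothetical helper: it reports whether a failed
// promotion log line calls for restarting the job under the rule above.
func needsRestart(logLine string) bool {
	return !strings.Contains(logLine, ignorableStage)
}

func main() {
	lines := []string{
		"run `cip run`: promote images: signing images: replicating signatures: copying signature ...",
		"run `cip run`: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: ...",
	}
	for _, l := range lines {
		fmt.Printf("restart=%v  %s\n", needsRestart(l), l)
	}
}
```

A plain substring match is only a heuristic; a failure that mentions several stages would still need a human to look at the full log.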

cahillsf commented 5 months ago

similar failures in the patch release and minor releases for CAPI today. one patch release failing at the signing stage: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780295493562142720

time="18:09:05.150" level=fatal msg="run `cip run`: promote images: signing images: replicating signatures: copying signature us-west2-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig to southamerica-west1-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: PUT https://southamerica-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/cluster-api/clusterctl/manifests/sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'. (and 1 more errors)" diff=4.378s
{"component":"entrypoint","error":"wrapped process failed: exit status

and the minor release job failing at filtering edges: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780297426096099328

time="18:10:24.256" level=fatal msg="run `cip run`: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-central1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fcluster-api%2Fclusterctl%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=28ms
{"component":"entrypoint","error":"wrapped process failed: exit status

xmudrii commented 5 months ago

The first failure can be ignored, the second job should be restarted. Can you please send a link to the job so that we can restart it?

cahillsf commented 5 months ago

@xmudrii thanks

sorry, it's this one: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780297426096099328

xmudrii commented 5 months ago

@cahillsf Restarted the job and now it's green https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780300931636662272

cahillsf commented 5 months ago

thanks for your help @xmudrii !

BenTheElder commented 5 months ago

BenTheElder commented 2 months ago

hit this with v1.30 release https://github.com/kubernetes/kubernetes/issues/126170

also the initial promo job didn't report failure, I think? but we didn't have all regions synced

BenTheElder commented 2 months ago

time="19:28:06.925" level=info msg="Registry: gcr.io/k8s-staging-scheduler-plugins Image: controller Got: gcr.io/k8s-staging-scheduler-plugins/controller" diff=141ms
time="19:28:07.077" level=fatal msg="run cip run: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-west1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fsig-storage%2Fsnapshot-controller%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=152ms

I'm guessing there is a gap in using the rate-limit aware client.

BenTheElder commented 2 months ago

https://github.com/kubernetes-sigs/promo-tools/issues/842 ?
