kubernetes-sigs / promo-tools

Container and file artifact promotion tooling for the Kubernetes project
Apache License 2.0

Spike: find out why the post-k8sio-image-promo job takes so long #1125

Closed · LappleApple closed this issue 3 weeks ago

LappleApple commented 7 months ago

Objective

Context and things to think about while working on this task


ameukam commented 7 months ago

One thing affecting execution time is the number of container registries serving the promoted images: we went from 3 GCR registries to more than 10 AR (Artifact Registry) repositories.
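To make the scaling concrete, a back-of-the-envelope sketch; the image count and registry counts below are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	// Each promoted image must be replicated to every destination
	// registry, so total copy operations scale linearly with the
	// registry count. Numbers here are hypothetical.
	images := 20
	oldRegistries := 3  // the former GCR setup
	newRegistries := 10 // the AR setup (more than 10 in practice)

	fmt.Printf("copies before: %d\n", images*oldRegistries) // 60
	fmt.Printf("copies after:  %d\n", images*newRegistries) // 200
}
```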

meganwolf0 commented 7 months ago

/assign @meganwolf0

meganwolf0 commented 5 months ago

To address this issue, a sampling of image promo jobs created for various image promotions was collected (see file name + date for reference back to the source). The idea was to see what share of the total time was spent in each step of the promotion; a log-parsing sketch follows the table below.

| File | get promotion edges | validate signatures | promote images | signing images | total time | # promos |
|------|--------------------:|--------------------:|---------------:|---------------:|-----------:|---------:|
| gcp-filestore-csi-driver-01-23.txt | 9.6% | 1.1% | 24.2% | 64.8% | 0:04:57.240000 | 20 |
| kueue-01-18.txt | 14.3% | 1.8% | 17.2% | 66.5% | 0:03:00.802000 | 20 |
| cluster-api-azure-controller-01-18.txt | 13.7% | 1.3% | 23.0% | 61.6% | 0:03:44.085000 | 20 |
| metrics-server-01-23.txt | 8.8% | 1.0% | 16.0% | 74.1% | 0:04:52.458000 | 20 |
| ibm-powervs-block-csi-driver-01-26.txt | 11.4% | 1.1% | 18.6% | 68.4% | 0:04:09.189000 | 22 |
| ingress-nginx-controller-01-23.txt | 10.7% | 1.3% | 33.0% | 54.8% | 0:07:04.906000 | 40 |
| ingress-nginx-controller-01-26.txt | 13.1% | 1.3% | 34.6% | 50.8% | 0:06:35.702000 | 44 |
| kubecross-01-19.txt | 10.7% | 1.4% | 60.5% | 27.3% | 0:15:03.641000 | 100 |
| kube-cross-01-26.txt | 15.6% | 2.3% | 56.2% | 25.6% | 0:11:31.794000 | 110 |
| go-runner-01-19.txt | 20.6% | 3.1% | 20.1% | 56.0% | 0:08:09.598000 | 120 |
| debian-base-01-27.txt | 21.8% | 3.3% | 23.8% | 50.9% | 0:07:38.530000 | 132 |
| provider-os-01-18.txt | 14.6% | 2.5% | 52.1% | 30.7% | 0:12:18.403000 | 140 |
| kube-cross-01-10.txt | 5.0% | 3.7% | 75.5% | 15.9% | 0:31:11.250000 | 500 |
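For reference, a minimal sketch of how a per-step breakdown like the one above can be computed from a job log. The line format assumed here (RFC 3339 timestamp, START/END marker, step name) is hypothetical and only for illustration; the real promo job logs would need their own parsing:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// Assumed log format (hypothetical):
//   2024-01-23T10:00:00Z START get promotion edges
//   2024-01-23T10:00:29Z END get promotion edges
func main() {
	starts := map[string]time.Time{}
	durations := map[string]time.Duration{}
	var total time.Duration

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), " ", 3)
		if len(fields) != 3 {
			continue
		}
		ts, err := time.Parse(time.RFC3339, fields[0])
		if err != nil {
			continue
		}
		switch fields[1] {
		case "START":
			starts[fields[2]] = ts
		case "END":
			if begin, ok := starts[fields[2]]; ok {
				d := ts.Sub(begin)
				durations[fields[2]] += d
				total += d
			}
		}
	}

	// Print each step's share of the total measured time.
	for step, d := range durations {
		fmt.Printf("%-25s %5.1f%% (%s)\n", step, 100*float64(d)/float64(total), d)
	}
}
```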

With more images, the actual promotion itself was the greatest share of total time; with fewer images, the validating/signing/replicating portions took up a larger share of the job.

(Wondering whether the variability in promotion times is in part driven by network conditions that vary day to day. Would it make sense to schedule these jobs for traffic optimization?)

The breakdown of the pieces of the jobs shows that, for a lot of images, it's undoubtedly the "promote images" portion that takes the most time. To parallelize some of this work, you'd need different jobs making the requests so that rate limiting could be circumvented (see the sketch below).

(Do multiple machines bypass the rate limiting? Is the limit per user credential, per target registry, or solely per origin IP?)
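As a rough sketch of what throttled parallelism inside a single job could look like; `copyImage`, the image list, and the 5-copies-per-second budget are all hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// copyImage stands in for the real registry copy; it is a
// placeholder for illustration only.
func copyImage(ref string) {
	time.Sleep(200 * time.Millisecond) // simulate network work
	fmt.Println("copied", ref)
}

func main() {
	images := []string{"img-a", "img-b", "img-c", "img-d", "img-e", "img-f"}

	// A shared ticker throttles copy starts to ~5/s so the pool as a
	// whole stays under a per-client quota. Whether the quota is per
	// origin IP, per credential, or per target registry (the open
	// question above) determines whether splitting the work across
	// separate jobs actually helps.
	limiter := time.Tick(200 * time.Millisecond)

	var wg sync.WaitGroup
	for _, ref := range images {
		<-limiter // wait for a token before starting each copy
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			copyImage(r)
		}(ref)
	}
	wg.Wait()
}
```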

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/promo-tools/issues/1125#issuecomment-2196943280):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.