kubernetes-sigs / promo-tools

Container and file artifact promotion tooling for the Kubernetes project
Apache License 2.0

Spike: find out why the post-k8sio-image-promo job takes so long #1125

Closed · LappleApple closed this issue 3 weeks ago

LappleApple commented 7 months ago

Objective

Context and things to think about while working on this task


ameukam commented 7 months ago

One thing affecting execution time is the number of container registries serving the promoted images: we went from 3 GCR registries to more than 10 AR (Artifact Registry) repositories.
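To make the scaling concrete, a back-of-the-envelope sketch; the image count and registry counts below are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	// Each promoted image must be replicated to every destination
	// registry, so total copy operations scale linearly with the
	// registry count. Numbers here are hypothetical.
	images := 20
	oldRegistries := 3  // the former GCR setup
	newRegistries := 10 // the AR setup (more than 10 in practice)

	fmt.Printf("copies before: %d\n", images*oldRegistries) // 60
	fmt.Printf("copies after:  %d\n", images*newRegistries) // 200
}
```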

meganwolf0 commented 7 months ago

/assign @meganwolf0

meganwolf0 commented 5 months ago

To address this issue, a sampling of image promo jobs created for various image promotions was collected (see file name + date for reference back to the source). The idea was to see what share of the total time was spent in each step of the promotion; a log-parsing sketch follows the table below.

| File | get promotion edges | validate signatures | promote images | signing images | total time | # promos |
|------|--------------------:|--------------------:|---------------:|---------------:|-----------:|---------:|
| gcp-filestore-csi-driver-01-23.txt | 9.6% | 1.1% | 24.2% | 64.8% | 0:04:57.240000 | 20 |
| kueue-01-18.txt | 14.3% | 1.8% | 17.2% | 66.5% | 0:03:00.802000 | 20 |
| cluster-api-azure-controller-01-18.txt | 13.7% | 1.3% | 23.0% | 61.6% | 0:03:44.085000 | 20 |
| metrics-server-01-23.txt | 8.8% | 1.0% | 16.0% | 74.1% | 0:04:52.458000 | 20 |
| ibm-powervs-block-csi-driver-01-26.txt | 11.4% | 1.1% | 18.6% | 68.4% | 0:04:09.189000 | 22 |
| ingress-nginx-controller-01-23.txt | 10.7% | 1.3% | 33.0% | 54.8% | 0:07:04.906000 | 40 |
| ingress-nginx-controller-01-26.txt | 13.1% | 1.3% | 34.6% | 50.8% | 0:06:35.702000 | 44 |
| kubecross-01-19.txt | 10.7% | 1.4% | 60.5% | 27.3% | 0:15:03.641000 | 100 |
| kube-cross-01-26.txt | 15.6% | 2.3% | 56.2% | 25.6% | 0:11:31.794000 | 110 |
| go-runner-01-19.txt | 20.6% | 3.1% | 20.1% | 56.0% | 0:08:09.598000 | 120 |
| debian-base-01-27.txt | 21.8% | 3.3% | 23.8% | 50.9% | 0:07:38.530000 | 132 |
| provider-os-01-18.txt | 14.6% | 2.5% | 52.1% | 30.7% | 0:12:18.403000 | 140 |
| kube-cross-01-10.txt | 5.0% | 3.7% | 75.5% | 15.9% | 0:31:11.250000 | 500 |
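For reference, a minimal sketch of how a per-step breakdown like the one above can be computed from a job log. The line format assumed here (RFC 3339 timestamp, START/END marker, step name) is hypothetical and only for illustration; the real promo job logs would need their own parsing:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// Assumed log format (hypothetical):
//   2024-01-23T10:00:00Z START get promotion edges
//   2024-01-23T10:00:29Z END get promotion edges
func main() {
	starts := map[string]time.Time{}
	durations := map[string]time.Duration{}
	var total time.Duration

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), " ", 3)
		if len(fields) != 3 {
			continue
		}
		ts, err := time.Parse(time.RFC3339, fields[0])
		if err != nil {
			continue
		}
		switch fields[1] {
		case "START":
			starts[fields[2]] = ts
		case "END":
			if begin, ok := starts[fields[2]]; ok {
				d := ts.Sub(begin)
				durations[fields[2]] += d
				total += d
			}
		}
	}

	// Print each step's share of the total measured time.
	for step, d := range durations {
		fmt.Printf("%-25s %5.1f%% (%s)\n", step, 100*float64(d)/float64(total), d)
	}
}
```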

With more images, the actual promotion itself was the greatest share of total time; with fewer images, the validating/signing/replicating portions took up a larger share of the job.

(Wondering whether the variability in promotion times is in part driven by network conditions that vary day to day. Would it make sense to schedule these jobs for traffic optimization?)

The breakdown of the pieces of the jobs shows that, for a lot of images, it's undoubtedly the "promote images" portion that takes the most time. To parallelize some of this work, you'd need different jobs making the requests so that rate limiting could be circumvented (see the sketch below).

(Do multiple machines bypass the rate limiting? Is the limit per user credential, per target registry, or solely per origin IP?)
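As a rough sketch of what throttled parallelism inside a single job could look like; `copyImage`, the image list, and the 5-copies-per-second budget are all hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// copyImage stands in for the real registry copy; it is a
// placeholder for illustration only.
func copyImage(ref string) {
	time.Sleep(200 * time.Millisecond) // simulate network work
	fmt.Println("copied", ref)
}

func main() {
	images := []string{"img-a", "img-b", "img-c", "img-d", "img-e", "img-f"}

	// A shared ticker throttles copy starts to ~5/s so the pool as a
	// whole stays under a per-client quota. Whether the quota is per
	// origin IP, per credential, or per target registry (the open
	// question above) determines whether splitting the work across
	// separate jobs actually helps.
	limiter := time.Tick(200 * time.Millisecond)

	var wg sync.WaitGroup
	for _, ref := range images {
		<-limiter // wait for a token before starting each copy
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			copyImage(r)
		}(ref)
	}
	wg.Wait()
}
```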

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/promo-tools/issues/1125#issuecomment-2196943280):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.