How to deal with image push timeouts for multiple arch builds?

damemi commented 1 year ago

Hi, we have an image push postsubmit job for Descheduler at https://github.com/kubernetes/test-infra/blob/master/config/jobs/image-pushing/k8s-staging-descheduler.yaml

This job builds images for multiple arches, and over the past year or so we've noticed an increasing number of timeouts leading to failed image builds. The timeouts don't seem to be due to any issue other than simply taking a long time to build each image.

We've tried increasing the timeout a few times, but even at 30 minutes we're still getting failures (the latest increase came after bumping our k8s dependencies, which may be related).

It seems clear that we should parallelize these builds, which we tried in https://github.com/kubernetes-sigs/descheduler/pull/1019 by making each arch its own gcb build stage. However, this didn't seem to have any effect.

Is there a recommended way to split up these builds? Any docs or examples would be helpful, thanks.

cc @a7i @ingvagabund @knelasevero

damemi commented 1 year ago

/sig testing (not sure which SIG this falls under; since it's about the automated CI jobs, I assumed sig-testing)

BenTheElder commented 1 year ago

> It seems clear that we should parallelize these builds, which we tried in https://github.com/kubernetes-sigs/descheduler/pull/1019 by making each arch its own gcb build stage. However, this didn't seem to have any effect.

GCB stages run in serial unless you set waitFor:

https://cloud.google.com/build/docs/configuring-builds/configure-build-step-order
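
For reference, a minimal sketch of a `cloudbuild.yaml` that uses `waitFor` this way. The step ids, the image name, and the `ARCH` build argument are hypothetical and are not taken from the Descheduler config; the sketch also assumes the Dockerfile accepts an `ARCH` build argument and can cross-compile for it.

```yaml
# Sketch only: step ids, image name, and ARCH build arg are assumptions.
steps:
  # waitFor: ['-'] starts a step as soon as the build begins,
  # instead of waiting for every earlier step to finish.
  - id: build-amd64
    name: gcr.io/cloud-builders/docker
    args: ['build', '--build-arg', 'ARCH=amd64', '-t', 'gcr.io/$PROJECT_ID/example:amd64', '.']
    waitFor: ['-']
  - id: build-arm64
    name: gcr.io/cloud-builders/docker
    args: ['build', '--build-arg', 'ARCH=arm64', '-t', 'gcr.io/$PROJECT_ID/example:arm64', '.']
    waitFor: ['-']
  # The push step lists both build steps, so it runs only after both succeed.
  - id: push
    name: gcr.io/cloud-builders/docker
    args: ['push', '--all-tags', 'gcr.io/$PROJECT_ID/example']
    waitFor: ['build-amd64', 'build-arm64']
```

Without the `waitFor` fields, each step would wait for all earlier steps. Note that the steps of one build share the same worker machine, so parallel steps may not finish much faster when the build is CPU-bound.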

a7i commented 1 year ago

> GCB stages run in serial unless you set waitFor:

Hi @BenTheElder as noted in the Issue, we tried this and it didn't reduce the duration https://github.com/kubernetes-sigs/descheduler/pull/1019

BenTheElder commented 1 year ago

GCB also has different machine sizes, but it takes time to spin up custom machine sizes.
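
As a rough illustration, the machine size is selected under `options` in the build config. The specific machine type below is only an example choice, not a recommendation from this thread.

```yaml
# Sketch only: the machine type is an example, pick one that fits the build.
options:
  # Larger worker VM; non-default machine types can add startup delay.
  machineType: E2_HIGHCPU_8
```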

I recommend doing multiple architectures in parallel using buildx and pushing a single multi-arch image directly, FWIW, but that may not improve cold build times.
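
A minimal sketch of the buildx approach as a single Cloud Build step. The step id, builder name, image name, and platform list are hypothetical. It assumes the builder image ships `bash` and the `docker buildx` plugin, and that the Dockerfile can cross-compile (or that QEMU emulation is already registered on the worker); none of that is confirmed in this thread.

```yaml
# Sketch only: assumes buildx and bash exist in the builder image.
steps:
  - id: buildx-multiarch
    name: gcr.io/cloud-builders/docker
    entrypoint: bash
    args:
      - -c
      - |
        set -o errexit
        # Create and select a buildx builder that can target several platforms.
        docker buildx create --name multiarch --use
        # Build all listed platforms in one invocation and push a single
        # multi-arch image (manifest list) straight to the registry.
        docker buildx build \
          --platform linux/amd64,linux/arm64 \
          --tag gcr.io/$PROJECT_ID/example:latest \
          --push .
```

buildx builds the listed platforms concurrently inside one invocation, which removes the need for one step per arch and a separate manifest step.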

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/test-infra/issues/28258#issuecomment-1575264328).

kubernetes / test-infra

How to deal with image push timeouts for multiple arch builds? #28258