pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

Wait for docker build #6013

Closed huydhn closed 1 month ago

huydhn commented 1 month ago

This is a short-term mitigation for https://github.com/pytorch/pytorch/issues/141885, in which any change touching .ci/docker causes all builds to fail until the docker build workflow finishes building the images.

At the moment, we don't have a good way to tell the build workflow to wait for the new Docker image, so my fix here injects a delay when the action is called by _linux_build. It will wait up to 90 minutes for the Docker build to finish.
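For illustration only, the "wait up to 90 minutes" behavior can be sketched as a poll-with-deadline loop. This is not the code from this PR: the function name `wait_for_image`, the `image_exists` callback, and the 60-second polling interval are all assumptions made for the sketch. Only the 90-minute cap comes from the description above.

```python
import time
from typing import Callable


def wait_for_image(
    image_exists: Callable[[], bool],  # hypothetical check, e.g. a registry lookup
    timeout_s: float = 90 * 60,        # 90-minute cap from the PR description
    interval_s: float = 60,            # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until image_exists() returns True or the timeout elapses.

    Returns True if the image showed up in time, False otherwise.
    """
    deadline = now() + timeout_s
    while True:
        if image_exists():
            return True
        remaining = deadline - now()
        if remaining <= 0:
            return False
        # Never sleep past the deadline, so the final check happens on time.
        sleep(min(interval_s, remaining))
```

Passing `sleep` and `now` as parameters keeps the loop testable without real waiting. A caller would presumably fall back to building the image itself when the function returns False.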

Testing

https://github.com/pytorch/pytorch/pull/142177

vercel[bot] commented 1 month ago

The latest updates on your projects.

1 Skipped Deployment

| Name | Status | Preview | Updated (UTC) |
| :--- | :----- | :------ | :------ |
| **torchci** | ⬜️ Ignored ([Inspect](https://vercel.com/fbopensource/torchci/DAjnJg9FsEnVBosXTUByovojhAK8)) | [Visit Preview](https://torchci-git-wait-for-docker-build-fbopensource.vercel.app) | Dec 5, 2024 10:35pm |

huydhn commented 1 month ago

Closing in favor of https://github.com/pytorch/pytorch/pull/142109

huydhn commented 1 month ago

After chatting with @malfet, let's try this one instead, because https://github.com/pytorch/pytorch/pull/142109#pullrequestreview-2482524025 adds a few more minutes to the workflow TTS.

chuanqi129 commented 1 month ago

Hi @huydhn, I noticed some failures in the calculate-docker-image step of the xpu CI test jobs, for example https://github.com/pytorch/pytorch/actions/runs/12198235184/job/34036392093?pr=140664#step:6:160. I suspect these failures are related to the changes in this PR. Could you please double-check?
Another issue is that the build job seems to take longer than before: https://github.com/pytorch/pytorch/actions/runs/12198235184/job/34029552956?pr=140664#step:7:1. Is that expected?

huydhn commented 1 month ago

@chuanqi129 Thank you for the fix in https://github.com/pytorch/pytorch/pull/142298! It's the correct fix. The failure you saw actually highlights a problem that was hidden before. If a new Docker image is not added to the docker build workflow, the image is rebuilt in every build and test job that depends on it, which is a huge waste of time.

Let me take an action item to write a linter check for this, to make sure that adding a new Docker image requires a corresponding update to the docker build workflow.
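A rough sketch of what such a linter check might look like, under stated assumptions: the `docker-image-name:` key used to find image references, the function names, and the idea of passing workflow contents in as strings are all hypothetical choices for illustration, not the actual linter or the actual workflow layout.

```python
import re
from typing import Iterable

# Assumed key under which a job workflow names its Docker image (hypothetical).
IMAGE_KEY = re.compile(r"^[ \t]*docker-image-name:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def images_used(workflow_text: str) -> set[str]:
    """Collect every image name referenced in one workflow file's text."""
    return set(IMAGE_KEY.findall(workflow_text))


def missing_from_docker_build(
    job_workflows: Iterable[str], built_images: set[str]
) -> list[str]:
    """Return images that jobs reference but the docker build workflow does not build."""
    used: set[str] = set()
    for text in job_workflows:
        used |= images_used(text)
    return sorted(used - built_images)
```

A lint step could fail whenever `missing_from_docker_build` returns a non-empty list, which would flag a newly added image before it is silently rebuilt in every dependent job.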