pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
1.28k stars, 115 forks

Add a workflow to build torchtitan-ubuntu-20.04-clang12 Docker image for CI #338

Closed: huydhn closed this 1 month ago

huydhn commented 1 month ago

Adopted from PyTorch, this workflow prepares the Docker image torchtitan-ubuntu-20.04-clang12 for CI.

torchtitan-ubuntu-20.04-clang12 can then be used as the input for docker-image.

wconstab commented 1 month ago

Looking at the CI results from this PR, it looks like it spent 6 minutes in the 'calculate docker image' step. If I look inside there, it looks like it's building the image. I guess that's automated such that it would build a new image automatically if it detected a change in the build scripts (e.g. if requirements.txt got updated), but otherwise it hits the cache and skips the build?

wconstab commented 1 month ago

Re-running the job now to observe how it behaves on the second run.

wconstab commented 1 month ago

On rerun, I do see the 'docker build' step go down from 6 min to 1 sec, so that's great!

But the docker pull step is taking over 3 min; yesterday I was seeing roughly 1m30s for the PyTorch base image you had sent me. I wonder if this is related to the Docker cache: should we expect the pull time to decrease once the same runner is used a second time and its cache is warm? If so, then I think this will be alright.

huydhn commented 1 month ago

> Looking at the CI results from this PR, it looks like it spent 6 minutes in the 'calculate docker image' step. If I look inside there, it looks like it's building the image. I guess that's automated such that it would build a new image automatically if it detected a change in the build scripts (e.g. if requirements.txt got updated), but otherwise it hits the cache and skips the build?

Yup, you're right.
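The behaviour confirmed here (rebuild only when the build scripts change, otherwise reuse the existing image) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual CI implementation: `build_context_tag`, `needs_rebuild`, and the `published_tags` collection are hypothetical names, and the real workflow may derive its image tag differently.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def build_context_tag(context_dir: str) -> str:
    """Derive a deterministic tag from every file under the build context.

    Hypothetical helper: any change to a file name or its contents
    (e.g. requirements.txt) produces a different tag.
    """
    digest = hashlib.sha256()
    root = Path(context_dir)
    # Sort so the result does not depend on filesystem iteration order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def needs_rebuild(context_dir: str, published_tags: Iterable[str]) -> bool:
    """Return True when no already-published image matches the current context.

    `published_tags` stands in for whatever the image registry reports;
    it is an assumption made for this sketch.
    """
    return build_context_tag(context_dir) not in set(published_tags)
```

With this scheme, the first run on a changed build context pays the full build cost, and later runs with an unchanged context find the tag already published and skip the build, which matches the drop from 6 min to 1 sec seen on the rerun.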