atalman opened this issue 2 months ago
Alternatively you could just use the existing DockerHub image with cudnn9? Or is that not valid to build/support?
I wasn't aware of existing issues when I saw the CI failure for a PR I'm involved in, but looked into it here: https://github.com/pytorch/pytorch/pull/125632#issuecomment-2097389038
A quick fix is to have the docker matrix generate a versionless cudnn portion of the tag. Presumably nvidia may be taking that approach going forward, so if the version of cudnn does not strictly need to be 8, you could relax the major-version pin in the docker images? (There is no cudnn9 tag for the cuda 12.4 images, only for previous minor versions.)
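To illustrate the suggestion above, here is a minimal sketch of the matrix logic. The helper name and the exact version cut-off are my assumptions, not something stated in the thread: it emits a versionless "cudnn" fragment for 12.4.x (where upstream dropped the major-version suffix) and keeps "cudnn8" otherwise.

```shell
#!/usr/bin/env bash
# Hedged sketch: choose the cudnn fragment of the image tag per CUDA version.
# Assumption: 12.4.x upstream images publish "cudnn" with no major-version suffix.
cudnn_tag_fragment() {
  case "$1" in
    12.4*) echo "cudnn"  ;;  # e.g. nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
    *)     echo "cudnn8" ;;  # e.g. nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
  esac
}

cuda_version="12.4.1"
echo "nvidia/cuda:${cuda_version}-$(cudnn_tag_fragment "$cuda_version")-devel-ubuntu22.04"
```

This keeps one code path in the matrix rather than special-casing individual tags.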
Otherwise, won't you need to build (or republish) all the nvidia images being used from DockerHub? The CI is failing specifically because it's trying to pull an invalid nvidia/cuda tag that you request:
--build-arg BASE_IMAGE=nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04
So you need to avoid building that tag in the docker matrix and separately build/publish your AWS image, or, as I've suggested, add the logic to select the appropriate nvidia/cuda 12.4 image: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
For cuda 12.2 with cudnn Dockerfile: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/12.2.2/ubuntu2004/devel?ref_type=heads
For cuda 12.4 with no cudnn: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/12.4.0/ubuntu2004/devel?ref_type=heads
For cuda 12.4.1 with cudnn (9):
Build Nvidia docker image: cuda:12.4.0-cudnn8-devel-ubuntu22.04
See reference issue here: https://gitlab.com/nvidia/container-images/cuda/-/issues/225
Upload to pytorch aws so this workflow can be fixed: pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617
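The build-and-upload step above could be sketched as mirroring the valid upstream tag into the project's registry. This is only a hedged illustration: ECR_REGISTRY is a placeholder I introduced (the real pytorch AWS registry is not given in this thread), and the commands are printed as a dry run.

```shell
#!/usr/bin/env bash
# Hedged sketch: mirror the valid upstream NVIDIA tag into a private registry
# so CI stops depending on Docker Hub publishing a cudnn8 tag for CUDA 12.4.
# ECR_REGISTRY is a hypothetical placeholder, not the actual pytorch registry.
SRC="nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04"
ECR_REGISTRY="${ECR_REGISTRY:-example.dkr.ecr.us-east-1.amazonaws.com}"
DST="${ECR_REGISTRY}/${SRC#nvidia/}"   # strip the "nvidia/" namespace prefix

# Dry run: print the commands; drop the echo to actually mirror the image.
echo docker pull "$SRC"
echo docker tag "$SRC" "$DST"
echo docker push "$DST"
```

Mirroring the image rather than rebuilding it keeps the CI base identical to upstream and avoids maintaining a separate Dockerfile.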