atalman opened this issue 2 months ago
Alternatively you could just use the existing DockerHub image with cudnn9? Or is that not valid to build/support?
I wasn't aware of existing issues when I saw the CI failure for a PR I'm involved in, but looked into it here: https://github.com/pytorch/pytorch/pull/125632#issuecomment-2097389038
A quick fix is to have the docker matrix generate a versionless cudnn portion of the tag. Presumably nvidia may be taking that approach going forward, so if the version of cudnn does not strictly need to be 8, you could relax the major-version pin in the docker images? (There is no cudnn9 tag for the cuda 12.4 images, only for previous minor versions.)
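To illustrate the suggestion above, here is a minimal sketch of the matrix logic. The helper name and the exact version cut-off are my assumptions, not something stated in the thread: it emits a versionless "cudnn" fragment for 12.4.x (where upstream dropped the major-version suffix) and keeps "cudnn8" otherwise.

```shell
#!/usr/bin/env bash
# Hedged sketch: choose the cudnn fragment of the image tag per CUDA version.
# Assumption: 12.4.x upstream images publish "cudnn" with no major-version suffix.
cudnn_tag_fragment() {
  case "$1" in
    12.4*) echo "cudnn"  ;;  # e.g. nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
    *)     echo "cudnn8" ;;  # e.g. nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
  esac
}

cuda_version="12.4.1"
echo "nvidia/cuda:${cuda_version}-$(cudnn_tag_fragment "$cuda_version")-devel-ubuntu22.04"
```

This keeps one code path in the matrix rather than special-casing individual tags.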
Otherwise, won't you need to build (or republish) all the nvidia images being used from DockerHub? The CI is failing specifically because it's trying to pull an invalid nvidia/cuda tag that you request:
--build-arg BASE_IMAGE=nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04
So you need to avoid building that tag in the docker matrix and separately build/publish your AWS image, or, as I've suggested, add the logic to select the appropriate nvidia/cuda 12.4 image: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
For cuda 12.2 with cudnn Dockerfile: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/12.2.2/ubuntu2004/devel?ref_type=heads
For cuda 12.4 with no cudnn: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/12.4.0/ubuntu2004/devel?ref_type=heads
For cuda 12.4.1 with cudnn (9):
Build Nvidia docker image: cuda:12.4.0-cudnn8-devel-ubuntu22.04
See reference issue here: https://gitlab.com/nvidia/container-images/cuda/-/issues/225
Upload to pytorch aws so this workflow can be fixed: pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617
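The build-and-upload step above could be sketched as mirroring the valid upstream tag into the project's registry. This is only a hedged illustration: ECR_REGISTRY is a placeholder I introduced (the real pytorch AWS registry is not given in this thread), and the commands are printed as a dry run.

```shell
#!/usr/bin/env bash
# Hedged sketch: mirror the valid upstream NVIDIA tag into a private registry
# so CI stops depending on Docker Hub publishing a cudnn8 tag for CUDA 12.4.
# ECR_REGISTRY is a hypothetical placeholder, not the actual pytorch registry.
SRC="nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04"
ECR_REGISTRY="${ECR_REGISTRY:-example.dkr.ecr.us-east-1.amazonaws.com}"
DST="${ECR_REGISTRY}/${SRC#nvidia/}"   # strip the "nvidia/" namespace prefix

# Dry run: print the commands; drop the echo to actually mirror the image.
echo docker pull "$SRC"
echo docker tag "$SRC" "$DST"
echo docker push "$DST"
```

Mirroring the image rather than rebuilding it keeps the CI base identical to upstream and avoids maintaining a separate Dockerfile.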