ManfeiBai closed this issue 3 days ago.
Is our GPU CI on the 2.5 branch passing?
According to the CI as of Oct 15 (https://github.com/pytorch/xla/commits/r2.5/), the GPU CI on the 2.5 branch is passing now.
Confirming locally as well with `sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc9_3.9_cuda_12.1 bin/bash`.
Confirmed locally with the newer GPU dependency versions.
🐛 Bug
The newly built GPU docker image for PyTorch/XLA 2.5 from the `r2.5` branch passed `import torch_xla` and passed `PJRT_DEVICE=CPU python test/test_train_mp_mnist.py`, but failed at the MNIST test with `PJRT_DEVICE=CUDA`: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a8

8 GPU: `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`
`Failed to shut down the distributed runtime client. torch_xla/csrc/runtime/xla_coordinator.cc:48 : Check failed: dist_runtime_client_->Shutdown().ok()`
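To isolate the multi-GPU shutdown failure from the MNIST script itself, a minimal multi-process check along these lines could be used. This is only a sketch: it assumes the same `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA` environment, uses `xmp.spawn` directly, and is not the exact code path of `test_train_mp_mnist.py`; the file name is hypothetical.

```python
# minimal_spawn_check.py -- hypothetical helper, not part of the repo.
# Run as: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python minimal_spawn_check.py
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each spawned process gets its own XLA (CUDA) device.
    device = xm.xla_device()
    t = torch.ones(2, 2, device=device) * (index + 1)
    xm.mark_step()  # force execution on the device
    print(f"process {index}: device={device}, sum={t.sum().item()}")


if __name__ == "__main__":
    # If the coordinator shutdown failure is independent of MNIST,
    # it should reproduce when these processes exit.
    xmp.spawn(_mp_fn, args=())
```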
1 GPU: `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`
`RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.`
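The single-GPU `DNN library initialization failed` error points at the cuDNN setup rather than the test itself. A quick check of the CUDA/cuDNN stack visible inside the container might look like the sketch below: plain PyTorch first, then one convolution on the XLA:CUDA device. This assumes the `torch` wheel in the image was built with CUDA support (which may not hold for the XLA images), and the file name is hypothetical.

```python
# dnn_check.py -- hypothetical diagnostic, not part of the repo.
# Run as: PJRT_DEVICE=CUDA python dnn_check.py
import torch
import torch_xla.core.xla_model as xm

# What does plain PyTorch see? (May report False/None if the wheel is CPU-only.)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.backends.cudnn.version():", torch.backends.cudnn.version())

# One convolution on the XLA:CUDA device, since the MNIST model's conv
# layers are a likely trigger for DNN (cuDNN) initialization.
device = xm.xla_device()
conv = torch.nn.Conv2d(1, 8, kernel_size=3).to(device)
out = conv(torch.randn(1, 1, 28, 28, device=device))
xm.mark_step()
print("conv output shape on", device, ":", tuple(out.shape))
```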
To Reproduce
Steps to reproduce the behavior:
1. Start the GPU docker container: `sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash`
2. `git clone -b r2.5 https://github.com/pytorch/xla.git`
3. `cd xla`
4. Run the MNIST test with `PJRT_DEVICE=CUDA`: `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`, or `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`, or `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2`
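For the environment section below, a minimal sketch of how the relevant versions could be collected inside the container. The script name is hypothetical, and `nvidia-smi` being on the PATH in this image is an assumption.

```python
# env_report.py -- hypothetical helper for filling in the environment section.
import subprocess

import torch
import torch_xla

print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)
print("torch built against CUDA:", torch.version.cuda)

# Driver / GPU info, if nvidia-smi is present in the container.
try:
    print(subprocess.check_output(["nvidia-smi"], text=True))
except (OSError, subprocess.CalledProcessError) as exc:
    print("nvidia-smi not available:", exc)
```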
Environment