ManfeiBai closed this issue 3 days ago.
Is our GPU CI on the 2.5 branch passing?
According to the CI as of Oct 15 (https://github.com/pytorch/xla/commits/r2.5/), the GPU CI on the 2.5 branch is passing now.
Confirming locally as well with `sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc9_3.9_cuda_12.1 bin/bash`.
Confirmed locally with the newer GPU dependency versions.
🐛 Bug
The newly built GPU docker image for PyTorch/XLA 2.5 from the `r2.5` branch passed `import torch_xla` and passed `PJRT_DEVICE=CPU python test/test_train_mp_mnist.py`, but failed at the MNIST test with `PJRT_DEVICE=CUDA`: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a8

8 GPU: `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`
`Failed to shut down the distributed runtime client. torch_xla/csrc/runtime/xla_coordinator.cc:48 : Check failed: dist_runtime_client_->Shutdown().ok()`
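To isolate the multi-GPU shutdown failure from the MNIST script itself, a minimal multi-process check along these lines could be used. This is only a sketch: it assumes the same `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA` environment, uses `xmp.spawn` directly, and is not the exact code path of `test_train_mp_mnist.py`; the file name is hypothetical.

```python
# minimal_spawn_check.py -- hypothetical helper, not part of the repo.
# Run as: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python minimal_spawn_check.py
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each spawned process gets its own XLA (CUDA) device.
    device = xm.xla_device()
    t = torch.ones(2, 2, device=device) * (index + 1)
    xm.mark_step()  # force execution on the device
    print(f"process {index}: device={device}, sum={t.sum().item()}")


if __name__ == "__main__":
    # If the coordinator shutdown failure is independent of MNIST,
    # it should reproduce when these processes exit.
    xmp.spawn(_mp_fn, args=())
```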
1 GPU: `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`
`RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.`
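The single-GPU `DNN library initialization failed` error points at the cuDNN setup rather than the test itself. A quick check of the CUDA/cuDNN stack visible inside the container might look like the sketch below: plain PyTorch first, then one convolution on the XLA:CUDA device. This assumes the `torch` wheel in the image was built with CUDA support (which may not hold for the XLA images), and the file name is hypothetical.

```python
# dnn_check.py -- hypothetical diagnostic, not part of the repo.
# Run as: PJRT_DEVICE=CUDA python dnn_check.py
import torch
import torch_xla.core.xla_model as xm

# What does plain PyTorch see? (May report False/None if the wheel is CPU-only.)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.backends.cudnn.version():", torch.backends.cudnn.version())

# One convolution on the XLA:CUDA device, since the MNIST model's conv
# layers are a likely trigger for DNN (cuDNN) initialization.
device = xm.xla_device()
conv = torch.nn.Conv2d(1, 8, kernel_size=3).to(device)
out = conv(torch.randn(1, 1, 28, 28, device=device))
xm.mark_step()
print("conv output shape on", device, ":", tuple(out.shape))
```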
To Reproduce
Steps to reproduce the behavior:
1. Start the GPU docker container: `sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash`
2. `git clone -b r2.5 https://github.com/pytorch/xla.git`
3. `cd xla`
4. Run the MNIST test with `PJRT_DEVICE=CUDA`: `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`, or `GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py`, or `GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2`
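For the environment section below, a minimal sketch of how the relevant versions could be collected inside the container. The script name is hypothetical, and `nvidia-smi` being on the PATH in this image is an assumption.

```python
# env_report.py -- hypothetical helper for filling in the environment section.
import subprocess

import torch
import torch_xla

print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)
print("torch built against CUDA:", torch.version.cuda)

# Driver / GPU info, if nvidia-smi is present in the container.
try:
    print(subprocess.check_output(["nvidia-smi"], text=True))
except (OSError, subprocess.CalledProcessError) as exc:
    print("nvidia-smi not available:", exc)
```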
Environment