pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

[2.5 release] GPU docker image failed to run mnist test #8089

Closed ManfeiBai closed 3 days ago

ManfeiBai commented 1 month ago

🐛 Bug

The newly built GPU docker image for PyTorch/XLA 2.5 (r2.5 branch) passes `import torch_xla` and passes `PJRT_DEVICE=CPU python test/test_train_mp_mnist.py`, but fails the mnist test with `PJRT_DEVICE=CUDA`: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a

To Reproduce

Steps to reproduce the behavior:

  1. get a GPU
  2. create a new docker container with testing GPU docker image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash:
    • cmd: sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash
  3. install PyTorch/XLA repo:
    • cmd: git clone -b r2.5 https://github.com/pytorch/xla.git
  4. change path to PyTorch/XLA repo:
    • cmd: cd xla
  5. run mnist test with PJRT_DEVICE=CUDA:
    • cmd (any of the following fails):
      GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
      GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
      GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2
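The steps above can be sketched as a single dry-run script. By default each command is only printed; set RUN=1 to actually execute. The image tag, branch, and test invocation are copied from the report; the shorter container name is an assumption for readability.

```shell
#!/bin/sh
# Dry-run sketch of reproduction steps 2-5. Set RUN=1 to actually execute.
IMAGE=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1

run() {
    # Print (and log) each command; execute it only when RUN=1.
    echo "+ $*" | tee -a repro_cmds.log
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Step 2: start a container from the release-candidate image
# (container name shortened here; the report used netnenewnewnewr25py39).
run sudo docker run --shm-size=16G --gpus all --name r25py39 \
    --network host -it -d "$IMAGE" bin/bash
# Step 3: clone the r2.5 branch of PyTorch/XLA
run git clone -b r2.5 https://github.com/pytorch/xla.git
# Step 4: enter the repo (note: cd inside run() does not persist;
# shown only to mirror the step order)
run cd xla
# Step 5: run the mnist test with the CUDA PJRT device
run env GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
```

In a dry run the script just lists the commands in order, which makes it easy to diff against what was actually executed when the failure was hit.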

Environment

JackCaoG commented 1 month ago

is our GPU CI in 2.5 branch passing?

ManfeiBai commented 1 month ago

> is our GPU CI in 2.5 branch passing?

According to CI as of Oct 15 (https://github.com/pytorch/xla/commits/r2.5/), GPU CI on the 2.5 branch is passing now.

Confirming locally too, with sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc9_3.9_cuda_12.1 bin/bash.

ManfeiBai commented 3 days ago

Confirmed locally with a newer version of the GPU dependencies.