pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

CUDA and GPU-Flavoured Docker/Container Image Missing CUDA Support #7689

Open stellarpower opened 1 month ago

stellarpower commented 1 month ago

❓ Questions and Help

Hi,

According to the docs here, the image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1 should have CUDA 12.1 support for use on a local GPU. I have also tried pulling xla:nightly_3.8_cuda_12.1.

When I start the container (podman run --shm-size=16g --net=host --gpus all us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1), it appears there is no CUDA support compiled in:

# nvidia-smi
bash: nvidia-smi: command not found
# python
>>> import torch, torch_xla
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 414, in get_device_name
    return get_device_properties(device).name
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
>>> print(torch.__version__)
2.3.0 # No CUDA suffix here
>>> print(torch_xla.__version__)
2.3.0 # Or here

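(As a sanity check, a CPU-only wheel can also be identified without triggering the assertion above; a minimal probe using only stock torch APIs:)

import torch

# torch.version.cuda is None on a CPU-only wheel and a string like "12.1"
# on a CUDA build, regardless of whether a GPU is visible at runtime.
print(torch.version.cuda)

# Reports whether this torch binary was compiled with CUDA support at all,
# independent of driver/GPU availability in the container.
print(torch.backends.cuda.is_built())
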
Am I missing something here, or has something gone wrong with these CI builds?

Thanks

JackCaoG commented 1 month ago

@vanbasten23 do you know if we install the CUDA version of torch or the CPU version by default?
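
(For reference, a minimal sketch of probing the torch_xla side inside the container, assuming a CUDA-enabled build is supposed to be present; PJRT_DEVICE=CUDA is the documented selector for the XLA:GPU backend:)

import os
os.environ["PJRT_DEVICE"] = "CUDA"  # must be set before the XLA runtime initializes

import torch_xla.core.xla_model as xm

# On a CUDA-enabled torch_xla build with a visible GPU this prints an XLA
# device such as xla:0; on a CPU-only build it errors or falls back to CPU.
print(xm.xla_device())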