🐛 Bug

I followed the instructions provided in the README.md file and the GPU instructions provided in this file, but failed to properly import the torch_xla module. I finally managed to fix the issue via a hacky approach; I will share the steps below.
To Reproduce
Environment
Machine: A3-high
Image project: ml-images
Image family: tf-2-15-gpu-debian-11
Python runtime: Python 3.10 (a new conda environment is created for this)
CUDA: 12.1
NCCL: 2.20.5
Followed these steps, then installed the required libraries mentioned in here and in here, and added the env variables.

Running a simple import:

```
python -c "import torch_xla as xla; import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices())"
```

Fails with:
```
File "/opt/conda/envs/pyxla/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
  import _XLAC
ImportError: libpython3.10.so.1.0: cannot open shared object file: No such file or directory
```
Copying the libpython3.10.so.1.0 file to /usr/lib fixed that error.
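Rather than guessing where libpython lives, the interpreter can report its own library directory. This is a stdlib-only sketch, not part of torch_xla; the exact path varies by environment (in a conda env it is typically `$CONDA_PREFIX/lib`):

```python
import sysconfig

# LIBDIR is the directory where this interpreter's libpython
# (e.g. libpython3.10.so.1.0) is installed; may be None on some platforms
libdir = sysconfig.get_config_var("LIBDIR")
print(libdir)
```

The printed directory is the source to copy from (or to put on the loader path) instead of hunting for the file by hand.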
But then I ran into a dependency error for numpy, which was not mentioned in any of the requirements. After installing the numpy package, I managed to successfully fetch the CUDA devices:
```
python -c "import torch_xla as xla; import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices())"
# ['xla:0', 'xla:1', 'xla:2', 'xla:3', 'xla:4', 'xla:5', 'xla:6', 'xla:7']
```
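A preflight check like the following could surface missing import-time dependencies such as numpy before torch_xla fails; the helper and the module list are my own sketch, not anything shipped with torch_xla:

```python
import importlib.util

def missing(modules):
    # Return the names from `modules` that cannot be imported
    # in the current environment
    return [m for m in modules if importlib.util.find_spec(m) is None]

# numpy turned out to be required at torch_xla import time
print(missing(["numpy"]))
```

An empty list means all listed modules are importable; any names printed need a `pip install` first.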
In my opinion, this can be simplified, and some of these setup steps could be added to the torch-xla setup.
If you are in a conda environment, you may need to add `LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib:/usr/local/lib:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib"` to the command line.
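As an alternative to copying the library into /usr/lib, the same effect can be had by extending the loader search path for the session. A minimal sketch, assuming an active conda environment that sets `$CONDA_PREFIX`:

```shell
# Append the system lib dirs and the conda env's lib dir, where
# libpython3.10.so.1.0 normally lives, to the dynamic loader path
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib:/usr/local/lib:${CONDA_PREFIX}/lib"
```

This only affects the current shell; it would need to go into the env's activation script (or ~/.bashrc) to persist.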