pytorch compatibility with NVIDIA A100

abdush commented 2 years ago

Hello,

I am facing an issue when trying to run the bio_embeddings library on an NVIDIA A100 GPU.

python
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'10.2'
>>> torch.cuda.is_available()
True
>>> torch.zeros(1).cuda()
/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning: 
NVIDIA A100 80GB PCIe with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100 80GB PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor.py", line 203, in __repr__
    return torch._tensor_str._str(self)
  File "/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor_str.py", line 406, in _str
    return _str_intern(self)
  File "/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor_str.py", line 381, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor_str.py", line 242, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/users/ahussein/conda/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor_str.py", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
>>> exit()

I searched and found some answers on Stack Overflow where PyTorch had to be reinstalled with a newer CUDA version (11.x instead of 10.2). However, I am wondering whether it is safe to update the PyTorch library that was installed as part of bio_embeddings, and what impact that might have on other installed libraries and dependencies. Also, is there a recommended way to update PyTorch (conda vs. pip)? If this is a common problem, would it be a good idea to ship the fix with an updated PyTorch in the bio_embeddings library?

The bio_embeddings library I use was installed with the pip install command.

Thanks in advance.

abdush commented 2 years ago

Updating the installed torch to a CUDA 11.1 build fixed the issue. I used the pip command below, taken from the linked previous-versions page.

pip install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

https://pytorch.org/get-started/previous-versions/#v191
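For anyone who wants to confirm that a reinstalled build covers their GPU before running a full pipeline, here is a minimal sketch of a check. The helper names `build_supports_gpu` and `check_current_device` are made up for illustration and are not part of torch or bio_embeddings. It assumes the usual CUDA binary compatibility rule (a kernel built for `sm_XY` runs on a device with the same major version and an equal or higher minor version) and ignores PTX entries such as `compute_80`, so it is stricter than what the driver may accept through JIT compilation.

```python
import re


def build_supports_gpu(capability, arch_list):
    """Return True if any sm_XY entry in arch_list can run on the given device.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_70", as returned by torch.cuda.get_arch_list().
    Simplification (assumption): only plain sm_XY entries are considered;
    PTX entries like "compute_80" and suffixed ones like "sm_90a" are skipped.
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)", arch)
        if match is None:
            continue
        arch_major, arch_minor = int(match.group(1)), int(match.group(2))
        # Same major version, and built for an equal or lower minor version.
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def check_current_device():
    """Print whether the installed torch build has kernels for GPU 0."""
    import torch  # imported here so the helper above works without torch

    if not torch.cuda.is_available():
        print("CUDA is not available in this environment.")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    verdict = "supported" if build_supports_gpu(capability, arch_list) else "NOT supported"
    print(f"{torch.cuda.get_device_name(0)} (sm_{capability[0]}{capability[1]}): {verdict}")
    print("Build targets:", " ".join(arch_list))
```

Calling `check_current_device()` in the environment in question prints the verdict for the first GPU. With the arch list from the warning above (sm_37 sm_50 sm_60 sm_70), `build_supports_gpu((8, 0), ...)` returns False, which matches the "no kernel image" error.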

sacdallago commented 2 years ago

Woho, lucky you ;)

Yeah, the whole CUDA+torch compatibility matrix is a nightmare to keep up with. A student has an open PR on our internal fork of the repo that updates torch, which might make things easier when installing from source.

sacdallago / bio_embeddings

pytorch compatibility with NVIDIA A100 #195