🐛 [BUG] NequIP running problem on A100 machine

QuantumMisaka commented 1 year ago

Describe the bug When I run NequIP, installed with pip, on an A100 machine, the following error occurs:

NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

To Reproduce Run nequip-train config/minimal.yaml on an A100 machine.

Expected behavior Both config/minimal.yaml and config/example.yaml run without errors.

Environment (please complete the following information):

- OS: Fedora 32
- python version: 3.10.11
- python environment (commands are given for python interpreter):
  - nequip version: 0.5.6
  - e3nn version: 0.5.1
  - pytorch version: 1.12.1
- (if relevant) GPU support with CUDA
  - cuda version according to PyTorch: 11.7

Additional context When I ran pip install --upgrade torch, which updated torch to 2.0.0, the problem seemed to be solved and the example ran properly.
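For readers hitting the same message, here is a minimal sketch of how one might check whether an installed PyTorch build was compiled for the local GPU. The helper name `build_supports_device` is hypothetical (it is not part of PyTorch or NequIP), and the same-major-version rule it applies is a simplification of NVIDIA's binary compatibility rules, assumed here for illustration. The live check uses `torch.cuda.get_device_capability`, `torch.cuda.get_arch_list`, and `torch.version.cuda`.

```python
import re


def build_supports_device(capability, arch_list):
    """Return True if any compiled sm_XY target shares the device's major version.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list: strings such as "sm_70" or "compute_70".
    Assumption: same-major sm_ targets are treated as compatible (simplified rule).
    """
    major, _minor = capability
    compiled_majors = set()
    for arch in arch_list:
        match = re.match(r"sm_(\d+)", arch)
        if match:
            # "sm_80" -> 80 -> major version 8
            compiled_majors.add(int(match.group(1)) // 10)
    return major in compiled_majors


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a visible CUDA device is required for the live check.")
    else:
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", capability)
        print("compiled targets:", arch_list)
        print("supported:", build_supports_device(capability, arch_list))
```

With the values from the error message above, `build_supports_device((8, 0), ["sm_37", "sm_50", "sm_60", "sm_70"])` returns `False`, which matches the reported incompatibility.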

Linux-cpp-lisp commented 1 year ago

Hi @QuantumMisaka,

You should install PyTorch 1.11.0 with CUDA 11.*, see https://pytorch.org/get-started/previous-versions/#v1110.

QuantumMisaka commented 1 year ago

Hi @Linux-cpp-lisp, I know that the problem mainly lies in CUDA and that the CUDA version should be 11.*. However, something is still strange. I used pip install <source-code> to install directly from source into a newly created conda-based Python env. According to the output, cudatoolkit-11.7 was installed, but the error was still there after I ran pip install --upgrade torch.

Could the problem lie in the pip install process?

Linux-cpp-lisp commented 1 year ago

What is here?

For what it's worth, I always install PyTorch itself with conda rather than pip, but not necessarily because pip is wrong...

QuantumMisaka commented 1 year ago

PyTorch 1.12.1 with cudatoolkit-11.6, installed by conda, runs NequIP successfully. The problem seems to be in the pip installation.
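One way to compare a pip install with a conda install is to look at the build tag in the PyTorch version string. The sketch below assumes the string carries a local tag such as `+cu116`, which pip wheels often do. The tag may be absent (for example in conda builds), in which case `torch.version.cuda` is the more reliable source. The helper name `cuda_build_tag` and the example version strings are hypothetical.

```python
import re


def cuda_build_tag(version_string):
    """Return the local build tag after '+', e.g. 'cu116', or None if absent."""
    match = re.search(r"\+(\w+)$", version_string)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    for version in ("1.12.1+cu116", "1.12.1"):
        print(version, "->", cuda_build_tag(version))
```

In a live environment, the same function can be applied to `torch.__version__` and the result compared with `torch.version.cuda`.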

Linux-cpp-lisp commented 1 year ago

Great, glad to hear!

Please note that we generally only recommend PyTorch 1.11 right now due to the PyTorch bug described in: https://github.com/mir-group/nequip/discussions/311#discussioncomment-5231630.

If you see behavior like this, it can be resolved by switching to 1.11; if you don't, please consider commenting on that issue so we can get a better sense of the scope of this problem and how it might be mitigated. Thanks!

mir-group / nequip

🐛 [BUG] NequIP running problem on A100 machine #330