torchmd / torchmd-net

Training neural network potentials
MIT License

Support for CUDA capability sm_90 #209

Closed FranklinHu1 closed 10 months ago

FranklinHu1 commented 1 year ago

Hello,

I am training the equivariant transformer and running dynamics using it on some Nvidia H100 GPUs. Overall, the workflow is going fine. However, I do get the following warning at the start of every training session:

/home/frankhu/mambaforge/envs/torchmd-net/lib/python3.10/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_35 sm_50 sm_60 sm_61 sm_70 sm_75 sm_80 sm_86 compute_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

I am currently using CUDA 12.1, and according to the PyTorch website, CUDA 12.1 is only supported by the nightly version of PyTorch, not by the stable 2.0.* versions specified in the environment.yml file. Has torchmd-net been tested with this newer version of PyTorch that supports CUDA 12.1? If so, would it be safe to upgrade to the nightly version of PyTorch without breaking the code?

Thank you!

RaulPPelaez commented 1 year ago

I have successfully tested most functionality with PyTorch nightly: most tests run, and all that run pass. Depending on what exactly you need, though, you might run into dependency issues, since the releases of some dependencies are not prepared to live outside conda-forge (e.g. NNPOps, although you can install it from source and it is compatible with CUDA 12/torch-nightly AFAIK). I have gotten away with running most tests by installing every dependency that allows it with pip and compiling NNPOps manually.

RaulPPelaez commented 1 year ago

This environment works for me at the moment of writing:

name: torchmd-net
channels:
  - nvidia
  - pytorch-nightly
  - conda-forge 
dependencies:
  - python<=3.11
  - pip
  - cmake
  - pytorch-cuda=12.1 
  - cuda-toolkit=12.1
  - cuda-compiler=12.1 
  - gxx<12
  - pytorch
  - torchvision
  - torchaudio
  - ninja
  - pip:
       - torch-cluster==1.6.1
       - torch-geometric==2.3.1
       - torch-scatter==2.1.1
       - torch-sparse==0.6.17
       - pytorch-lightning==1.6.3
       - torchmetrics==0.11.4
       - tqdm
       - pytest
       - psutil
       - matplotlib
       - h5py
       - torchani==2.2.3

Create it with:

$ mamba create -f environment.yml

This contains everything required to compile NNPOps too:

$ mamba activate torchmd-net
$ git clone https://github.com/openmm/NNPOps
$ cd NNPOps
$ mkdir build && cd build
$ sed -i 's+14+17+g' ../CMakeLists.txt # PyTorch nightly requires C++17; note that this replaces every "14" in the file with "17"
$ Torch_DIR=$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')  cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j5 all install
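
Before moving on, it may be worth confirming that the packages are importable from the active environment. This is only a minimal sketch using the standard library: `missing_modules` is a hypothetical helper, and the module names `torch`, `NNPOps` and `torchmdnet` are my assumption of the import names, so adjust them to your setup.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    find_spec() only locates the module; it does not import it, so this
    check is cheap and does not initialise CUDA.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; not taken from the thread.
    wanted = ["torch", "NNPOps", "torchmdnet"]
    missing = missing_modules(wanted)
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All of", ", ".join(wanted), "can be found.")
```

An empty result only says the modules can be located, not that their compiled extensions load correctly; the test suite remains the real check.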

After this you can go back to torchmd-net/tests and try:

$ pytest test*py

All tests pass on our systems.

RaulPPelaez commented 10 months ago

TorchMD-Net and PyTorch are now built for CUDA 12 on conda-forge, and the builds include sm_90. Closing this.
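
To check whether a given PyTorch install ships kernels for your GPU, something along these lines should work. It is a rough sketch, not PyTorch's own check: `has_native_kernels` is a hypothetical helper that looks only at the `sm_XY` entries reported by `torch.cuda.get_arch_list()` and assumes the usual CUDA rule that a binary built for `sm_XY` runs on devices of the same major version with an equal or higher minor version. PTX entries such as `compute_86` are ignored, as the warning above appears to do.

```python
import re


def has_native_kernels(capability, arch_list):
    """Return True if arch_list contains an sm_XY binary usable on the device.

    capability is a (major, minor) tuple such as (9, 0) for an H100.
    arch_list is a list of strings such as ["sm_80", "sm_86", "compute_86"].
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
        if match is None:
            continue  # skip PTX-only entries such as compute_86
        sm = int(match.group(1))
        # Same major version, and built for an equal or lower minor version.
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", cap, "arch list:", archs)
        print("native kernels available:", has_native_kernels(cap, archs))
```

With the arch list from the original warning, the helper returns False for an H100 at capability (9, 0) and returns True once `sm_90` is part of the list.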