mihirkatare / DeepMEM

Deep Learning Implementations for Sustainable Matrix Element Method Calculations [ IRIS-HEP Fellowship ]
Apache License 2.0

Specify requirements for DGX environment #13

Open matthewfeickert opened 3 years ago

matthewfeickert commented 3 years ago

If the current environment is installed on DGX in a clean virtual environment

https://github.com/mihirkatare/DeepMEM/blob/61e5d7ef9e9f097f13ee4e98d55a6611d76cd4c4/requirements.txt#L5

then the user will get torch v1.8.1 — that's fine by itself. However, if the user checks the compatibility with the GPUs available

$ curl -sL https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/main/torch_detect_GPU.py | python
PyTorch build CUDA version: 10.2
PyTorch build cuDNN version: 7605
PyTorch build NCCL version: 2708

Number of GPUs found on system: 8
PyTorch has active GPU: True
/raid/projects/feickert/.pyenv/versions/DeepMEM-dev/lib/python3.9/site-packages/torch/cuda/__init__.py:106: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Active GPU index: 0
Active GPU name: A100-SXM4-40GB
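The warning above comes from PyTorch comparing the device's compute capability against the architecture list the wheel was compiled for. A minimal sketch of that check (simplified and illustrative — `is_supported` is not PyTorch's actual API; the real check lives in `torch/cuda/__init__.py`):

```python
def is_supported(capability, arch_list):
    """Return True if a device's compute capability is in the wheel's arch list.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100
    arch_list: architectures the wheel was built for, e.g. ["sm_37", "sm_70"]
    """
    return f"sm_{capability[0]}{capability[1]}" in arch_list


# The CUDA 10.2 wheel above supports sm_37, sm_50, sm_60, sm_70.
archs_cu102 = ["sm_37", "sm_50", "sm_60", "sm_70"]

# An A100 reports compute capability (8, 0), i.e. sm_80 -> not supported,
# which is what triggers the UserWarning.
print(is_supported((8, 0), archs_cu102))  # False
# A V100 (sm_70) would have been fine with the same wheel.
print(is_supported((7, 0), archs_cu102))  # True
```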

To get a PyTorch release that is compatible with the A100, the user needs to install one of the custom wheels that PyTorch hosts, built against CUDA 11

$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
$ python -m pip install --upgrade torch==1.9.0+cu111 --find-links https://download.pytorch.org/whl/torch_stable.html
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.9.0+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp39-cp39-linux_x86_64.whl (2041.4 MB)
     |████████████████████████████████| 2041.4 MB 20 kB/s 
Requirement already satisfied: typing-extensions in /raid/projects/feickert/.pyenv/versions/3.9.6/envs/DeepMEM-dev/lib/python3.9/site-packages (from torch==1.9.0+cu111) (3.10.0.0)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0
    Uninstalling torch-1.9.0:
      Successfully uninstalled torch-1.9.0
Successfully installed torch-1.9.0+cu111
$ curl -sL https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/main/torch_detect_GPU.py | python
PyTorch build CUDA version: 11.1
PyTorch build cuDNN version: 8005
PyTorch build NCCL version: 2708

Number of GPUs found on system: 8
PyTorch has active GPU: True

Active GPU index: 0
Active GPU name: A100-SXM4-40GB

None of that is a big problem, but it might require either a dgx-requirements.txt or some instructions for the user to manually resolve things.
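As a sketch, a hypothetical dgx-requirements.txt could mirror the manual install command above (the file name and the exact pin are assumptions, not something the repo currently has):

```text
# dgx-requirements.txt (hypothetical): pin a CUDA 11 build for the DGX A100s
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
```

Users on the DGX would then run `python -m pip install -r dgx-requirements.txt` instead of the generic requirements.txt.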

mihirkatare commented 3 years ago

Yep! I had to manually install the correct versions of PyTorch and CUDA to fix this, but it slipped my mind to update requirements.txt. Since the project is stable on the newer version, I can just update requirements.txt itself.

matthewfeickert commented 3 years ago

Since the project is stable in the newer version, I can just update the requirements.txt itself.

So the tricky part here is that we should avoid locking the general requirements.txt to a specific machine. Since torch==1.9.0+cu111 makes assumptions about the CUDA libraries that are available, we will probably need machine-specific environment files (e.g. dgx-requirements.txt or dgx-env.yml). Optionally we could write setup scripts that inspect the user's environment for them, but in my personal experience (https://github.com/matthewfeickert/nvidia-gpu-ml-library-test) this gets tedious and is hard to do robustly.

This will eventually become more apparent as we make a deepmem library, at which point we'll have library dependencies vs. runtime application dependencies (cf. https://caremad.io/posts/2013/07/setup-vs-requirement/). The library dependencies will give us the minimum required APIs for things to work as expected (e.g. torch>=1.8.0), while our runtime dependencies will define the actual "application" environment (e.g. torch==1.9.0+cu111).
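As a sketch of that split (illustrative only — the section layout follows setuptools conventions, and the specific pins are assumptions):

```text
# setup.cfg (library dependencies): the minimum APIs deepmem needs
[options]
install_requires =
    torch>=1.8.0

# requirements.txt (application dependencies): the concrete deployed environment
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
```

The library stays installable anywhere torch>=1.8.0 is satisfiable, while the application pin reproduces one known-good environment exactly.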

At the moment we've basically been treating everything as application dependencies.