pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Pytorch 2.4, Cuda 12.4: RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1720538439675/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=, num_gpus= #131650

Open DKchemistry opened 3 months ago

DKchemistry commented 3 months ago

🐛 Describe the bug

Hi PyTorch community, I have been troubleshooting this for a few days but can't seem to fix it.

Essentially, I want to be able to get the device_name or other properties to dynamically allocate GPUs to a particular job.

I have tried various builds, but they all fail at that device-property call. I can get tensors onto the GPU, especially by setting CUDA_VISIBLE_DEVICES, but a programmatic solution would be ideal.

Running this:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    try:
        torch.cuda.init()
        print(f"CUDA initialized successfully")
        print(f"Number of CUDA devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    except Exception as e:
        print(f"Error initializing CUDA: {e}")

    try:
        device = torch.device("cuda:0")
        x = torch.rand(5, 3).to(device)
        print(f"Successfully created tensor on GPU: {x}")
    except Exception as e:
        print(f"Error creating tensor on GPU: {e}")

Returns:

PyTorch version: 2.4.0
CUDA available: True
CUDA version: 12.4
Error initializing CUDA: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1720538439675/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=, num_gpus=

Here is my nvidia-smi:

Wed Jul 24 16:03:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:31:00.0 Off |                   On |
| N/A   34C    P0             43W /  300W |      53MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:4B:00.0 Off |                   On |
| N/A   33C    P0             46W /  300W |      50MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:B1:00.0 Off |                   On |
| N/A   68C    P0            285W /  300W |    1722MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                   On |
| N/A   68C    P0            275W /  300W |    1636MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA T1000 8GB               Off |   00000000:E3:00.0 Off |                  N/A |
| 38%   43C    P8             N/A /   50W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    3   0   0  |              15MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    3   0   0  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    4   0   1  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   2  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    6   0   3  |              12MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    1   0   0  |             808MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    2   0   1  |             816MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    1   0   0  |             826MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    2   0   1  |             822MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    2    1    0     114770      C   ...odinger2024-1/internal/bin/gdesmond        762MiB |
|    2    2    0     115110      C   ...odinger2024-1/internal/bin/gdesmond        782MiB |
|    3    1    0     114479      C   ...odinger2024-1/internal/bin/gdesmond        780MiB |
|    3    2    0     113481      C   ...odinger2024-1/internal/bin/gdesmond        776MiB |
+-----------------------------------------------------------------------------------------+

I have also tried a different version of PyTorch, which I installed using this command from PyTorch.org:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch-nightly -c nvidia

In that environment I get the same error (though, strangely, it does not report the CUDA version I installed).

PyTorch version: 2.5.0.dev20240720
CUDA available: True
CUDA version: 12.1
Error initializing CUDA: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1721461329009/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=

I have also run the collect_env.py script in both environments. I think the errors are identical, but I will include both either way:

nightly build:

python collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/home/dk/miniconda3/envs/PyTorch/lib/python3.11/site-packages/torch/cuda/__init__.py", line 327, in _lazy_init
    queued_call()
  File "/home/dk/miniconda3/envs/PyTorch/lib/python3.11/site-packages/torch/cuda/__init__.py", line 195, in _check_capability
    capability = get_device_capability(d)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dk/miniconda3/envs/PyTorch/lib/python3.11/site-packages/torch/cuda/__init__.py", line 504, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dk/miniconda3/envs/PyTorch/lib/python3.11/site-packages/torch/cuda/__init__.py", line 522, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1721461329009/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=

"stable" build:

python collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/home/dk/miniconda3/envs/debug_pytorch/lib/python3.12/site-packages/torch/cuda/__init__.py", line 327, in _lazy_init
    queued_call()
  File "/home/dk/miniconda3/envs/debug_pytorch/lib/python3.12/site-packages/torch/cuda/__init__.py", line 195, in _check_capability
    capability = get_device_capability(d)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dk/miniconda3/envs/debug_pytorch/lib/python3.12/site-packages/torch/cuda/__init__.py", line 451, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dk/miniconda3/envs/debug_pytorch/lib/python3.12/site-packages/torch/cuda/__init__.py", line 469, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1720538439675/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=, num_gpus=

What is really strange to me is that I can use the nightly build to run ML code that does indeed go to the GPU. The entire script is probably superfluous for everyone to deal with, so I will just give the relevant parts:

First, I can echo $CUDA_VISIBLE_DEVICES and see nothing is set. Then I can run my code:

python -u progressive_docking_mtl.py -os 10 -bs 256 -num_units 1500 -dropout 0.2 -learn_rate 0.00001 -bin_array 2 -wt 3 -cf -9.236514914296889 -rec 0.9 -n_it 1 -t_mol 65.994673 --data_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl --save_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl -n_mol 462000
Relevant environment variables:
CUDA_VISIBLE_DEVICES: Not set
CUDA_DEVICE_ORDER: Not set
Available GPUs (pynvml):
GPU 0: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 1: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 2: NVIDIA A100 80GB PCIe - Free Memory: 77.56 GB
GPU 3: NVIDIA A100 80GB PCIe - Free Memory: 77.52 GB
GPU 4: NVIDIA T1000 8GB - Free Memory: 7.78 GB
Skipping GPU 4 (NVIDIA T1000 8GB) as it's the T1000.
Selected GPU 1 with 79.09 GB free memory.
Relevant environment variables:
CUDA_VISIBLE_DEVICES: 1
CUDA_DEVICE_ORDER: PCI_BUS_ID
PyTorch CUDA information:
PyTorch CUDA device count: 1
PyTorch current device: 0
PyTorch device name: NVIDIA A100 80GB PCIe MIG 1g.20gb
Using device: cuda
Parsing args...

and see that indeed, my process is on the GPU I want:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0    3    0     781962      C   python                                        296MiB |

Here is that code for reference:

import glob
import gc
import argparse
import os
import random
import sys
import time
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve, roc_curve, auc

def print_env_vars():
    print("Relevant environment variables:")
    for var in ["CUDA_VISIBLE_DEVICES", "CUDA_DEVICE_ORDER"]:
        print(f"{var}: {os.environ.get(var, 'Not set')}")

print_env_vars()

def select_gpu():
    import pynvml

    pynvml.nvmlInit()

    device_count = pynvml.nvmlDeviceGetCount()
    selected_device = None
    max_free_memory = 0

    print("Available GPUs (pynvml):")
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        device_name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(device_name, bytes):
            device_name = device_name.decode("utf-8")

        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        free_memory = mem_info.free / (1024**3)
        print(f"GPU {i}: {device_name} - Free Memory: {free_memory:.2f} GB")

        if "T1000" in device_name:
            print(f"Skipping GPU {i} ({device_name}) as it's the T1000.")
            continue

        if free_memory > max_free_memory:
            selected_device = i
            max_free_memory = free_memory

    pynvml.nvmlShutdown()

    if selected_device is None:
        print("No suitable GPU found. Exiting.")
        sys.exit(1)

    print(f"Selected GPU {selected_device} with {max_free_memory:.2f} GB free memory.")
    return selected_device

# Select GPU and set environment variable
selected_gpu = select_gpu()
os.environ["CUDA_VISIBLE_DEVICES"] = str(selected_gpu)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

print_env_vars()

# Now it's safe to import torch
import torch

print("PyTorch CUDA information:")
if torch.cuda.is_available():
    print(f"PyTorch CUDA device count: {torch.cuda.device_count()}")
    print(f"PyTorch current device: {torch.cuda.current_device()}")
    try:
        print(f"PyTorch device name: {torch.cuda.get_device_name(0)}")
    except Exception as e:
        print(f"Error getting device name: {e}")

    device = torch.device("cuda")
else:
    print(
        "CUDA is not available. Please check your PyTorch installation and GPU setup."
    )
    sys.exit(1)

print(f"Using device: {device}")

I am not super savvy, but I am very confused as to what is going on. I think I am delirious from trying to manage and troubleshoot these environments, so sorry if this makes little sense! It wasn't until I wrote this bug report that I tried my code again and found it working, although it kept giving me issues before. Still, it seems strange that the most basic device query fails, requiring pynvml and all of these os calls to get the GPU allocation I want.

Versions

Same as above: in both the nightly and "stable" environments, python collect_env.py fails while collecting environment information with the identical INTERNAL ASSERT error shown earlier, so full environment output could not be gathered.

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim

malfet commented 3 months ago

Hmm, I can not reproduce it, though I don't have MIGs...

ptrblck commented 3 months ago

First, I can echo $CUDA_VISIBLE_DEVICES and see nothing is set.

This is wrong as you need to set CUDA_VISIBLE_DEVICES to the MIG slice as shown here: https://github.com/pytorch/pytorch/issues/130181#issuecomment-2243466478

DKchemistry commented 3 months ago

First, I can echo $CUDA_VISIBLE_DEVICES and see nothing is set.

This is wrong as you need to set CUDA_VISIBLE_DEVICES to the MIG slice as shown here: #130181 (comment)

@ptrblck

Thank you, this does make sense. I incorrectly thought you could query from PT directly and then set the device as such, but it's not the case.

FWIW, I've had frequent issues in the past (not very helpful without the associated code, I know) with the ordering of using Python's os module to set CUDA_VISIBLE_DEVICES versus the import of torch itself. Is there a reason for this?

DKchemistry commented 3 months ago

Hmm, I can not reproduce it, though I don't have MIGs...

@malfet

Unfortunately, I have no sudo or admin rights to change the MIG partitions and don't have access to a non-MIG machine at the moment. However, does this code work for you without setting CUDA_VISIBLE_DEVICES?

ptrblck commented 3 months ago

FWIW, I've had frequent issues in the past (not very helpful without assoc. code, I know) with the ordering of importing/using python's os module to set a CUDA_VISIBLE_DEVICE and the import of torch itself, is there a reason for this?

We generally do not recommend setting CUDA_VISIBLE_DEVICES inside the script as users need to make sure it's set before any CUDA-enabled library initializes the driver or context. The safe approach would be to export it in your terminal or to set it when launching the python process as seen in my examples.
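For instance, the device selection can live in a small wrapper script that never touches CUDA itself and only sets the variable for the child process that runs the actual job, so the job imports torch with CUDA_VISIBLE_DEVICES already in place. A minimal sketch, not an official recommendation (train.py is a placeholder name, and the MIG UUID is the one from the nvidia-smi -L output later in this thread):

import os
import subprocess
import sys

# Copy the current environment and pin the job to a single MIG instance.
# The UUID comes from `nvidia-smi -L`; substitute your own.
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c"

# The child process imports torch with the variable already set, so only
# that MIG slice is ever visible to its CUDA context.
subprocess.run([sys.executable, "train.py"], env=env, check=True)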

I've created https://github.com/pytorch/pytorch/issues/131667 to improve the error message so thanks for reporting the issue here!

malfet commented 3 months ago

Unfortunately, no sudo or admin rights to change the partitions and don't have access to a non-MIG machine atm. However, does this code work for you without setting CUDA_VISIBLE_DEVICES?

Yes it does:

PyTorch version: 2.4.0
CUDA available: True
CUDA version: 12.4
CUDA initialized successfully
Number of CUDA devices: 8
Device 0: NVIDIA A100-SXM4-40GB
Device 1: NVIDIA A100-SXM4-40GB
Device 2: NVIDIA A100-SXM4-40GB
Device 3: NVIDIA A100-SXM4-40GB
Device 4: NVIDIA A100-SXM4-40GB
Device 5: NVIDIA A100-SXM4-40GB
Device 6: NVIDIA A100-SXM4-40GB
Device 7: NVIDIA A100-SXM4-40GB
Successfully created tensor on GPU: tensor([[0.6762, 0.1642, 0.3304],
        [0.8413, 0.7888, 0.8880],
        [0.0072, 0.8133, 0.7192],
        [0.9159, 0.1124, 0.2044],
        [0.4343, 0.2635, 0.5052]], device='cuda:0')

DKchemistry commented 3 months ago

Okay, sorry to continue on here, but now the same code stopped working with no changes.

I followed @ptrblck's advice and added:

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513)
export CUDA_VISIBLE_DEVICES=GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513

But it is STILL going for the T1000:

python -u progressive_docking_mtl.py -os 10 -bs 256 -num_units 1500 -dropout 0.2 -learn_rate 0.0001 -bin_array 2 -wt 3 -cf -9.236514914296889 -rec 0.9 -n_it 1 -t_mol 65.994673 --data_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl --save_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl -n_mol 462000
Relevant environment variables:
CUDA_VISIBLE_DEVICES: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513
CUDA_DEVICE_ORDER: Not set
Available GPUs (pynvml):
GPU 0: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 1: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 2: NVIDIA A100 80GB PCIe - Free Memory: 77.56 GB
GPU 3: NVIDIA A100 80GB PCIe - Free Memory: 77.54 GB
GPU 4: NVIDIA T1000 8GB - Free Memory: 7.78 GB
Skipping GPU 4 (NVIDIA T1000 8GB) as it's the T1000.
Selected GPU 0 with 79.09 GB free memory.
Relevant environment variables:
CUDA_VISIBLE_DEVICES: 0
CUDA_DEVICE_ORDER: PCI_BUS_ID
PyTorch CUDA information:
PyTorch CUDA device count: 1
PyTorch current device: 0
PyTorch device name: NVIDIA T1000 8GB

I figured it might be a MIG issue, so I tried the MIG UUID instead:

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513)
  MIG 1g.20gb     Device  0: (UUID: MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c)
  MIG 1g.20gb     Device  1: (UUID: MIG-310fed97-ea9c-5128-91bd-00d25f05cd0e)
export CUDA_VISIBLE_DEVICES=MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c
echo $CUDA_VISIBLE_DEVICES
MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c

Still takes the T1000.

I then removed all the code related to setting the GPU from the script:

import glob
import gc
import argparse
import os
import random
import sys
import time
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# def print_env_vars():
#     print("Relevant environment variables:")
#     for var in ["CUDA_VISIBLE_DEVICES", "CUDA_DEVICE_ORDER"]:
#         print(f"{var}: {os.environ.get(var, 'Not set')}")

# print_env_vars()

# def select_gpu():
#     import pynvml

#     pynvml.nvmlInit()

#     device_count = pynvml.nvmlDeviceGetCount()
#     selected_device = None
#     max_free_memory = 0

#     print("Available GPUs (pynvml):")
#     for i in range(device_count):
#         handle = pynvml.nvmlDeviceGetHandleByIndex(i)
#         device_name = pynvml.nvmlDeviceGetName(handle)
#         if isinstance(device_name, bytes):
#             device_name = device_name.decode("utf-8")

#         mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
#         free_memory = mem_info.free / (1024**3)
#         print(f"GPU {i}: {device_name} - Free Memory: {free_memory:.2f} GB")

#         if "T1000" in device_name:
#             print(f"Skipping GPU {i} ({device_name}) as it's the T1000.")
#             continue

#         if free_memory > max_free_memory:
#             selected_device = i
#             max_free_memory = free_memory

#     pynvml.nvmlShutdown()

#     if selected_device is None:
#         print("No suitable GPU found. Exiting.")
#         sys.exit(1)

#     print(f"Selected GPU {selected_device} with {max_free_memory:.2f} GB free memory.")
#     return selected_device

# # Select GPU and set environment variable
# selected_gpu = select_gpu()
# os.environ["CUDA_VISIBLE_DEVICES"] = str(selected_gpu)
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# print_env_vars()

# Now it's safe to import torch
import torch

print("PyTorch CUDA information:")
if torch.cuda.is_available():
    print(f"PyTorch CUDA device count: {torch.cuda.device_count()}")
    print(f"PyTorch current device: {torch.cuda.current_device()}")
    try:
        print(f"PyTorch device name: {torch.cuda.get_device_name(0)}")
    except Exception as e:
        print(f"Error getting device name: {e}")

    device = torch.device("cuda")
else:
    print(
        "CUDA is not available. Please check your PyTorch installation and GPU setup."
    )
    sys.exit(1)

print(f"Using device: {device}")

Now it does work:

python -u progressive_docking_mtl.py -os 10 -bs 256 -num_units 1500 -dropout 0.2 -learn_rate 0.0001 -bin_array 2 -wt 3 -cf -9.236514914296889 -rec 0.9 -n_it 1 -t_mol 65.994673 --data_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl --save_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl -n_mol 462000
PyTorch CUDA information:
PyTorch CUDA device count: 1
PyTorch current device: 0
PyTorch device name: NVIDIA A100 80GB PCIe MIG 1g.20gb
Using device: cuda

But why did it work before? Why did it stop working? How should I programmatically select a GPU for PyTorch to run on in a group environment, where people don't want me running my PyTorch code on the GPUs they are using for other things? Or do I have to manually check and specify the particular MIG instance I want every time, for every job? I just feel like there must be a better approach than what I am doing here, or I am fundamentally misunderstanding something. If you have any advice, I would really appreciate it! @ptrblck @malfet
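For reference, one pattern that would stay consistent with @ptrblck's advice about setting the variable before the job process starts is to keep the pynvml query in a separate launcher that never imports torch, pick a MIG instance by UUID, and hand it to the real job through its environment. The sketch below is only a rough idea (launch_job.py is a hypothetical name, and the MIG-related NVML calls are an assumption about the installed nvidia-ml-py version, so they may need adjusting):

import os
import subprocess
import sys

import pynvml


def pick_mig_uuid():
    """Pick the MIG instance with the most free memory and return its UUID.

    Assumes nvidia-ml-py exposes the MIG helpers used below
    (nvmlDeviceGetMaxMigDeviceCount, nvmlDeviceGetMigDeviceHandleByIndex);
    adjust for your pynvml version if needed.
    """
    pynvml.nvmlInit()
    best_uuid, best_free = None, 0
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            parent = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(parent)
            if isinstance(name, bytes):
                name = name.decode("utf-8")
            if "T1000" in name:
                continue  # skip the display GPU, as in the script above
            try:
                mig_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)
            except pynvml.NVMLError:
                mig_count = 0  # MIG not enabled on this GPU
            for j in range(mig_count):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
                except pynvml.NVMLError:
                    continue  # no MIG device in this slot
                free = pynvml.nvmlDeviceGetMemoryInfo(mig).free
                if free > best_free:
                    uuid = pynvml.nvmlDeviceGetUUID(mig)
                    if isinstance(uuid, bytes):
                        uuid = uuid.decode("utf-8")
                    best_uuid, best_free = uuid, free
    finally:
        pynvml.nvmlShutdown()
    return best_uuid


if __name__ == "__main__":
    uuid = pick_mig_uuid()
    if uuid is None:
        sys.exit("No suitable MIG instance found.")
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = uuid  # set before the job process exists
    subprocess.run(
        [sys.executable, "progressive_docking_mtl.py"] + sys.argv[1:],
        env=env,
        check=True,
    )

Run as python launch_job.py <job args>: the arguments are forwarded to progressive_docking_mtl.py with CUDA_VISIBLE_DEVICES already pointing at a single MIG instance, so the job's torch import never sees the other GPUs.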

atalman commented 3 months ago

As per discussion with @ptrblck, we suggest implementing a warning for this use case in release 2.4.1.