DKchemistry opened this issue 3 months ago
Hmm, I can not reproduce it, though I don't have MIGs...
First, I can echo $CUDA_VISIBLE_DEVICES and see nothing is set.
This is wrong as you need to set CUDA_VISIBLE_DEVICES to the MIG slice as shown here: https://github.com/pytorch/pytorch/issues/130181#issuecomment-2243466478
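For reference, a minimal check (a sketch, assuming CUDA_VISIBLE_DEVICES was exported to a MIG UUID before the Python process started) of what PyTorch should then report:

```python
# Assumes the interpreter was launched with the variable already set, e.g.
#   CUDA_VISIBLE_DEVICES=MIG-<uuid> python check_visible_device.py
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())       # 1: only the selected slice is visible
print("Device 0 name:", torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100 ... MIG 1g.20gb"
```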
@ptrblck
Thank you, this does make sense. I incorrectly thought you could query from PT directly and then set the device as such, but it's not the case.
FWIW, I've had frequent issues in the past (not very helpful without associated code, I know) with the ordering of importing/using Python's os module to set CUDA_VISIBLE_DEVICES and the import of torch itself. Is there a reason for this?
Hmm, I can not reproduce it, though I don't have MIGs...
@malfet
Unfortunately, no sudo or admin rights to change the partitions and don't have access to a non-MIG machine atm. However, does this code work for you without setting CUDA_VISIBLE_DEVICES?
FWIW, I've had frequent issues in the past (not very helpful without associated code, I know) with the ordering of importing/using Python's os module to set CUDA_VISIBLE_DEVICES and the import of torch itself. Is there a reason for this?
We generally do not recommend setting CUDA_VISIBLE_DEVICES inside the script, as users need to make sure it's set before any CUDA-enabled library initializes the driver or context. The safe approach would be to export it in your terminal or to set it when launching the python process, as seen in my examples.
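To illustrate the ordering constraint, here is a minimal sketch (a hypothetical standalone script, not code from this thread): CUDA_VISIBLE_DEVICES is only read when the CUDA driver/context is first initialized, so assigning it afterwards has no effect in that process.

```python
import os
import torch

torch.cuda.init()                          # CUDA is initialized here
print(torch.cuda.device_count())           # e.g. 8 on an 8-GPU node

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # too late: ignored by this process
print(torch.cuda.device_count())           # still e.g. 8, not 1
```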
I've created https://github.com/pytorch/pytorch/issues/131667 to improve the error message so thanks for reporting the issue here!
Unfortunately, no sudo or admin rights to change the partitions and don't have access to a non-MIG machine atm. However, does this code work for you without setting CUDA_VISIBLE_DEVICES?
Yes it does:
PyTorch version: 2.4.0
CUDA available: True
CUDA version: 12.4
CUDA initialized successfully
Number of CUDA devices: 8
Device 0: NVIDIA A100-SXM4-40GB
Device 1: NVIDIA A100-SXM4-40GB
Device 2: NVIDIA A100-SXM4-40GB
Device 3: NVIDIA A100-SXM4-40GB
Device 4: NVIDIA A100-SXM4-40GB
Device 5: NVIDIA A100-SXM4-40GB
Device 6: NVIDIA A100-SXM4-40GB
Device 7: NVIDIA A100-SXM4-40GB
Successfully created tensor on GPU: tensor([[0.6762, 0.1642, 0.3304],
[0.8413, 0.7888, 0.8880],
[0.0072, 0.8133, 0.7192],
[0.9159, 0.1124, 0.2044],
[0.4343, 0.2635, 0.5052]], device='cuda:0')
Okay, sorry to continue on here, but now the same code stopped working with no changes.
I followed @ptrblck's advice to add:
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513)
export CUDA_VISIBLE_DEVICES=GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513
But it is STILL going for the T1000:
python -u progressive_docking_mtl.py -os 10 -bs 256 -num_units 1500 -dropout 0.2 -learn_rate 0.0001 -bin_array 2 -wt 3 -cf -9.236514914296889 -rec 0.9 -n_it 1 -t_mol 65.994673 --data_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl --save_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl -n_mol 462000
Relevant environment variables:
CUDA_VISIBLE_DEVICES: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513
CUDA_DEVICE_ORDER: Not set
Available GPUs (pynvml):
GPU 0: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 1: NVIDIA A100 80GB PCIe - Free Memory: 79.09 GB
GPU 2: NVIDIA A100 80GB PCIe - Free Memory: 77.56 GB
GPU 3: NVIDIA A100 80GB PCIe - Free Memory: 77.54 GB
GPU 4: NVIDIA T1000 8GB - Free Memory: 7.78 GB
Skipping GPU 4 (NVIDIA T1000 8GB) as it's the T1000.
Selected GPU 0 with 79.09 GB free memory.
Relevant environment variables:
CUDA_VISIBLE_DEVICES: 0
CUDA_DEVICE_ORDER: PCI_BUS_ID
PyTorch CUDA information:
PyTorch CUDA device count: 1
PyTorch current device: 0
PyTorch device name: NVIDIA T1000 8GB
I figured it was maybe a MIG issue, so:
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-0159b645-5c9b-d8ec-3ab3-ecc0dc8ee513)
MIG 1g.20gb Device 0: (UUID: MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c)
MIG 1g.20gb Device 1: (UUID: MIG-310fed97-ea9c-5128-91bd-00d25f05cd0e)
export CUDA_VISIBLE_DEVICES=MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c
echo $CUDA_VISIBLE_DEVICES
MIG-0f31f4c6-9357-5c01-85cb-9a845d6fad1c
Still takes the T1000.
Removed all the code related to setting the GPU in the script:
import glob
import gc
import argparse
import os
import random
import sys
import time

import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# def print_env_vars():
#     print("Relevant environment variables:")
#     for var in ["CUDA_VISIBLE_DEVICES", "CUDA_DEVICE_ORDER"]:
#         print(f"{var}: {os.environ.get(var, 'Not set')}")

# print_env_vars()

# def select_gpu():
#     import pynvml
#     pynvml.nvmlInit()
#     device_count = pynvml.nvmlDeviceGetCount()
#     selected_device = None
#     max_free_memory = 0
#     print("Available GPUs (pynvml):")
#     for i in range(device_count):
#         handle = pynvml.nvmlDeviceGetHandleByIndex(i)
#         device_name = pynvml.nvmlDeviceGetName(handle)
#         if isinstance(device_name, bytes):
#             device_name = device_name.decode("utf-8")
#         mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
#         free_memory = mem_info.free / (1024**3)
#         print(f"GPU {i}: {device_name} - Free Memory: {free_memory:.2f} GB")
#         if "T1000" in device_name:
#             print(f"Skipping GPU {i} ({device_name}) as it's the T1000.")
#             continue
#         if free_memory > max_free_memory:
#             selected_device = i
#             max_free_memory = free_memory
#     pynvml.nvmlShutdown()
#     if selected_device is None:
#         print("No suitable GPU found. Exiting.")
#         sys.exit(1)
#     print(f"Selected GPU {selected_device} with {max_free_memory:.2f} GB free memory.")
#     return selected_device

# # Select GPU and set environment variable
# selected_gpu = select_gpu()
# os.environ["CUDA_VISIBLE_DEVICES"] = str(selected_gpu)
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# print_env_vars()

# Now it's safe to import torch
import torch

print("PyTorch CUDA information:")
if torch.cuda.is_available():
    print(f"PyTorch CUDA device count: {torch.cuda.device_count()}")
    print(f"PyTorch current device: {torch.cuda.current_device()}")
    try:
        print(f"PyTorch device name: {torch.cuda.get_device_name(0)}")
    except Exception as e:
        print(f"Error getting device name: {e}")
    device = torch.device("cuda")
else:
    print(
        "CUDA is not available. Please check your PyTorch installation and GPU setup."
    )
    sys.exit(1)

print(f"Using device: {device}")
Now it does work:
python -u progressive_docking_mtl.py -os 10 -bs 256 -num_units 1500 -dropout 0.2 -learn_rate 0.0001 -bin_array 2 -wt 3 -cf -9.236514914296889 -rec 0.9 -n_it 1 -t_mol 65.994673 --data_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl --save_path /mnt/data/dk/work/DeepDocking/projects/2RH1_mtl -n_mol 462000
PyTorch CUDA information:
PyTorch CUDA device count: 1
PyTorch current device: 0
PyTorch device name: NVIDIA A100 80GB PCIe MIG 1g.20gb
Using device: cuda
But why did it work before? Why did it stop working? How should I programmatically select a GPU for PyTorch to run on in a group environment, when people don't want me running my PyTorch code on their GPUs that are being used for other things? Or do I have to manually check and edit in the specific MIG I want every time, for every job? I just feel like there must be a better approach than what I am doing here, or I am fundamentally misunderstanding something. If you have any advice, I would really appreciate it! @ptrblck @malfet
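One possible pattern, following the recommendation above to set the variable only when launching the process, is to move the pynvml query into a small launcher that sets CUDA_VISIBLE_DEVICES for the child process only, so the training script itself never touches it. A rough sketch (the skip list is a placeholder, the training script name is taken from this thread, and on a MIG-partitioned GPU the UUID of the MIG instance from nvidia-smi -L, not the parent GPU's UUID, would have to be passed instead):

```python
# launch_on_free_gpu.py -- hypothetical launcher sketch, not a tested fix.
# Picks the non-skipped GPU with the most free memory via pynvml and starts
# the training script with CUDA_VISIBLE_DEVICES set in the child environment.
import os
import subprocess
import sys

import pynvml

SKIP_SUBSTRINGS = ("T1000",)  # placeholder: cards other people are using

pynvml.nvmlInit()
best_uuid, best_free = None, 0
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):
        name = name.decode("utf-8")
    if any(s in name for s in SKIP_SUBSTRINGS):
        continue
    free = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    if free > best_free:
        best_free = free
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        best_uuid = uuid.decode("utf-8") if isinstance(uuid, bytes) else uuid
pynvml.nvmlShutdown()

if best_uuid is None:
    sys.exit("No suitable GPU found")

# The variable is set only in the child's environment; torch is imported
# there, after CUDA_VISIBLE_DEVICES is already in place.
env = {**os.environ, "CUDA_VISIBLE_DEVICES": best_uuid}
subprocess.run(
    [sys.executable, "progressive_docking_mtl.py", *sys.argv[1:]],
    env=env,
    check=True,
)
```

Whether the UUID form resolves correctly under a given driver/MIG configuration is exactly what this issue is about, so this is a starting point rather than a guaranteed fix.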
As per discussion with @ptrblck, we suggest implementing a warning for this use case for release 2.4.1.
🐛 Describe the bug
Hi PyTorch community, I have been troubleshooting this for a few days but can't seem to fix it.
Essentially, I want to be able to get the device_name or other properties to dynamically allocate GPUs to a particular job.
I have tried various builds, but it always fails at that call. I can get things onto the GPU, especially by setting CUDA_VISIBLE_DEVICES, but a programmatic solution would be ideal.
Running this:
Returns:
PyTorch version: 2.4.0
CUDA available: True
CUDA version: 12.4
Error initializing CUDA: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1720538439675/work/aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=, num_gpus=
Here is my nvidia-smi:
I have also tried a different version of PyTorch, which I installed using this command from PyTorch.org:
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch-nightly -c nvidia
In that environment, I get the same error (but, strangely, it does not report the CUDA version I expect).
I have also tried the collect_env.py script in both environments; I think the error is identical, but I will give both either way:
nightly build:
"stable" build:
What is really strange to me is that I can use the nightly build to run ML code that does indeed go to the GPU. The entire script is probably superfluous for everyone to deal with, so I will just give the relevant parts:
First, I can echo $CUDA_VISIBLE_DEVICES and see nothing is set. Then I can run my code:
and see that indeed, my process is on the GPU I want:
Here is that code for reference:
I am not super savvy, but I am very confused as to what is going on. I think I am delirious from trying to manage and troubleshoot these environments, so sorry if this makes little sense! It wasn't until I wrote this bug report that I tried my code again, and it is working now, but it kept giving me issues before. Still, it seems strange that the most basic version of get_device() seems to fail, requiring pynvml and all of these os calls to get the GPU allocation I want.
Versions
nightly build:
"stable" build:
cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim