pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

"device >= 0 && device < num_gpus INTERNAL ASSERT FAILED" with torch 2.5.0.dev20240705+cu121 on 2 GPU NVIDIA-A100-SXM4-80GB-MIG-3g.40gb #130181

Closed: BioGeek closed this issue 2 months ago

BioGeek commented 3 months ago

šŸ› Describe the bug

The error torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED was reported earlier in #128819 and should have been resolved in 2.4.0rc1, but I still see it with the PyTorch nightly (2.5.0.dev20240705+cu121) on an NVIDIA-A100-SXM4-80GB-MIG-3g.40gb setup using 2 GPUs when running the collect_env script.

Also related to #107300, which confirms there is still an issue on MIG machines, but that issue is already closed.

Traceback

Collecting environment information...
Error executing job with overrides: []
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 327, in _lazy_init
queued_call()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 195, in _check_capability
capability = get_device_capability(d)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 504, in get_device_capability
prop = get_device_properties(device)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 522, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/instanovo/transformer/train.py", line 822, in main
collect_env() # TODO remove. Only used for debugging
File "/app/instanovo/utils/collect_env.py", line 636, in main
output = get_pretty_env_info()
File "/app/instanovo/utils/collect_env.py", line 631, in get_pretty_env_info
return pretty_str(get_env_info())
File "/app/instanovo/utils/collect_env.py", line 491, in get_env_info
cuda_module_loading=get_cuda_module_loading_config(),
File "/app/instanovo/utils/collect_env.py", line 425, in get_cuda_module_loading_config
torch.cuda.init()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 281, in init
_lazy_init()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 333, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/instanovo/transformer/train.py", line 17, in <module>
import pytorch_lightning as ptl
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/pytorch_lightning/__init__.py", line 25, in <module>
from lightning_fabric.utilities.seed import seed_everything # noqa: E402
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/lightning_fabric/__init__.py", line 30, in <module>
from lightning_fabric.fabric import Fabric # noqa: E402
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/lightning_fabric/fabric.py", line 35, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/__init__.py", line 1903, in <module>
_C._initExtension(_manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 259, in <module>
_lazy_call(_check_capability)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))

Versions

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim

BioGeek commented 3 months ago

If I just run the collect_env script on its own, it works just fine; see the output below. In the previous comment I called the collect_env script from line 822 of my training script, and then I got the traceback.

Collecting environment information...
PyTorch version: 2.5.0.dev20240705+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1055-nvidia-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
MIG 3g.40gb Device 0:
MIG 3g.40gb Device 1:
Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.29
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==2.3.2
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.5.0.dev20240705+cu121
[pip3] torchmetrics==1.1.2
[conda] Could not collect

BioGeek commented 3 months ago

Minimal reproducible example. Replacing the call to main() in the collect_env.py script with:

if __name__ == '__main__':
    from lightning.fabric import Fabric

    fabric = Fabric(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        precision="16-mixed",
    )

    main()

triggers the error:

Using 16-bit Automatic Mixed Precision (AMP)
/opt/venv/lib/python3.10/site-packages/lightning/fabric/plugins/precision/amp.py:52: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
Collecting environment information...
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 327, in _lazy_init
queued_call()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 195, in _check_capability
capability = get_device_capability(d)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 504, in get_device_capability
prop = get_device_properties(device)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 522, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/instanovo/utils/collect_env.py", line 664, in <module>
main()
File "/app/instanovo/utils/collect_env.py", line 636, in main
output = get_pretty_env_info()
File "/app/instanovo/utils/collect_env.py", line 631, in get_pretty_env_info
return pretty_str(get_env_info())
File "/app/instanovo/utils/collect_env.py", line 491, in get_env_info
cuda_module_loading=get_cuda_module_loading_config(),
File "/app/instanovo/utils/collect_env.py", line 425, in get_cuda_module_loading_config
torch.cuda.init()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 281, in init
_lazy_init()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 333, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/app/instanovo/utils/collect_env.py", line 18, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/__init__.py", line 1903, in <module>
_C._initExtension(_manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 259, in <module>
_lazy_call(_check_capability)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))

albanD commented 3 months ago

It is very much possible that this hasn't propagated to nightlies yet.

atalman commented 3 months ago

Here are the validations on the 2.4 RC: https://github.com/pytorch/builder/actions/runs/9841936763/job/27169815737
These are the nightly validations, also green: https://github.com/pytorch/builder/actions/runs/9842271161/job/27170869663#step:12:4110

atalman commented 3 months ago

Hi @BioGeek, running your repro on an A100 with lightning 2.3.2, lightning-utilities 0.11.3.post0, and torch 2.5.0.dev20240708+cu121 (tested with torch 2.5.0.dev20240705+cu121 as well), I get this output:

Using 16-bit Automatic Mixed Precision (AMP)
/site-packages/lightning/fabric/plugins/precision/amp.py:52: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.

Looks like it's specific to MIG.

BioGeek commented 2 months ago

I can confirm it works fine on an A100 and is MIG-specific.

ptrblck commented 2 months ago

MIG worked fine when created and used properly, as seen e.g. here. @BioGeek, did you check if any other CUDA application runs successfully in this MIG slice? If so, does a smoke test already fail via e.g. python -c "import torch; torch.randn(1).cuda()"?

The currently posted code snippets use higher-level APIs and I'm not familiar with their internal setup, so it would be great if we could narrow this down to a pure PyTorch code snippet reproducing the issue.
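
A pure-PyTorch check along those lines could look like the minimal sketch below. This is an illustration rather than code from the thread; it assumes both MIG slices are visible to the process and simply queries the properties of every reported device index, torch.cuda.get_device_properties being the call that raises the internal assert in the tracebacks above:

import torch

print(torch.__version__)
n = torch.cuda.device_count()
print("device_count:", n)
for i in range(n):
    # On the affected MIG setup, index 1 is expected to hit the
    # "device >= 0 && device < num_gpus" internal assert here.
    print(i, torch.cuda.get_device_properties(i))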

BioGeek commented 2 months ago

@ptrblck A smoke test like torch.empty(2, device="cuda") fails.

Debug script which mostly prints out version info:

import os
import sys
import torch
from nvitop import Device
import subprocess

def print_nvidia_smi_output():
    try:
        result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        if result.returncode != 0:
            print("Error running nvidia-smi:", result.stderr)
        else:
            print(result.stdout)
    except FileNotFoundError:
        print("nvidia-smi command not found. Please make sure NVIDIA drivers are installed and nvidia-smi is in your PATH.")

def print_process_info():
    devices = Device.cuda.all()
    for device in devices:
        processes = device.processes() 
        sorted_pids = sorted(processes.keys())
        print(f'Processes ({len(processes)}): {sorted_pids}')
        for pid in sorted_pids:
            print(f'\t- {processes[pid]}')

print(f"Python version: {sys.version}")
print(f"Torch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print(f"CUDNN version: {torch.backends.cudnn.version()}")
print(f"Flash SDP enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
print_nvidia_smi_output()
print_process_info()

# smoke test
print(f'torch.empty(2, device="cuda"): {torch.empty(2, device="cuda")}')

Output:

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
Torch version: 2.5.0.dev20240719+cu124
CUDA version: 12.4
CUDNN version: 90100
Flash SDP enabled: True
Device count: 2
CUDA_VISIBLE_DEVICES: None
Fri Jul 19 18:47:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.4 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:47:00.0 Off | On |
| N/A 34C P0 81W / 400W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:4E:00.0 Off | On |
| N/A 48C P0 173W / 400W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 1 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 1 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Processes (0): []
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 327, in _lazy_init
queued_call()
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 195, in _check_capability
capability = get_device_capability(d)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 504, in get_device_capability
prop = get_device_properties(device)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 522, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/instanovo/utils/debug.py", line 37, in <module>
print(f'torch.empty(2, device="cuda"): {torch.empty(2, device="cuda")}')
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 333, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/app/instanovo/utils/debug.py", line 3, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/__init__.py", line 1940, in <module>
_C._initExtension(_manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 259, in <module>
_lazy_call(_check_capability)
File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))

ptrblck commented 2 months ago

Thanks for the check! I still cannot reproduce any issues using MIG and see the expected error only if CUDA_VISIBLE_DEVICES isn't used properly:

python -c "import torch; print(torch.__version__); print(torch.randn(1).cuda())"
2.5.0.dev20240722+cu124
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 327, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 195, in _check_capability
    capability = get_device_capability(d)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 504, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 522, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":49, please report a bug to PyTorch. device=1, num_gpus=
nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-e4d9d2d4-9062-a16e-4255-b8d263aabdcb)
  MIG 3g.40gb     Device  0: (UUID: MIG-9e7b4ed9-c1d1-564c-b20d-3a3a041fbae1)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-f1ca29a7-ca35-0e12-f9cd-ede2a0c442c5)
...
CUDA_VISIBLE_DEVICES=MIG-9e7b4ed9-c1d1-564c-b20d-3a3a041fbae1 python -c "import torch; print(torch.__version__); print(torch.randn(1).cuda())"
2.5.0.dev20240722+cu124
tensor([0.2847], device='cuda:0')

BioGeek commented 2 months ago

I can confirm that parsing the MIG UUIDs from nvidia-smi and setting CUDA_VISIBLE_DEVICES to those values makes the program work.

import os
import sys
import torch
import subprocess
import re

def get_mig_uuids():
    result = subprocess.run(['nvidia-smi', '-L'], stdout=subprocess.PIPE, text=True)

    if result.returncode != 0:
        raise RuntimeError(f"Command 'nvidia-smi -L' failed with exit code {result.returncode}")

    output = result.stdout
    print(output)

    mig_uuid_pattern = re.compile(r'MIG-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')

    mig_uuids = mig_uuid_pattern.findall(output)

    return mig_uuids

def set_cuda_visible_devices(mig_uuids):
    mig_uuids_str = ','.join(mig_uuids)
    os.environ['CUDA_VISIBLE_DEVICES'] = mig_uuids_str
    print(f"CUDA_VISIBLE_DEVICES set to: {mig_uuids_str}")

print(f"Python version: {sys.version}")
print(f"Torch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print(f"CUDNN version: {torch.backends.cudnn.version()}")
print(f"Flash SDP enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")

mig_uuids = get_mig_uuids()
if mig_uuids:
    set_cuda_visible_devices(mig_uuids)
else:
    print("No MIG devices found.")

# smoke tests
print(f"torch.randn(1).cuda(): {torch.randn(1).cuda()}")
print(f'torch.empty(2, device="cuda"): {torch.empty(2, device="cuda")}')

gives the following output:

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
Torch version: 2.5.0.dev20240722+cu124
CUDA version: 12.4
CUDNN version: 90100
Flash SDP enabled: True
Device count: 2
CUDA_VISIBLE_DEVICES: None
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-4eda7b9f-c2ef-7911-5b08-85d21e06b933)
MIG 3g.40gb Device 0: (UUID: MIG-2ebddc71-af15-500e-9896-7f0441a5d400)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-755bceef-0705-9b5c-37db-1c9db1d5c73d)
MIG 3g.40gb Device 0: (UUID: MIG-adb10ab9-2eb5-53c4-a771-d268f765d32f)
CUDA_VISIBLE_DEVICES set to: MIG-2ebddc71-af15-500e-9896-7f0441a5d400,MIG-adb10ab9-2eb5-53c4-a771-d268f765d32f
torch.randn(1).cuda(): tensor([-0.0215], device='cuda:0')
torch.empty(2, device="cuda"): tensor([-0.0215, 0.0000], device='cuda:0')

> see the expected error only if CUDA_VISIBLE_DEVICES isn't used properly

Can you point me to documentation that explains the proper way of using CUDA_VISIBLE_DEVICES with MIG devices?

My intention was to do distributed training on multiple MIG devices, but after reading this answer it seems that is not possible?

ptrblck commented 2 months ago

> My intention was to do distributed training on multiple MIG devices, but after reading this answer it seems that is not possible?

This is correct and it's not possible. From the docs:

MIG supports running CUDA applications by specifying the CUDA device on which the application should be run. With CUDA 11/R450 and CUDA 12/R525, only enumeration of a single MIG instance is supported. In other words, regardless of how many MIG devices are created (or made available to a container), a single CUDA process can only enumerate a single MIG device.

Thanks for confirming it's working fine by setting CUDA_VISIBLE_DEVICES to the right MIG slice.
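
For completeness, one way to still use both slices is to run one process per MIG device, each pinned to a single slice via CUDA_VISIBLE_DEVICES, since a single CUDA process can only enumerate one MIG device. The sketch below is an illustration only, not code from this thread; worker.py is a hypothetical placeholder for a per-process entry point, and inside each worker torch.cuda.device_count() should report exactly 1:

import os
import re
import subprocess

# Collect the MIG UUIDs exposed by `nvidia-smi -L` (same regex as in the script above).
out = subprocess.run(["nvidia-smi", "-L"], stdout=subprocess.PIPE, text=True).stdout
mig_uuids = re.findall(
    r"MIG-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", out
)

# Launch one worker per MIG slice; each process only ever sees its own slice.
procs = []
for rank, uuid in enumerate(mig_uuids):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
    # worker.py is a placeholder for your actual training entry point.
    procs.append(
        subprocess.Popen(["python", "worker.py", "--rank", str(rank)], env=env)
    )

for p in procs:
    p.wait()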