mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

single_stage_detector/ssd/bind_launch.py not compatible with NVIDIA MIG GPU instances #442

Closed kpouget closed 3 years ago

kpouget commented 3 years ago

Hello,

I tried to run the ssd benchmark on NVIDIA A100 with MIG GPU instances, and it appears to be incompatible with the way MIG GPU instances are made accessible.

Every spawned subprocess except the first one failed with this error:

THCudaCheck FAIL file=/opt/pytorch/pytorch/torch/csrc/cuda/Module.cpp line=33 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/torch/cuda/__init__.py", line 265, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/pytorch/pytorch/torch/csrc/cuda/Module.cpp:33

This error can be easily reproduced with this Python one-liner:

sh-4.4# python -c "import torch; torch.cuda.set_device(0)" # works
sh-4.4# python -c "import torch; torch.cuda.set_device(1)" # doesn't work, nor other devices
...

while the correct way, AFAIK, is:

env CUDA_VISIBLE_DEVICES=MIG_UID_0 python -c "import torch; torch.cuda.set_device(0)"
env CUDA_VISIBLE_DEVICES=MIG_UID_1 python -c "import torch; torch.cuda.set_device(0)"
...

I've prepared a quick-and-dirty patch that modifies bind_launch.py to take this into account; see: https://github.com/kpouget/training/commit/c4c4bc5388c5d0c8b29758990a08d04763452684
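
For illustration, the per-process CUDA_VISIBLE_DEVICES approach shown in the commands above could be sketched roughly as follows. This is not the linked patch and not the actual bind_launch.py code; `worker_env`, `launch_workers`, and the `LOCAL_RANK` variable are hypothetical names chosen for this sketch, and `MIG_UID_0` / `MIG_UID_1` are the same placeholders as above (real identifiers would presumably come from `nvidia-smi -L`). It only shows how each worker can be restricted to one MIG instance, so that device ordinal 0 is always the valid one; it says nothing about whether the workers can then communicate with each other.

```python
import os
import subprocess
import sys


def worker_env(mig_uuid, base_env=None):
    """Return a copy of the environment that exposes a single MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    # With only one device visible, the worker must always use ordinal 0.
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_workers(mig_uuids, command):
    """Start one subprocess per MIG instance and collect each one's stdout."""
    procs = []
    for local_rank, mig_uuid in enumerate(mig_uuids):
        env = worker_env(mig_uuid)
        # LOCAL_RANK is a hypothetical name for this sketch. The worker would
        # still call torch.cuda.set_device(0), not set_device(local_rank).
        env["LOCAL_RANK"] = str(local_rank)
        procs.append(
            subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)
        )
    return [p.communicate()[0].strip() for p in procs]


if __name__ == "__main__":
    # Placeholder identifiers, as in the commands above.
    uuids = ["MIG_UID_0", "MIG_UID_1"]
    # The probe only echoes what each worker sees; a real worker would run
    # the training script instead.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(launch_workers(uuids, probe))
```

The probe command here avoids importing torch so the sketch can be tried on a machine without a GPU; replacing it with the real training command is left as an assumption.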

kpouget commented 3 years ago

Closing this bug, as NCCL (used by MLCommons) is not compatible with MIG devices: GPU peer-to-peer communication is not supported between MIG instances.