I tried to run the `ssd` benchmark on an NVIDIA A100 with MIG GPU instances, and it appears to be incompatible with the way MIG GPU instances are made accessible.
Every spawned subprocess except the first one failed with this error:
THCudaCheck FAIL file=/opt/pytorch/pytorch/torch/csrc/cuda/Module.cpp line=33 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/torch/cuda/__init__.py", line 265, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/pytorch/pytorch/torch/csrc/cuda/Module.cpp:33
This error can be easily reproduced with this Python one-liner:
sh-4.4# python -c "import torch; torch.cuda.set_device(0)" # works
sh-4.4# python -c "import torch; torch.cuda.set_device(1)" # doesn't work, nor other devices
while the correct way, AFAIK, is:
I've prepared a quick-and-dirty patch that modifies `bind_launch.py` to take this into account, see: https://github.com/kpouget/training/commit/c4c4bc5388c5d0c8b29758990a08d04763452684
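
For illustration, the general idea can be sketched as below. This is not the patch itself, only a minimal sketch under some assumptions: a process can address just one MIG instance at a time, the instance is selected through `CUDA_VISIBLE_DEVICES`, and the child then always uses device index 0. The entries in `MIG_DEVICES` are placeholders, and the helper names `env_for_rank` and `launch` are hypothetical.

```python
import os
import subprocess
import sys

# Placeholder identifiers (assumption): substitute the MIG entries that
# `nvidia-smi -L` reports on your machine.
MIG_DEVICES = ["MIG-GPU-<uuid>/1/0", "MIG-GPU-<uuid>/2/0"]


def env_for_rank(rank, mig_device, base_env=None):
    """Build an environment in which the child sees only one MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_device  # the child addresses it as device 0
    env["RANK"] = str(rank)                   # the global rank still differs per child
    return env


def launch(child_cmd, mig_devices):
    """Start one child per MIG instance and return their exit codes."""
    procs = [
        subprocess.Popen(child_cmd, env=env_for_rank(rank, dev))
        for rank, dev in enumerate(mig_devices)
    ]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Harmless stand-in for the training script: print what each child sees.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['RANK'], os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    print(launch(child, MIG_DEVICES))
```

With this kind of binding, each child would call `torch.cuda.set_device(0)` rather than `torch.cuda.set_device(local_rank)`, since only one device is visible to it.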