I'm trying to train the model on Google Colab using a T4 GPU. I got this error:
Traceback (most recent call last):
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 332, in <module>
    main(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 184, in main
    utils.init_distributed_mode(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/util/misc.py", line 442, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
I tried running with TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1. I also checked the CUDA version and the available CUDA devices. I've tried many things but couldn't solve this problem.
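For reference, my device check was essentially the following (a minimal sketch using the standard `torch.cuda` calls; nothing here is specific to MOTRv2):

```python
import torch

# How many CUDA devices does this runtime actually expose?
# "invalid device ordinal" means torch.cuda.set_device() received an
# index greater than or equal to this count.
count = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}, devices: {count}")

# List the visible devices by index and name.
for i in range(count):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```

On a single-T4 Colab runtime only cuda:0 exists, so passing any other index to torch.cuda.set_device() would raise this error.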
Thank you in advance.