megvii-research / MOTRv2

[CVPR2023] MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

CUDA Runtime Error while Training #61

Open semabtl opened 6 months ago

semabtl commented 6 months ago

I'm trying to train the model on Google Colab using a T4 GPU, and I get this error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 332, in <module>
Traceback (most recent call last):
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 332, in <module>
    main(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 184, in main
    main(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/main.py", line 184, in main
    utils.init_distributed_mode(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/util/misc.py", line 442, in init_distributed_mode
    utils.init_distributed_mode(args)
  File "/content/drive/MyDrive/MOTRv2/exps/motrv2/run11/util/misc.py", line 442, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

I tried running with TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1, and I also checked the CUDA version and the visible CUDA devices. I have tried several approaches but could not solve the problem.
Thank you in advance.
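
For anyone reproducing this: "invalid device ordinal" from torch.cuda.set_device usually means the requested GPU index is not smaller than the number of visible devices. The interleaved, duplicated traceback above suggests that more than one worker process was launched, while a Colab T4 runtime typically exposes a single GPU. That reading is an assumption, not something confirmed in this thread. A minimal sketch of the check follows; `check_device_ordinal` is a hypothetical helper, not part of MOTRv2 or PyTorch.

```python
import os


def check_device_ordinal(local_rank: int, device_count: int) -> str:
    """Return a short diagnosis for a requested GPU index (hypothetical helper)."""
    if device_count == 0:
        return "no CUDA device visible"
    if 0 <= local_rank < device_count:
        return "ok"
    return (
        f"invalid device ordinal: rank {local_rank} requested, "
        f"but only indices 0..{device_count - 1} exist"
    )


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check

        count = torch.cuda.device_count()
    except ImportError:
        count = 0  # without PyTorch we cannot query the GPUs
    # The elastic launcher normally exports LOCAL_RANK for each worker;
    # default to 0 when the script is run directly.
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    print(check_device_ordinal(rank, count))
```

If the check reports an out-of-range rank, lowering the number of processes per node in the launch command to match the visible device count would be the first thing to try.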