zhengzehong331 opened 12 months ago
Thank you for your great work! I met this problem when I trained the model with the hmdb_51 dataset:
```
[2023-09-17 14:18:47 ViT-B/16](main.py 181): INFO Train: [0/50][0/3383] eta 0:49:54 lr 0.000000000 time 0.8851 (0.8851) tot_loss 2.6029 (2.6029) mem 8942MB
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 278, in <module>
    main(config)
  File "main.py", line 104, in main
    train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
  File "main.py", line 144, in train_one_epoch
    images, label_id = mixup_fn(images, label_id)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 57, in __call__
    **kwargs)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 214, in do_blending
    return self.do_mixup(imgs, label)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 202, in do_mixup
    mixed_imgs = lam * imgs + (1 - lam) * imgs[rand_index, :]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdb60c737d2 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x267df7a (0x7fdbb3c92f7a in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x301898 (0x7fdc1608c898 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7fdb60c5c005 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1edf69 (0x7fdc15f78f69 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4e5818 (0x7fdc16270818 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7fdc16270b19 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: /root/miniconda3/envs/xclip/bin/python() [0x4a0a87]
frame #8: /root/miniconda3/envs/xclip/bin/python() [0x4b0858]
frame #9: /root/miniconda3/envs/xclip/bin/python() [0x4c5b50]
frame #10: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #11: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #12: /root/miniconda3/envs/xclip/bin/python() [0x4946f7]
frame #13: PyDict_SetItemString + 0x61 (0x499261 in /root/miniconda3/envs/xclip/bin/python)
frame #14: PyImport_Cleanup + 0x89 (0x56f719 in /root/miniconda3/envs/xclip/bin/python)
frame #15: Py_FinalizeEx + 0x67 (0x56b1a7 in /root/miniconda3/envs/xclip/bin/python)
frame #16: /root/miniconda3/envs/xclip/bin/python() [0x53fc79]
frame #17: _Py_UnixMain + 0x3c (0x53fb3c in /root/miniconda3/envs/xclip/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fdc1897d083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: /root/miniconda3/envs/xclip/bin/python() [0x53f9ee]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 15424) of binary: /root/miniconda3/envs/xclip/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
main.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-17_14:18:51
  host      : autodl-container-7850119152-163467d4
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 15424)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 15424
======================================================
```
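As the trace itself notes, device-side asserts are reported asynchronously, so the Python stack above (which blames the image-mixing line) may not point at the op that actually failed. One way to get an accurate trace is to force synchronous kernel launches; a minimal sketch, assuming the variable is set before any CUDA work happens:

```python
# Minimal sketch: force synchronous CUDA kernel launches so the Python
# traceback points at the op that actually failed. CUDA_LAUNCH_BLOCKING
# is read when the CUDA context is created, so it must be set before any
# CUDA call (setting it before importing torch is the safe convention).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the variable)
```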
I was very confused and tried many approaches, but couldn't solve it.
My GPU is a single NVIDIA GeForce RTX 2080 Ti.
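The assertion `idx_dim >= 0 && idx_dim < index_size` comes from a scatter/gather kernel, and in this pipeline the integer labels are converted to one-hot (a scatter) before blending.py mixes them, so a likely cause is a label id outside [0, num_classes), e.g. a class count in the config that still matches another dataset rather than HMDB-51's 51 classes. A minimal diagnostic sketch along those lines; `check_label_range` is a hypothetical helper, not part of X-CLIP:

```python
import torch

def check_label_range(label_id: torch.Tensor, num_classes: int) -> None:
    """Hypothetical helper: fail fast on CPU if any label id would be out
    of bounds for the one-hot scatter performed before mixup."""
    bad = (label_id < 0) | (label_id >= num_classes)
    if bad.any():
        raise ValueError(
            f"label ids {label_id[bad].unique().tolist()} are out of range "
            f"for num_classes={num_classes}"
        )

num_classes = 51                      # HMDB-51 has 51 classes
labels = torch.tensor([3, 50, 51])    # 51 is one past the valid range

try:
    check_label_range(labels, num_classes)
except ValueError as err:
    print(err)  # label ids [51] are out of range for num_classes=51

# On CUDA, the same out-of-range label fed to a scatter would instead
# trigger the opaque device-side assert seen in the log above.
```

If a check like this fires on the real data loader, aligning the dataset's label ids with the configured number of classes (51 for HMDB-51) should make the assert go away.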
I ran into the same problem as well. How did you solve it?