open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

YOLOX intermittently raises RuntimeError #549 #9761

Open hhaAndroid opened 1 year ago

hhaAndroid commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x

Environment

mmyolo 0.4.0+dev

Reproduces the problem - code sample

...

Reproduces the problem - command or script

..

Reproduces the problem - error message

File "pt19cu111/lib/python3.8/site-packages/mmdet/models/task_modules/assigners/sim_ota_assigner.py", line 204, in dynamic_k_matching
    _, pos_idx = torch.topk(                      
RuntimeError: CUDA error: device-side assert triggered                                               
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                               
[W CUDAGuardImpl.h:112] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'                                      

Additional information

No response

tojimahammatov commented 1 year ago

Hi, can you run the same command with CUDA_LAUNCH_BLOCKING=1 set? That will give a more accurate error message and stack trace. Possible causes include invalid input/target values, wrong values being passed to the loss function, etc. Please post the error message you get after setting CUDA_LAUNCH_BLOCKING=1.
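For reference, here is a minimal sketch of setting the variable from Python before torch is imported (the more common approach is simply prefixing the training command, e.g. `CUDA_LAUNCH_BLOCKING=1 python tools/train.py <your-config>`); the snippet is illustrative, not taken from the original report:

```python
# Illustrative sketch only: force synchronous CUDA kernel launches so the
# device-side assert is reported at the call that actually triggered it.
# The variable must be set before the CUDA context is created (i.e. before
# the first CUDA call), so set it before importing torch.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after the env var on purpose
```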

mattiasegu commented 1 year ago

Hi @hhaAndroid, did you find a solution?

I'm facing the same problem. I'm now launching again with CUDA_LAUNCH_BLOCKING=1 and I'll report the error message here.

MingChaoXu commented 1 year ago

Did you solve this problem? I ran into this error too.

mattiasegu commented 1 year ago

Hi @MingChaoXu, in my case it turned out to be a numerical instability issue. I solved it by changing the loss weight. Possible fixes include changing the loss weight, reducing the learning rate, or increasing the strength of gradient clipping (a lower max norm).

Hope it helps!
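For concreteness, a hedged sketch of where those knobs typically sit in an MMDetection 3.x-style YOLOX config; the keys mirror the stock YOLOX config, and the numbers are placeholders rather than tuned values:

```python
# Illustrative placeholders only; the point is where each knob lives,
# not the specific numbers.
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=5e-4),  # reduced lr
    clip_grad=dict(max_norm=10, norm_type=2))  # stronger clipping = smaller max_norm

model = dict(
    bbox_head=dict(
        # Lowering loss_weight on the term that blows up can tame the
        # numerical instability mentioned above.
        loss_bbox=dict(type='IoULoss', mode='square', eps=1e-16,
                       reduction='sum', loss_weight=5.0)))
```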

ZeeRizvee commented 3 months ago

I ran into a similar error. It was resolved by reducing the value of the base learning rate (base_lr).
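As a hedged illustration, mmyolo-style YOLOX configs expose this as a `base_lr` variable, so the change amounts to something like:

```python
# Illustrative only: reduce the default base learning rate (e.g. halve it,
# and reduce further if the assert still appears).
base_lr = 0.01 * 0.5
optim_wrapper = dict(optimizer=dict(lr=base_lr))
```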