open-mmlab / mmtracking

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.
https://mmtracking.readthedocs.io/en/latest/
Apache License 2.0

NAN loss when training on MOT20 #312

Open sjtuytc opened 3 years ago

sjtuytc commented 3 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug When I train on MOT20 without any modifications to the code, the loss is always NaN.
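Independent of the root cause, it helps to fail fast instead of training for hours on a NaN loss. A minimal, framework-agnostic sketch that asserts every logged loss term is finite (the loss names and the `check_losses` helper are illustrative, not part of mmtracking):

```python
import math

def check_losses(losses):
    """Raise early if any loss term is NaN or infinite.

    `losses` maps loss name to a float value, e.g. the dict a training
    loop logs each iteration (names here are illustrative).
    """
    for name, value in losses.items():
        if not math.isfinite(value):
            raise FloatingPointError(f"{name} became non-finite: {value}")

# Healthy iteration: passes silently.
check_losses({"loss_rpn_cls": 0.21, "loss_bbox": 0.48})
```

Calling this on each iteration's loss dict stops the run at the first bad batch, which makes it much easier to inspect the offending inputs.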

Reproduction

  1. What command or script did you run?
bash ./tools/dist_train.sh ./configs/det/faster-rcnn_r50_fpn_8e_mot20-half.py 8 \
--work-dir ./work_dirs/
  2. Did you make any modifications on the code or config? Did you understand what you have modified? No

  3. What dataset did you use and what task did you run? MOT20, training

Environment

  1. Please run python mmtrack/utils/collect_env.py to collect necessary environment information and paste it here.

  2. You may add additional information that may be helpful for locating the problem, such as

    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

GT9505 commented 3 years ago

I used the command bash ./tools/dist_train.sh ./configs/det/faster-rcnn_r50_fpn_8e_mot20-half.py 8 --work-dir ./work_dirs/, and the detector is successfully trained on MOT20. Please refer to the picture.

RoyAn2386 commented 1 year ago

If you still get this error, try reducing your learning rate to 0.0001.
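In MMTracking's Python-based configs, this amounts to overriding the `optimizer` field in a child config. A hypothetical override sketch (the `_base_` path follows the command in the report; the SGD hyperparameters and the gradient-clipping values are assumptions, not taken from the repository):

```python
# Hypothetical child config lowering the learning rate as suggested above.
# Field names follow MMDetection/MMTracking-style Python configs.
_base_ = ['./faster-rcnn_r50_fpn_8e_mot20-half.py']

# Drop the learning rate to 0.0001 (other SGD settings assumed unchanged).
optimizer = dict(type='SGD', lr=0.0001, momentum=0.9, weight_decay=0.0001)

# Gradient clipping is another common guard against NaN losses.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```

Saving this next to the base config and passing it to dist_train.sh in place of the original config would apply both changes without touching the base file.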