open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

When I trained the rtmdet-ins-m model using multiple GPUs, the program froze after 300 epochs of training!!! #11666

Open tianlan6767 opened 6 months ago

tianlan6767 commented 6 months ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug When training with GPUs 0 and 1, the program inexplicably gets stuck. It is not clear what the reason is; please help take a look. Logs: 20240425_154336.log, 20240425_154336.json (screenshot attached)

Reproduction

  1. What command or script did you run?
    bash ./tools/dist_train.sh ./configs/rtmdet/config.py
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

  3. What dataset did you use?

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here (a programmatic sketch follows this list).
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
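
For reference, the first item above can also be run programmatically; a minimal sketch, assuming an MMDetection 3.x install where collect_env is exported from mmdet.utils:

```python
# Print the environment table the issue template asks for.
# Equivalent to running `python mmdet/utils/collect_env.py` from the repo root.
from mmdet.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')
```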

Error traceback If applicable, paste the error traceback here.


Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

tianlan6767 commented 6 months ago

The config: rtmdet-ins_m_8xb32-300e_coco_used.txt

mchaniotakis commented 6 months ago

Just speculating here: the second GPU seems almost at capacity, and I have had memory issues that froze training. Have you tried training with fewer samples past 300? I would change the batch size to batch_size=8 (a multiple of 2) and try again.
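
For reference, a minimal sketch of that change, assuming an MMDetection 3.x config that inherits the rtmdet-ins base (the file name and base path below are placeholders):

```python
# my_rtmdet_ins_override.py -- hypothetical config inheriting the user's base
_base_ = './rtmdet-ins_m_8xb32-300e_coco.py'

# Lower the per-GPU batch size to reduce memory pressure on the fuller GPU.
# MMEngine merges this dict into the inherited train_dataloader, so only the
# batch_size key is overridden.
train_dataloader = dict(batch_size=8)
```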

tianlan6767 commented 6 months ago

> Just speculating here: the second GPU seems almost at capacity, and I have had memory issues that froze training. Have you tried training with fewer samples past 300? I would change the batch size to batch_size=8 (a multiple of 2) and try again.

This seems to be intermittent; I trained the same configuration for 500 epochs last night.

tianlan6767 commented 6 months ago

I added a RandomCrop to the default config, which caused distributed training to wait indefinitely. (screenshots attached)

CONFIG (screenshot attached)
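
For context, a RandomCrop entry in an mmdet 3.x train_pipeline typically looks like the sketch below; the surrounding transforms are illustrative placeholders, not the exact pipeline from the screenshot:

```python
# Illustrative train_pipeline containing a RandomCrop step (mmdet 3.x transforms).
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=False),
    dict(
        type='RandomCrop',
        crop_size=(640, 640),
        recompute_bbox=True,
        # allow_negative_crop=True permits crops that contain no GT instances,
        # which is worth checking when a distributed run hangs.
        allow_negative_crop=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 640), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]
```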

tianlan6767 commented 6 months ago

I used the configs of SOLOv2 and cascade-mask-rcnn_r50_fpn_ms-3x_coco, adjusted them for multi-GPU training, and everything is fine! Why does the RTMDet instance-segmentation config cause the code to wait indefinitely? @RangiLyu Please take a look at this bug!

tianlan6767 commented 6 months ago

I trained the model in a Docker environment and tried multiple versions from 3.0.0 to 3.3.0; after fine-tuning the configs, the GPUs still waited indefinitely. @RangiLyu
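
For reference, a quick sanity check of what is actually installed inside the container (a minimal sketch, assuming the standard OpenMMLab 3.x package names):

```python
# Print the versions of the main OpenMMLab packages active in the container.
import torch
import mmcv
import mmdet
import mmengine

print('torch    :', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmengine :', mmengine.__version__)
print('mmcv     :', mmcv.__version__)
print('mmdet    :', mmdet.__version__)
```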

tianlan6767 commented 6 months ago

Can anyone give me an answer or a solution???