tianlan6767 opened 6 months ago
The config: rtmdet-ins_m_8xb32-300e_coco_used.txt
Just speculating here, but the second GPU seems almost at capacity; I have had memory issues that froze training. Have you tried training with fewer samples past 300? I would suggest changing the batch size to batch_size=8 (a multiple of 2) and trying again.
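In an MMDetection 3.x style config that change is roughly the following (a sketch only; the num_workers value is an assumption, not taken from the original config):

```python
# Sketch: override the training dataloader in an MMDetection 3.x config.
# batch_size is per GPU; 8 replaces the 32 implied by rtmdet-ins_m_8xb32-300e_coco.
train_dataloader = dict(
    batch_size=8,   # suggested smaller per-GPU batch size
    num_workers=4,  # assumed value, tune to your machine
)
```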
This seems to be intermittent; I trained the same configuration for 500 epochs last night.
I added a RandomCrop to the default config, which caused distributed training to wait indefinitely.
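For context, the kind of change being described would look roughly like this in the training pipeline (a sketch only; the crop size and the surrounding transforms are assumptions, not the exact config used):

```python
# Sketch: a train_pipeline with a RandomCrop added, as described above.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='RandomCrop', crop_size=(640, 640)),  # the added transform
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='PackDetInputs'),
]
```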
I used the solov2 and cascade-mask-rcnn_r50_fpn_ms-3x_coco configs, with the configuration adjusted for multi-GPU training, and everything works fine! Why does using the RTMDet instance segmentation config cause the code to wait indefinitely? @RangiLyu Please take a look at this bug!
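For reference, "adjusted for multi-GPU training" typically means an override along these lines in an MMDetection 3.x config (the values below are illustrative assumptions, not the exact settings used here):

```python
# Sketch: multi-GPU adjustment in an MMDetection 3.x config (values assumed).
# Running on 2 GPUs instead of 8 shrinks the effective batch size, so the
# learning rate can be rescaled against the schedule's original setup.
auto_scale_lr = dict(enable=True, base_batch_size=256)  # 8 GPUs x 32 images per GPU
```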
I trained the model in a Docker environment and tried multiple versions from 3.0.0 to 3.3.0; after fine-tuning the configs, the GPUs waited indefinitely. @RangiLyu
Can anyone give me an answer or a solution?
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
When training the model on GPUs 0 and 1, the program gets stuck inexplicably and it is not clear what the reason is. Please help take a look: 20240425_154336.log 20240425_154336.json
Reproduction
Did you make any modifications on the code or config? Did you understand what you have modified?
What dataset did you use?
Environment
Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here. You may also list other environment variables that may be related ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
Error traceback
If applicable, paste the error traceback here.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!