hahapt opened this issue 1 year ago
We tried training on 8 V100 GPUs without checkpointing and did not encounter a CUDA OOM issue.
If OOM occurs in the middle of an epoch, it may be that MOTR accumulates too many false-positive track queries, resulting in large decoder memory consumption. You could try the --use_checkpoint
argument on V100 GPUs as well.
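
For context, gradient checkpointing trades extra compute for memory: activations inside the wrapped layers are recomputed during the backward pass instead of being stored. The snippet below is only a minimal PyTorch sketch of what a flag like `--use_checkpoint` typically enables, not MOTR's actual implementation; the `DecoderLayer`/`Decoder` names, layer sizes, and the `use_checkpoint` parameter are illustrative assumptions.

```python
# Hypothetical sketch of checkpointing decoder layers to reduce peak memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class DecoderLayer(nn.Module):
    """Illustrative transformer-style block (not MOTR's real decoder layer)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)


class Decoder(nn.Module):
    def __init__(self, num_layers: int = 6, use_checkpoint: bool = False):
        super().__init__()
        self.layers = nn.ModuleList(DecoderLayer() for _ in range(num_layers))
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.use_checkpoint and self.training:
                # Activations of this layer are recomputed in the backward pass,
                # lowering peak memory when many track queries are active.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x


if __name__ == "__main__":
    queries = torch.randn(300, 256, requires_grad=True)  # e.g. many track queries
    out = Decoder(use_checkpoint=True)(queries)
    out.sum().backward()
```

The trade-off is roughly one extra forward pass through the checkpointed layers per training step, which is usually acceptable when the alternative is running out of GPU memory.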
This problem appears right at the start, so we cannot begin training at all. Which settings can we reduce to fit within the available GPU memory?
When I try to train it with 4 V100 GPUs, I get CUDA OUT OF MEMORY during epoch 0. I set nproc_per_node=4 in train.sh; is there anything wrong?