hahapt opened this issue 1 year ago
We tried training on 8 V100 GPUs without checkpointing and did not encounter a CUDA OOM issue.
If OOM occurs in the middle of an epoch, it may be that MOTR accumulates too many false-positive track queries, resulting in large decoder memory consumption. You could try the --use_checkpoint
argument on V100 GPUs as well.
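
For context, gradient checkpointing trades extra compute for memory: activations inside the wrapped layers are recomputed during the backward pass instead of being stored. The snippet below is only a minimal PyTorch sketch of what a flag like `--use_checkpoint` typically enables, not MOTR's actual implementation; the `DecoderLayer`/`Decoder` names, layer sizes, and the `use_checkpoint` parameter are illustrative assumptions.

```python
# Hypothetical sketch of checkpointing decoder layers to reduce peak memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class DecoderLayer(nn.Module):
    """Illustrative transformer-style block (not MOTR's real decoder layer)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)


class Decoder(nn.Module):
    def __init__(self, num_layers: int = 6, use_checkpoint: bool = False):
        super().__init__()
        self.layers = nn.ModuleList(DecoderLayer() for _ in range(num_layers))
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.use_checkpoint and self.training:
                # Activations of this layer are recomputed in the backward pass,
                # lowering peak memory when many track queries are active.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x


if __name__ == "__main__":
    queries = torch.randn(300, 256, requires_grad=True)  # e.g. many track queries
    out = Decoder(use_checkpoint=True)(queries)
    out.sum().backward()
```

The trade-off is roughly one extra forward pass through the checkpointed layers per training step, which is usually acceptable when the alternative is running out of GPU memory.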
This problem appears right at the start, so we cannot begin training at all. Which settings can we reduce to fit within the available GPU memory?
When I try to train it with 4 V100 GPUs, I get CUDA OUT OF MEMORY during epoch 0. I set nproc_per_node=4 in train.sh; is there anything wrong?