jinpeng0528 / STAR

Code release for "Saving 100x Storage: Prototype Replay for Reconstructing Training Sample Distribution in Class-Incremental Semantic Segmentation" (NeurIPS 2023)

DDP training hangs when computing prototypes #2

Open zhengyuan-xie opened 4 months ago

zhengyuan-xie commented 4 months ago

Hi, thank you for your work. I tried to reproduce the results with 4 GPUs and set the number of epochs to 1 for debugging. However, the process hangs when computing prototypes after the base step's training, with GPU utilization stuck at 100%. Do you have any suggestions? Thanks!

jinpeng0528 commented 4 months ago

It seems this happens because I forgot to add self.model.eval() before computing the prototypes (and noise), so some internal operations of DeepLabV3 may not behave well under DDP training. Adding self.model.eval() at the beginning of the compute_prototypes and compute_noise methods, and self.model.train() at the end of them, solves this problem. For details, please see the updated "base/base_trainer.py".
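For readers hitting the same hang, here is a minimal sketch of the wrapping described above. Only the eval()/train() calls correspond to the actual fix; the loader loop, `self.device`, and the prototype accumulation are placeholders, and the real logic lives in base/base_trainer.py.

```python
import torch

def compute_prototypes(self, loader):
    # Fix described above: run the prototype pass in eval mode so train-time
    # layers (e.g. BatchNorm under DDP) don't wait on synchronization that the
    # other ranks never perform, which can hang the job.
    self.model.eval()
    prototypes = {}
    with torch.no_grad():
        for images, labels in loader:  # hypothetical loop; details differ in the repo
            images = images.to(self.device)
            features = self.model(images)
            # ... accumulate per-class feature means into `prototypes` ...
    # Restore train mode so the following incremental steps keep training normally.
    self.model.train()
    return prototypes
```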

Thank you for your attention to our work and for finding this issue.

zhengyuan-xie commented 4 months ago

I'll try it and thanks again for your fast reply!

zhengyuan-xie commented 4 months ago

Sorry to bother you, but I have another problem: the process hangs after the mIoU is reported, as follows:

(screenshot: log output showing the process hanging after the mIoU report)

Could you give me some advice?

jinpeng0528 commented 4 months ago

Sorry for the late reply. I've been quite busy lately.

I took a look at this issue. Although the process hangs after it has completed, the necessary data and checkpoints have already been saved by that point, so you can manually terminate the program.

I'll try to address this issue when I have some free time in the future. Thank you for bringing it to my attention.
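This is not the maintainer's confirmed fix, but one common reason a DDP job hangs after all ranks have finished their work is that the process group is never torn down. A hedged sketch of explicit cleanup at the end of the entry script, assuming torch.distributed is initialized in the usual way:

```python
import torch.distributed as dist

def shutdown_ddp():
    # Speculative workaround, not verified against this repository: wait for
    # every rank to finish (e.g. saving checkpoints), then release the process
    # group so all worker processes can exit instead of hanging.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()
```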

zhengyuan-xie commented 4 months ago

Thanks!