Open zhengyuan-xie opened 9 months ago
It seems that because I forgot to add self.model.eval() before computing prototypes (and noise), some internal operations of DeepLabV3 may not work well with DDP training. Adding self.model.eval() at the beginning of the compute_prototypes and compute_noise methods, and adding self.model.train() at the end of them, will solve this problem. For details, please see the updated "base/base_trainer.py".
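For anyone applying the same fix elsewhere, here is a minimal sketch of the eval/train toggle described above, written as a context manager so the previous mode is restored even if the computation raises. This is not the code from "base/base_trainer.py". The eval_mode helper, the TinyModel stand-in, and the simplified Trainer are hypothetical names used only for illustration. TinyModel mimics the training flag, train(mode), and eval() of torch.nn.Module so the sketch runs without PyTorch; a real nn.Module should work in its place.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily switch `model` to eval mode, then restore its previous mode."""
    was_training = model.training  # torch.nn.Module exposes this boolean flag
    model.eval()
    try:
        yield model
    finally:
        # Restore whatever mode the model was in before entering the block.
        model.train(was_training)


class TinyModel:
    """Hypothetical stand-in that mimics the torch.nn.Module mode API."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


class Trainer:
    """Simplified, hypothetical trainer; only the mode handling is shown."""

    def __init__(self, model):
        self.model = model

    def compute_prototypes(self):
        with eval_mode(self.model):
            # The real prototype computation would go here,
            # typically under torch.no_grad() as well.
            return self.model.training


trainer = Trainer(TinyModel())
mode_during = trainer.compute_prototypes()
print(mode_during, trainer.model.training)
```

Compared with calling self.model.eval() and self.model.train() by hand, the context manager does not force train mode on a model that was already in eval mode before the call.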
Thank you for your attention to our work and for finding this issue.
I'll try it and thanks again for your fast reply!
Sorry to bother you, but I have another problem: the process hangs after the mIoU is reported, as follows:
Could you give me some advice?
Sorry for the late reply. I've been quite busy lately.
I took a look at this issue, and although the process hangs after it's completed, the necessary data and checkpoints have already been saved. So you can manually terminate the program.
I'll try to address this issue when I have some free time in the future. Thank you for bringing it to my attention.
Thanks!
Hi, thank you for your work. I tried to reproduce the results on 4 GPUs, with the number of epochs set to 1 for debugging. However, the process hangs when computing prototypes after the base step's training, and GPU utilization reaches 100%. Do you have any suggestions? Thanks!