we11as22 opened this issue 1 year ago
@AlexeySudakovB01-109 Thank you for the feedback. We will work on fixing it as soon as possible.
any update?
Looking forward to further work!
@we11as22 @hhaAndroid I am facing the same issue when training Mask2Former with --amp, and I found a solution in huggingface/transformers#21644. As proposed there, the failure is caused by how scipy.optimize.linear_sum_assignment handles infinite values; replacing them with very large finite numbers fixes it. I added the following code before line 125:
```python
# Clamp the assignment cost to a large finite range so that +/-inf entries
# do not break scipy.optimize.linear_sum_assignment.
cost = torch.minimum(cost, torch.tensor(1e10))
cost = torch.maximum(cost, torch.tensor(-1e10))
```
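For context, here is a minimal sketch of where that clamp sits, i.e. just before the cost matrix is handed to scipy.optimize.linear_sum_assignment. This is not the actual mmdet assigner code; the cost terms and function signature are placeholders:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_cost: torch.Tensor, mask_cost: torch.Tensor,
                    dice_cost: torch.Tensor):
    """Toy matcher: combine the cost terms and solve the assignment."""
    cost = cls_cost + mask_cost + dice_cost  # (num_queries, num_gts)
    # Under AMP some cost entries can become +/-inf; clamp them to large
    # finite values so linear_sum_assignment does not fail.
    cost = torch.minimum(cost, torch.tensor(1e10))
    cost = torch.maximum(cost, torch.tensor(-1e10))
    # linear_sum_assignment needs a finite NumPy array on the CPU.
    row_ind, col_ind = linear_sum_assignment(cost.detach().cpu().numpy())
    return row_ind, col_ind
```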
But when I tried to train with amp again, I encountered the NaN mask loss problem as follows:
Further debugging revealed an issue in the calculation of loss_mask in this line. In amp mode, multiplying the FP16 variable num_total_masks by the large integer self.num_points (12544 in my configs) overflows FP16, which then leads to the NaN mask loss.
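To see why this overflows: FP16 can only represent values up to 65504, so even a handful of masks multiplied by 12544 points exceeds that range. A quick check (the mask count of 6 is just an illustrative value, not from the original report):

```python
import torch

num_total_masks = torch.tensor(6.0, dtype=torch.float16)  # FP16 under AMP
num_points = 12544                                         # self.num_points

avg_factor = num_total_masks * num_points
print(torch.finfo(torch.float16).max)  # 65504.0, the largest finite FP16 value
print(avg_factor)  # tensor(inf, dtype=torch.float16): 6 * 12544 = 75264 overflows
```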
By casting num_total_masks to FP32 with avg_factor=num_total_masks.float() * self.num_points, I managed to eliminate this issue. The model can now be trained successfully in amp mode, but the accuracy is still under validation 😂.
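As a sketch of that change (the helper name and the commented usage are illustrative, not the exact mmdet code):

```python
import torch

def mask_loss_avg_factor(num_total_masks: torch.Tensor, num_points: int) -> torch.Tensor:
    """Return the averaging factor for the point-sampled mask loss in FP32.

    Casting to FP32 before multiplying avoids the FP16 overflow described
    above (e.g. 6 * 12544 = 75264 > 65504, the FP16 maximum).
    """
    return num_total_masks.float() * num_points

# Inside the loss computation this would be used roughly as:
#   loss_mask = self.loss_mask(preds, targets,
#                              avg_factor=mask_loss_avg_factor(num_total_masks,
#                                                              self.num_points))
```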
Hope this is helpful to you!
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
A clear and concise description of what the bug is.
Reproduction
Environment
Run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here, along with any related environment variables ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
Error traceback
If applicable, paste the error traceback here.
Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!