rulixiang / ToCo

[CVPR 2023] Token Contrast for Weakly-Supervised Semantic Segmentation

NaN loss #8

Open Luffy03 opened 1 year ago

Luffy03 commented 1 year ago

Hi, thanks for your excellent work! I am using an A100 GPU to reproduce your results, but the NaN loss still exists. train.log

AmeenAli commented 1 year ago

I have also faced NaN losses while running on A5000 cards; the issue was solved once I moved to 2080Ti cards (as the original authors reported).

Luffy03 commented 1 year ago

> I have also faced NaN losses while running on A5000 cards; the issue was solved once I moved to 2080Ti cards (as the original authors reported).

Thanks for your kind answer! But I find that the author was able to train ToCo on an A100 GPU (https://github.com/rulixiang/ToCo/blob/main/logs/toco_vit-b_voc_20k.log). I find that it doesn't work on a 3090 GPU either. I want to know why the GPU type matters so much.
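One possible explanation, not confirmed in this thread: the A100, A5000, and 3090 are all Ampere GPUs, on which PyTorch enables TF32 matmuls by default, while the 2080Ti (Turing) does not; the reduced matmul precision can push borderline loss values to NaN. A minimal thing to try, assuming a recent PyTorch build, is to disable TF32 before training:

```python
import torch

# Disable TF32 on Ampere GPUs (A100 / A5000 / 3090) so matmuls and
# cuDNN convolutions run in full FP32, matching the numerics of a 2080Ti.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```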

massica17 commented 1 year ago

Sorry to bother you. I cannot create the environment; it reports "No module named 'bilateralfilter'". I tried pip install and it was successful, but it still reports this error. Can you help me?

Luffy03 commented 1 year ago

> Sorry to bother you. I cannot create the environment; it reports "No module named 'bilateralfilter'". I tried pip install and it was successful, but it still reports this error. Can you help me?

Follow the author's instructions to "build reg loss".

Luffy03 commented 1 year ago

I may have solved the problems. The code can work on 3090 GPUs if I don't use GPU:0; I still don't know why. For the NaN in seg_loss, the cause is that a cropped image may be filled entirely with ignore_index, in which case the loss returns NaN. I changed 'get_seg_loss' as follows. But I reproduce the experiments with only 67.3% mIoU, which is still a large margin below the 69.215% reported in https://github.com/rulixiang/ToCo/blob/main/logs/toco_vit-b_voc_20k.log. (screenshot of the modified get_seg_loss attached)
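The screenshot of the modified loss is not preserved in this thread. A minimal sketch of the kind of guard described above, assuming a standard cross-entropy-based get_seg_loss with ignore_index 255 (the signature and bg/fg handling in the repo may differ):

```python
import torch
import torch.nn.functional as F

def get_seg_loss(pred, label, ignore_index=255):
    # If the random crop contains no valid pixels (everything is ignore_index),
    # cross_entropy averages over zero elements and returns NaN.
    valid = (label != ignore_index)
    if valid.sum() == 0:
        # Return a zero loss that is still connected to the graph,
        # so backward() works on every rank in multi-GPU training.
        return pred.sum() * 0.0
    return F.cross_entropy(pred, label.long(), ignore_index=ignore_index)
```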

ymmm-4 commented 1 year ago

@Luffy03 Can you contact me on WeChat (ID: yahm30)? I need to discuss something regarding the code.

huiqing-su commented 1 year ago

@Luffy03 Can you tell me what 'ce' is in get_seg_loss? Thank you.

Asunatan commented 1 year ago

I tried to debug what causes the loss to become NaN and found that the problem is in CTCLoss_neg, which is a contrastive learning loss. Do you have any solution? The same problem also appears in [Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2305.08491).
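For reference, an InfoNCE-style contrastive term typically goes NaN when the log is taken of a denominator that underflows to zero, which becomes more likely under reduced-precision matmuls. A minimal sketch of the usual numerically stable formulation, assuming cosine-similarity logits with temperature tau; this is only an illustration, not the repo's actual CTCLoss_neg:

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor, positive, negatives, tau=0.1):
    # anchor, positive: (N, D); negatives: (N, K, D)
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / tau       # (N, 1)
    neg_logits = torch.einsum('nd,nkd->nk', anchor, negatives) / tau  # (N, K)

    logits = torch.cat([pos_logit, neg_logits], dim=1)
    # log_softmax subtracts the row max internally, so it stays finite even
    # when a hand-written exp()/sum(exp()) ratio would overflow or hit log(0).
    return -F.log_softmax(logits, dim=1)[:, 0].mean()
```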

YiZhuo-Xu commented 11 months ago

@Luffy03 Thank you for your experiments on this question. When I use one 3090, the loss does not become NaN, but the result is bad. Then I tried two 3090s, and all losses except the seg loss turned to NaN. train.log