mihaidusmanu / d2-net

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

loss is nan after 12 epochs #57

Closed · bozhenhhu closed this 4 years ago

bozhenhhu commented 4 years ago

@mihaidusmanu Thank you for your great work. I used your code to train on another, smaller dataset, but at the end of the 12th epoch the loss becomes NaN. I printed the length of `ids` and the sums of `scores1` and `scores2` as follows:

    print('ids.shape{},scores1:{},score2:{}'.format(
        len(ids.cpu().numpy()), a.detach().numpy(), b.detach().numpy()))

[screenshot of the training log]

Why does `scores1` turn into NaN first?
Thank you for your reply.

bozhenhhu commented 4 years ago

    02:03, 2.92it/s, loss=0.9447] ids.shape703,scores1:0.6521570086479187,score2:0.6943821907043457
    10%|█  | 40/400 [00:13<02:03, 2.90it/s, loss=0.9428] ids.shape604,scores1:0.5749630331993103,score2:0.5802540183067322
    10%|█  | 41/400 [00:14<02:03, 2.92it/s, loss=0.9444] ids.shape570,scores1:0.5485771894454956,score2:0.5613906383514404
    10%|█  | 42/400 [00:14<02:04, 2.88it/s, loss=0.9458] ids.shape570,scores1:0.5510222911834717,score2:0.5639211535453796
    11%|█  | 43/400 [00:14<02:01, 2.94it/s, loss=0.9458] ids.shape858,scores1:0.8438418507575989,score2:0.8490405678749084
    11%|█  | 44/400 [00:14<02:00, 2.95it/s, loss=0.9416] ids.shape742,scores1:0.7233006954193115,score2:0.7236191034317017
    11%|█▏ | 45/400 [00:15<02:00, 2.95it/s, loss=0.9428] ids.shape703,scores1:nan,score2:0.6923336386680603
    12%|█▏ | 46/400 [00:15<01:59, 2.96it/s, loss=nan] ids.shape742,scores1:nan,score2:nan
    12%|█▏ | 47/400 [00:15<01:59, 2.95it/s, loss=nan] ids.shape604,scores1:nan,score2:nan
    12%|█▏ | 48/400 [00:16<02:02, 2.88it/s, loss=nan] ids.shape604,scores1:nan,score2:nan
    12%|█▏ | 49/400 [00:16<01:59, 2.93it/s, loss=nan] ids.shape604,scores1:nan,score2:nan
    13%|█▎ | 50/400 [00:17<01:59, 2.93it/s, loss=nan] ids.shape858,scores1:nan,score2:nan
    13%|█▎ | 51/400 [00:17<02:00, 2.90it/s, loss=nan] ids.shape570,scores1:nan,score2:nan

mihaidusmanu commented 4 years ago

Hello. There are a lot of reasons why this could happen (inaccurate training data for instance). However, since in your case it happens so late in training, I suspect it is rather due to a division by zero. You can try adding an epsilon to the denominator in the loss: https://github.com/mihaidusmanu/d2-net/blob/2a4d88fbe84961a3a17c46adb6d16a94b87020c5/lib/loss.py#L126-L129

        loss = loss + (
            torch.sum(scores1 * scores2 * F.relu(margin + diff)) /
            (torch.sum(scores1 * scores2) + 1e-5)
        )
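To see why the epsilon matters, here is a minimal standalone sketch (not the repository's actual loss code; `penalty` is a stand-in for `F.relu(margin + diff)`): when every product in `scores1 * scores2` is zero for a batch, the denominator sums to zero and the 0/0 division yields NaN, which then poisons the running loss. With the epsilon, the term degrades gracefully to 0 instead.

```python
import torch

# Hypothetical per-correspondence score tensors; an all-zero batch
# is the degenerate case that triggers the NaN.
scores1 = torch.zeros(5)
scores2 = torch.zeros(5)
penalty = torch.rand(5)  # stand-in for F.relu(margin + diff)

# Without the epsilon: sum is 0, so 0 / 0 -> nan.
bad = torch.sum(scores1 * scores2 * penalty) / torch.sum(scores1 * scores2)
print(torch.isnan(bad).item())  # True

# With the epsilon: 0 / 1e-5 -> 0, a finite loss term.
good = torch.sum(scores1 * scores2 * penalty) / (torch.sum(scores1 * scores2) + 1e-5)
print(good.item())  # 0.0
```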

bozhenhhu commented 4 years ago

@mihaidusmanu Thank you very much. I inspected the weights after 11 epochs and found they had grown very large, so I added learning rate decay and weight decay during training. The NaN problem disappeared.
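For reference, one common way to wire up both of those in PyTorch (a hedged sketch: the optimizer choice, `weight_decay=1e-4`, and the `StepLR` schedule are illustrative values, not the settings actually used above):

```python
import torch

# Stand-in model; in practice this would be the D2-Net network.
model = torch.nn.Linear(8, 2)

# weight_decay applies L2 regularization each step, which discourages
# the weights from growing without bound.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Multiply the learning rate by 0.5 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):
    # ... forward pass, loss.backward(), optimizer.step() per batch ...
    scheduler.step()

# After 20 epochs the learning rate has been halved twice: 1e-3 -> 2.5e-4.
print(optimizer.param_groups[0]['lr'])
```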