kishore-greddy opened this issue 3 years ago
Hi @kishore-greddy
1-2. The uncertainty is usually left unbounded, but that might lead your network to instability. As you noticed in your "EDIT", if you model the log-uncertainty you should fix the problem.
Hope this helps ;)
Hey @mattpoggi ,
Thanks for the quick reply. I will try this out.
Hi @mattpoggi ,
Forgot to ask: have you also tried the other method? Meaning, keeping the uncertainty values greater than 0 in the decoder and actually modelling the uncertainty itself instead of log(uncertainty), so that my loss function in 3) works. I read about negative log-likelihood minimization, and a lot of people talk about taking the log in the loss rather than modelling the log-uncertainty itself.
Quoting from Lakshminarayanan et al., "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles", one of the papers referenced in your research: there they talk about enforcing a variance greater than 0. Could you please clarify? Thanks
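For reference, the positivity constraint in Lakshminarayanan et al. is typically implemented with a softplus on the variance head. Below is a minimal sketch of that idea; the function and variable names are illustrative and not from this repository, and the exact weighting should be checked against the paper being followed:

```python
import torch
import torch.nn.functional as F

def direct_variance_loss(photo, var_logits, eps=1e-6):
    """NLL-style loss when the network predicts the variance directly.

    photo:      per-pixel residual, shape [B, 1, H, W]
    var_logits: raw decoder output for the uncertainty head, same shape
    """
    var = F.softplus(var_logits) + eps   # softplus + eps keeps the variance strictly positive
    loss = photo / var + torch.log(var)  # residual weighted by the variance, plus a log penalty
    return loss.mean()
```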
I made some experiments by bounding the uncertainty in 0-1 with a sigmoid layer and adding the log term in the loss function, as you mentioned. The same strategy is used in the D3VO paper (https://vision.in.tum.de/research/vslam/d3vo). The numbers were almost identical in the two formulations. I believe the important thing is just to avoid exploding gradients and unstable behaviors.
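A minimal sketch of the sigmoid-bounded variant described above; the names are illustrative and the exact weighting used in D3VO may differ:

```python
import torch

def bounded_uncertainty_loss(photo, uncert_logits, eps=1e-3):
    """Weight a per-pixel residual by an uncertainty bounded in (0, 1) via a sigmoid.

    photo:         per-pixel photometric residual, shape [B, 1, H, W]
    uncert_logits: raw decoder output for the uncertainty head, same shape
    """
    sigma = torch.sigmoid(uncert_logits)                    # bounded in (0, 1), so it cannot explode
    loss = photo / (sigma + eps) + torch.log(sigma + eps)   # log term penalizes inflating sigma everywhere
    return loss.mean()
```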
Hey @mattpoggi ,
I tried to model the log-uncertainty as you suggested, without bounding the uncertainty to any range, and I am running into an exploding gradients problem. I have updated my loss function to the one below,
After some iterations, in the first epoch itself, I face issues; please have a look at the image below,
Notice the loss just before I run into problems. Did you ever have to deal with something like this? Any hint is appreciated, thanks.
EDIT: I managed to set a breakpoint just before the gradients exploded. I added a new image which shows the minimum value of the output uncertainties (in fact, log-uncertainties) for all images in the batch. As you can see, the minimum value coming out of the output channel is -33.99; taking exp(-33.99) gives a value on the order of 10^-15, and this being in the denominator causes the loss value to blow up. I tried to find out why this happens, but I am not quite sure. Any guidance is highly appreciated. Thanks
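For readers following the thread, here is a minimal sketch of the log-uncertainty formulation under discussion, with a clamp on the predicted log-sigma added as one possible safeguard against the extreme values described above. The clamp and all names are illustrative additions, not part of the original code:

```python
import torch

def log_uncertainty_loss(photo, log_sigma, clamp_range=(-10.0, 10.0)):
    """NLL-style loss when the network predicts s = log(sigma) directly.

    photo:     per-pixel photometric residual, shape [B, 1, H, W]
    log_sigma: raw decoder output interpreted as log-uncertainty, same shape
    """
    s = torch.clamp(log_sigma, *clamp_range)  # guard against extreme predictions such as -34
    loss = photo * torch.exp(-s) + s          # algebraically photo / sigma + log(sigma)
    return loss.mean()
```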
That's quite weird, I actually never had a problem with gradients... Does this occur in every training run? Does it occur even if you use the sigmoid trick? Anyway, before the gradients explode, the loss values are very similar to the ones I had seen during my experiments.
Hi @mattpoggi , I observed that this occurs in almost every training run of the log model. I have tried it 3 times now, and every time I have this problem. Sometimes the problem occurs at the 5th epoch, sometimes in the 1st epoch itself, so it is not consistent. However, as I showed with the log-uncertainty values just before the gradients start to explode, the minimum value is -33; the network is predicting this value at some pixel. I am not sure why this problem is so random, and I am even more surprised that you did not face any issues like this. My decoder is almost the same as yours, and I have also posted my loss function. Do you see an issue there? That is the only thing that is different. I have not used the sigmoid trick yet, as I wanted to train the model the same way you did.
You upsampled the uncertainty to the proper resolution scale, right? I can dig more into this after the CVPR rebuttal occurring this week... Just a few questions: 1) are you training on KITTI? 2) are you using M, S or MS?
Do you mean scaling the uncertainty to full resolution before calculating the loss? Yes, I have done that.
If you mean upsampling the uncertainties in the decoder, yes, I have done that too:
```python
import numpy as np
import torch
import torch.nn as nn

from collections import OrderedDict
from layers import ConvBlock, Conv3x3, upsample  # monodepth2 building blocks


class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True, use_uncert=False):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales
        self.use_uncert = use_uncert

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)
            if self.use_uncert:
                # extra 3x3 head for the uncertainty output at each scale
                self.convs[("uncertconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            x = self.convs[("upconv", i, 0)](x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                self.outputs[("disp", i)] = self.sigmoid(self.convs[("dispconv", i)](x))
                if self.use_uncert:
                    # note: no activation on the uncertainty head, so its range is unbounded
                    self.outputs[("uncert", i)] = self.convs[("uncertconv", i)](x)

        return self.outputs
```
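For context, a small sketch of how this decoder might be exercised standalone. It assumes the monodepth2 layers module is importable; the channel counts and input resolution below mirror a monodepth2 ResNet-18 encoder and are purely illustrative:

```python
import torch

# Channel counts of a monodepth2 ResNet-18 encoder (illustrative)
num_ch_enc = [64, 64, 128, 256, 512]
decoder = DepthDecoder(num_ch_enc, scales=range(4), use_uncert=True)

# Dummy feature pyramid for a 192x640 input, at strides 2 to 32
features = [torch.randn(2, c, 192 // 2 ** (i + 1), 640 // 2 ** (i + 1))
            for i, c in enumerate(num_ch_enc)]

outputs = decoder(features)
print(outputs[("disp", 0)].shape)    # torch.Size([2, 1, 192, 640]), passed through a sigmoid
print(outputs[("uncert", 0)].shape)  # same shape, but unbounded (no activation applied)
```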
1) I am training on the eigen_zhou split of the KITTI dataset (the monodepth2 default). 2) I am training the M model.
Everything looks good. I'll try to take a look at it next week.
Thanks :) I'll be waiting for your inputs.
I launched a single training run and it finished without issues. I'll try a few more times.
Okay, let me know how it goes.
Hi,
Wonderful work, and thanks for sharing the code. I'm working on training the model with the log loss to estimate uncertainty, but I'm facing the exploding gradient issue.
Have you fixed the exploding gradient issue with the log loss?
Thanks!
Hi, sorry for the late reply. Are you trying to estimate the log-uncertainty as we mentioned in the previous comments? Among them, we also mentioned using a sigmoid in place of modeling the log-uncertainty (https://github.com/mattpoggi/mono-uncertainty/issues/13#issuecomment-761558919). I used this in some follow-up works and it seems extremely stable, while giving equivalent results.
@kishore-greddy @IemProg one of the reasons might be the batch size you're using. I had a similar experience in another framework where training becomes unstable with a small batch size (like 1 or 2). If you use a batch size different from the one used in the paper, that might be the issue.
@mattpoggi could you please confirm this by setting the training batch size to 1 and seeing whether you experience exploding/vanishing gradients?
Hey @mattpoggi ,
I was trying to train the log model. I made the necessary changes to the decoder to include the additional channel. When I start training, the initial loss is NaN, and after some batches it is NaN again. While debugging the issue, I stumbled upon this piece of code from your decoder.py
1) In line 81, a sigmoid is used, as in the original code from monodepth2, but I do not see a sigmoid being applied to the uncerts in line 85. Is there any reason for this?
2) I train on the GPU, but for debugging I use the CPU. While debugging on my CPU with batch_size 2 (any larger size causes memory issues), I used breakpoints to inspect the values of uncert.
As seen in the image, the minimum value is negative, and the log of a negative number is NaN. This made me ask the first question: why are the uncerts not clamped between 0 (possibly a tiny bit greater, to avoid inf when the log is taken in the loss function) and 1? Is my understanding right, or have I misunderstood something?
3) My loss function is
EDIT: After reading quite a lot, I feel that my log loss is wrong. Maybe the uncertainties coming out of the output channel are already log(uncertainties), so I would have to correct my loss function as below?
EDIT 2: Would the above edit also hold for the self-teaching loss, meaning the uncertainty outputs are actually log(uncertainties), so I have to take torch.exp() in the loss?
Thanks in advance
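Regarding EDIT 2, here is a minimal sketch of how the self-teaching loss would look under the same log-uncertainty interpretation, i.e. applying torch.exp() to the predicted output. The names are illustrative and the exact form should be checked against the paper's formulation:

```python
import torch

def self_teaching_log_uncertainty_loss(disp_student, disp_teacher, log_sigma):
    """Self-teaching loss when the uncertainty head predicts s = log(sigma).

    disp_student: disparity from the network being trained, shape [B, 1, H, W]
    disp_teacher: disparity from the frozen teacher network, same shape
    log_sigma:    raw uncertainty output interpreted as log-uncertainty, same shape
    """
    residual = torch.abs(disp_student - disp_teacher.detach())
    loss = residual * torch.exp(-log_sigma) + log_sigma   # |d - d_teacher| / sigma + log(sigma)
    return loss.mean()
```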