stefanknegt / Probabilistic-Unet-Pytorch

A Probabilistic U-Net for segmentation of ambiguous images implemented in PyTorch
Apache License 2.0

Getting NAN tensor from encoder #21

Closed zabboud closed 2 years ago

zabboud commented 2 years ago

Hello - I've been getting this issue consistently while running the code as is, with the LIDC-IDRI data (downloaded from the provided link). The error is raised at the line dist = Independent(Normal(loc=mu, scale=torch.exp(log_sigma)), 1) in the AxisAlignedConvGaussian class, because mu = tensor([[nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan]], device='cuda:0', grad_fn=<SliceBackward0>).

When I trace back where the nan is coming from, it originates in the encoder (the output of encoding = self.encoder(input)), all the way back to the output of the forward method in the Encoder class.
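In case it helps anyone debugging the same thing, this is roughly how I traced it: a small forward-hook helper (my own code, not part of this repo) that raises as soon as any layer's output contains a NaN.

```python
import torch

# Minimal NaN tracer (hypothetical helper): register a forward hook on every
# submodule and stop at the first one whose output contains NaNs.
def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                raise RuntimeError(f"NaN first appeared in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# add_nan_hooks(net)  # call once on the model before the training loop
```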

The issue persists regardless of batch size (I've run it with batch sizes of 5 and 10, and I still get the error within the first epoch, at a random point after a few runs).

I've verified the input and it looks fine: the images are what is expected (I viewed them), and some masks are all 0's while others have nonzero values. Nothing out of the ordinary.

I have yet to track down why this is occurring. It seems like others have experienced a similar issue, but more on the loss side; the issue I'm seeing is within the forward pass, so it is independent of the loss.

Any insight would be appreciated! The full error is:

ValueError: Expected parameter loc (Tensor of shape (10, 2)) of distribution Normal(loc: torch.Size([10, 2]), scale: torch.Size([10, 2])) to satisfy the constraint Real(), but found invalid values: tensor([[nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan], [nan, nan]], device='cuda:0', grad_fn=<SliceBackward0>)
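For reference, a drop-in sanity check just above the line that fails makes the failure easier to inspect (a sketch; mu and log_sigma are the slices computed inside AxisAlignedConvGaussian.forward):

```python
# Sketch: check the slices before the distribution is constructed, so the
# failure reports tensor statistics instead of the Normal constraint error.
if torch.isnan(mu).any() or torch.isnan(log_sigma).any():
    raise RuntimeError(
        f"NaN before Normal: mu range [{mu.min()}, {mu.max()}], "
        f"log_sigma range [{log_sigma.min()}, {log_sigma.max()}]"
    )
dist = Independent(Normal(loc=mu, scale=torch.exp(log_sigma)), 1)
```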

JasperLinmans commented 2 years ago

Hi Zabboud,

I'm working on my own implementation, so I can't comment on this exact codebase. But here is what I found: changing the initialisation (in this case from nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu') to nn.init.normal_(m.weight, std=0.001)) solves the problem for me.
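Roughly, the change I mean looks like this (a sketch; the isinstance check and the apply pattern are mine, so adapt it to wherever this repo initialises its conv layers):

```python
import torch.nn as nn

# Sketch of the initialisation swap for conv layers.
def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        # was: nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.normal_(m.weight, std=0.001)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

# net.apply(init_weights)
```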

Curious if you found some more information on this in the meantime!

zabboud commented 2 years ago

> Hi Zabboud,
>
> I'm working on my own implementation, so I can't comment on this exact codebase. But here is what I found: changing the initialisation (in this case from nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu') to nn.init.normal_(m.weight, std=0.001)) solves the problem for me.
>
> Curious if you found some more information on this in the meantime!

Actually, decreasing the learning rate fixed the problem for me. I'm still unsure why it happens, though; do you have an idea why? I'd be interested to test out the different initialization!
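Concretely, all I changed was the optimiser's learning rate, something like this (values are illustrative, not the exact ones I used):

```python
import torch

# Sketch: drop the Adam learning rate by roughly an order of magnitude.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5, weight_decay=0)
```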

stefanknegt commented 2 years ago

The model is quite sensitive, and too high a learning rate or certain initialization methods can cause the loss to go to NaN.

zabboud commented 2 years ago

Thank you - I figured that part out. I was wondering whether you have any insight into why, on other datasets, the loss does not seem to decrease even though I can see the predictions improving visually?

stefanknegt commented 2 years ago

Hmm, I think you should look at the two components of the loss function and how they evolve over time. Maybe this can give you some insight into why the loss is not decreasing while the predictions seem to improve.
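Something along these lines in the training loop, logging the reconstruction term and the KL term separately (a sketch assuming the usual net/train_loader/optimizer setup; the net.reconstruction_loss and net.kl attributes are an assumption about what elbo() stores, so adjust if yours differ):

```python
# Sketch: log both ELBO terms so their individual trends are visible even
# when the total loss looks flat (attribute names are assumptions).
for step, (patch, mask, _) in enumerate(train_loader):
    patch, mask = patch.to(device), mask.to(device)
    net.forward(patch, mask, training=True)
    elbo = net.elbo(mask)
    loss = -elbo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: rec={net.reconstruction_loss.item():.2f} "
              f"kl={net.kl.item():.4f} elbo={elbo.item():.2f}")
```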

zabboud commented 2 years ago

Both the total ELBO loss and the KL loss are stagnant - there is little to no change. Do you have any suggestions on which parameters to tune (latent dimension, gamma, beta, num_convs_fcomb)? I've been playing around with the preprocessing of the data (liver dataset), but with no luck getting the model to learn to predict lesion locations.

I've tested the model on the lung dataset, and it works: there is some diversity in the predictions and the loss progresses. Unfortunately there is no progress on the liver dataset, whether predicting the liver or the lesions.
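For reference, these are the knobs I've been varying (constructor arguments as I understand them from the repo's train script; names and defaults may differ slightly, and the values below are just examples):

```python
# Sketch: the hyperparameters I've been experimenting with (example values).
net = ProbabilisticUnet(input_channels=1,
                        num_classes=1,
                        num_filters=[32, 64, 128, 192],
                        latent_dim=6,        # size of the latent space
                        no_convs_fcomb=4,    # depth of the fcomb block
                        beta=1.0)            # weight on the KL term
net.to(device)
```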

stefanknegt commented 2 years ago

I am not sure why that happens, and my guess is that changing things like the latent dimension and num_convs_fcomb is not going to help. I've only tested it on LIDC, and although I sometimes had issues with the loss, it never remained stagnant. Good luck!

zabboud commented 2 years ago

Thank you - I realized that the KL divergence term often goes to 0. What would be the cause of that? It is probably an indicator of why the model is not training properly.
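For what it's worth, I'm now checking that term directly by comparing the prior and posterior latent distributions after a training-mode forward pass (a sketch; the attribute names are my assumption about what forward() stores). If it really sits near 0, that usually means the posterior has collapsed onto the prior, so the latent code carries essentially no information about the mask.

```python
from torch.distributions import kl

# Sketch: after a training-mode forward pass, compare prior and posterior
# latent distributions directly (attribute names are assumptions).
net.forward(patch, mask, training=True)
kl_per_sample = kl.kl_divergence(net.posterior_latent_space, net.prior_latent_space)
print("mean KL(posterior || prior):", kl_per_sample.mean().item())
```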