stefanknegt / Probabilistic-Unet-Pytorch

A Probabilistic U-Net for segmentation of ambiguous images implemented in PyTorch
Apache License 2.0

kl loss is nan #14

Closed xihuanzhu closed 3 years ago

xihuanzhu commented 3 years ago

Thanks for the code. I have a question. When I train on my own image data, the `conv_layer` in the class `AxisAlignedConvGaussian`, which is defined as `self.encoder = Encoder(self.input_channels, self.num_filters, self.no_convs_per_block, initializers, posterior=self.posterior)` followed by `self.conv_layer = nn.Conv2d(num_filters[-1], 2 * self.latent_dim, (1,1), stride=1)`, always outputs large values (1000+). When its output is used in `dist = Independent(Normal(loc=mu, scale=torch.exp(log_sigma)),1)`, the result is NaN because of `torch.exp`. So I would like to know why there is no need to add `torch.sigmoid` after `conv_layer` to limit its values.
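To make the failure mode concrete, here is a minimal plain-Python sketch (not the repository's code) of the KL divergence between two axis-aligned Gaussians, with log-sigma clamped before exponentiation. The helper names `safe_sigma` and `kl_diag_gaussians` and the bounds of ±10 are assumptions chosen for illustration; the paper and this repository do not prescribe them. In float32, `exp(x)` overflows to infinity once `x` exceeds roughly 88.7, so a log-sigma of 1000 cannot be represented after exponentiation.

```python
import math


def safe_sigma(log_sigma, lo=-10.0, hi=10.0):
    """Clamp log-sigma to [lo, hi] before exponentiating.

    The bounds are illustrative assumptions, not values from the paper.
    """
    clamped = max(lo, min(hi, log_sigma))
    return math.exp(clamped)


def kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL(q || p) for two diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        sq, sp = safe_sigma(lq), safe_sigma(lp)
        # Per-dimension closed form:
        # log(sp / sq) + (sq^2 + (mq - mp)^2) / (2 * sp^2) - 1/2
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


# A log-sigma of 1000 (as reported above) stays finite once clamped.
kl = kl_diag_gaussians([0.0], [1000.0], [0.0], [0.0])
print(math.isfinite(kl))
```

In a PyTorch model, the analogous step would presumably be applying `torch.clamp` to `log_sigma` before `torch.exp`. Note that this only hides the symptom: a layer that emits values above 1000 usually points to unnormalized inputs or diverging weights.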

stefanknegt commented 3 years ago

Hi! I have implemented the architecture as in the original paper (link). In all experiments I ran, I never had this issue so I am not sure what is causing it.

xihuanzhu commented 3 years ago

Thank you for your reply. I ran your code directly; everything except the data is the same. Maybe I should try using the LIDC data.

stefanknegt commented 3 years ago

That would be great, if it happens there then something odd is happening and I can look into it. Good luck!

xihuanzhu commented 3 years ago

Hi, I'm back. Thank you for your reply. I trained on the LIDC data you processed. Training was normal at first, and then it became abnormal:

`in elbo: reconstruction_loss is 2312.861083984375, kl is 17.464019775390625
in elbo: reconstruction_loss is 2914.47802734375, kl is 31.355321884155273
in elbo: reconstruction_loss is 1369.12255859375, kl is 37.53509521484375
in elbo: reconstruction_loss is 1478.605712890625, kl is 55.792762756347656
in elbo: reconstruction_loss is nan, kl is nan
in elbo: reconstruction_loss is nan, kl is nan
in elbo: reconstruction_loss is nan, kl is nan`

This happened in epoch 1 at step 404. But sometimes training is completely normal. I don't know why it is sometimes normal and sometimes abnormal.
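Since the failure is intermittent, one way to narrow it down is to stop at the first non-finite loss and inspect the batch and the layer outputs at that step. The sketch below is plain Python and only illustrates the check; `first_non_finite` is a hypothetical helper, and the KL values are copied (rounded) from the log above. Note that the logged KL grows steadily in the four steps before it turns into NaN.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite value, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# KL values from the log above, rounded to two decimals.
kl_log = [17.46, 31.36, 37.54, 55.79, float("nan"), float("nan"), float("nan")]

bad_step = first_non_finite(kl_log)
if bad_step is not None:
    print(f"first non-finite KL at logged step index {bad_step}")
```

In an actual PyTorch training loop, a comparable guard could check `torch.isfinite(loss)` before calling `backward()`, and `torch.autograd.set_detect_anomaly(True)` may help locate the operation that first produces NaN, at the cost of slower training.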

stefanknegt commented 3 years ago

Hmm, I do remember that on very rare occasions I also had these issues. But to be honest, I couldn't figure out what was causing them.