kumar-shridhar / PyTorch-BayesianCNN

Bayesian Convolutional Neural Network with Variational Inference based on Bayes by Backprop in PyTorch.
MIT License

nan in loss #8

Closed: jrubin01 closed this issue 5 years ago

jrubin01 commented 5 years ago

Running both main_Bayes.py and Bayesian_CNN_Detailed.ipynb in the 'Image Recognition' folder gives losses that become nan, due to the kl returned from net.probforward(x).
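
For anyone trying to pin down where the nan first appears, a small wrapper like this can help (a sketch only; it assumes net.probforward(x) returns (outputs, kl) as in main_Bayes.py):

```python
import torch

def check_kl(net, x):
    """Run one forward pass and flag non-finite variational parameters.

    A debugging sketch: assumes net.probforward(x) returns (outputs, kl)
    as in main_Bayes.py; `check_kl` itself is not part of the repo.
    """
    outputs, kl = net.probforward(x)
    if torch.isnan(kl).any():
        # Look for the layer whose variational parameters went non-finite.
        for name, p in net.named_parameters():
            if not torch.isfinite(p).all():
                print(f"non-finite parameter: {name}")
        raise RuntimeError("kl became nan (issue #8)")
    return outputs, kl
```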

kumar-shridhar commented 5 years ago

Yes, it is due to a different sampling technique being applied. Please check back in 2-3 days; the new, updated code will be pushed by then.

pankajb64 commented 5 years ago

FWIW, this PR on the mirror repo fixes the NaN issue (at least on my local machine) for main_Bayes.py: https://github.com/felix-laumann/Bayesian_CNN/pull/7 (except replace math.log with torch.log).

The argument expected by logpdf and sample in the Normal distribution class is logvar, but the actual value passed is std_dev.
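
To illustrate the mismatch, here is a minimal sketch (the Normal class below is a simplified stand-in for the one in BBBdistributions.py, not the repo's exact code):

```python
import math
import torch

class Normal:
    """Simplified stand-in for Normal in BBBdistributions.py."""

    def __init__(self, mu, logvar):
        self.mu = mu
        self.logvar = logvar  # expected to be log(sigma^2), NOT sigma

    def sample(self):
        # Reparameterised sample: mu + sigma * eps, sigma = exp(0.5 * logvar)
        eps = torch.randn_like(self.mu)
        return self.mu + torch.exp(0.5 * self.logvar) * eps

    def logpdf(self, x):
        # log N(x | mu, exp(logvar)), elementwise
        return -0.5 * (self.logvar
                       + (x - self.mu) ** 2 / torch.exp(self.logvar)
                       + math.log(2 * math.pi))

mu = torch.zeros(3)
std = torch.full((3,), 0.1)

# Buggy usage: passes the std where a log-variance is expected, so the
# effective std becomes exp(0.5 * std) instead of std itself.
buggy = Normal(mu, std)

# Fixed usage: convert the stored std to a log-variance first
# (this is where torch.log replaces math.log, since std is a tensor).
fixed = Normal(mu, torch.log(std ** 2))
```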

fbiying87 commented 5 years ago

Is the issue solved yet? I also get Loss=nan when executing the main_Bayes.py script, and I also run into CUDA error: out of memory. Does anyone else have the same issues? Thanks!

smentu commented 5 years ago

I believe I am also hitting the nan loss issue. Which versions of PyTorch and the other dependencies did you use during development? Would it be possible to get something like a pip freeze or conda list showing the versions that are guaranteed to work? Thank you in advance.

kumar-shridhar commented 5 years ago

Conference proceedings, plus my being away for a couple of days, have delayed things a bit. I am sorry, but I will update the code with a requirements file in 2 weeks, along with new updates on uncertainty measures. Thank you.

alexander-pv commented 5 years ago

Got the same nans. I think the problem is in the weight sampling: weight = self.weight.sample() in fcprobforward. I found that logvar in the sample() method of the Normal class in BBBdistributions.py becomes too large to store in memory. I managed to get rid of the nans by setting q_logvar_init in the BBBLinearFactorial class to a negative value (-5). This reduces the amount of variance in fc_qw_std, which is set in self.fc_qw_std.data.fill_(self.q_logvar_init). Not sure yet whether this is the correct thing to do.
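
To make the effect concrete, here is a small sketch of why a large logvar blows up the sampled weights and what the negative init buys (the shapes and names are illustrative, not the repo's exact code):

```python
import torch

def sample_weight(mu, logvar):
    # Reparameterisation trick: w = mu + exp(0.5 * logvar) * eps
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

mu = torch.zeros(256, 512)

# If logvar drifts large, exp(0.5 * logvar) overflows float32 to inf
# (exp(90) ~ 1.2e39), so the sampled weights and the KL term go inf/nan.
w_bad = sample_weight(mu, torch.full_like(mu, 180.0))

# Initialising at -5, as suggested above, gives a weight std of
# exp(-2.5) ~ 0.082, so the samples stay well-behaved.
w_ok = sample_weight(mu, torch.full_like(mu, -5.0))
print(w_bad.abs().max(), w_ok.abs().max())
```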

lemonahmas commented 5 years ago

I think I have hit the same problem: the KL divergence produced by main_Bayes.py comes out as nan.

kumar-shridhar commented 5 years ago

I think the nan problem is fixed, though I have not fully verified it.

gunshi commented 5 years ago

Hey, is the nan loss problem fixed? I saw the note in the readme, so I'm just wondering whether the current code is correct.

kumar-shridhar commented 5 years ago

Code is fixed and up and running. Sorry for the delay.