henripal / sgld


Question about SGLD optimizer #1

Closed goldkim92 closed 5 years ago

goldkim92 commented 6 years ago

Hi! It was impressive to see your code, especially how you subclass Optimizer to build the SGLD class in sgld_optimizer.py.

However, I'm wondering whether the way the noise is added to the gradient is slightly off. According to the SGLD paper (and also your ICML workshop paper), it seems the update should be

```python
langevin_noise = Normal(torch.zeros(size), torch.ones(size) / group['lr'])
p.data.add_(-group['lr'], d_p + langevin_noise.sample().cuda())
```

based on the fact that lr * (a sample from N(0, 1/lr)) is a sample from N(0, lr): scaling by lr multiplies the variance by lr², and (1/lr) * lr² = lr.
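For concreteness, here is a minimal numerical check of that identity (a sketch, not code from the repo; the second argument of N(.,.) is read as a variance, so the standard deviation is its square root):

```python
import torch

lr = 0.01
n = 1_000_000

# A sample from N(0, 1/lr): Gaussian with variance 1/lr, i.e. std = sqrt(1/lr)
z = torch.randn(n) * (1.0 / lr) ** 0.5

# Scaling by lr multiplies the variance by lr**2, so lr * z has variance lr
print((lr * z).var())  # roughly 0.01 == lr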

JavierAntoran commented 5 years ago

I have the same question as goldkim92. Additionally, in the original SGLD paper, they consider the effect of the prior through the gradient of log p(\theta). I believe your implementation ignores this term as it discards weight decay.
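For reference, a minimal sketch of how the prior term could be kept, assuming a Gaussian prior (the function and names below are illustrative, not from the repo): for theta ~ N(0, sigma^2), the gradient of -log p(theta) is theta / sigma^2, which is exactly a weight-decay term added to the likelihood gradient.

```python
import torch

def grad_with_gaussian_prior(p: torch.nn.Parameter, weight_decay: float) -> torch.Tensor:
    """Add the prior's contribution to the likelihood gradient.

    With a Gaussian prior N(0, sigma^2) on each weight, the gradient of the
    negative log prior is theta / sigma^2, i.e. weight_decay * theta with
    weight_decay = 1 / sigma^2 (illustrative; scaling conventions may differ).
    """
    d_p = p.grad.data
    if weight_decay != 0:
        d_p = d_p.add(p.data, alpha=weight_decay)
    return d_p
```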

henripal commented 5 years ago

Sorry - hadn't looked at this in a while! I spent a little time on it and I think you're definitely right - fixed in 60bd5646a2f25885b1b1f70f5db4ecc4d6481b26. I'm still taking the square root of the term you suggest in the implementation, since PyTorch's Normal takes a scale (standard deviation) parameter rather than a variance.
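A hedged sketch of what the fixed update could look like (illustrative names, not a verbatim copy of the commit):

```python
import torch
from torch.distributions import Normal

def sgld_param_update(p: torch.nn.Parameter, lr: float) -> None:
    """One SGLD step for a single parameter tensor (illustrative sketch)."""
    d_p = p.grad.data
    size = d_p.size()
    # Normal takes a scale (std), so use sqrt(1/lr) to get noise with variance 1/lr;
    # after multiplying by lr, the injected noise has variance lr, as SGLD requires.
    langevin_noise = Normal(torch.zeros(size), torch.ones(size) / lr ** 0.5)
    p.data.add_(d_p + langevin_noise.sample(), alpha=-lr)
```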

henripal commented 5 years ago

Also @JavierAntoran - yes, I intentionally discard the prior term.

JavierAntoran commented 5 years ago

> Also @JavierAntoran - yes, I intentionally discard the prior term.

Thanks for the quick reply. Could you explain why you make this decision?

henripal commented 5 years ago

Ha, maybe a GitHub issue isn't the best place to talk about this, but:

JavierAntoran commented 5 years ago

Hmm, I'm not sure I'm following your reasoning.

I would like to read your thesis once it is published. We can continue this conversation via email if you prefer. You can find me at: javier(dot)a(dot)es(at)ieee(dot)org

henripal commented 5 years ago

Agree partially! Emailing you and closing this issue - thanks all!