izmailovpavel / understandingbdl


SGLD Optimizer #24

Closed nm19000 closed 1 year ago

nm19000 commented 1 year ago

Hi, I was trying to implement SGLD according to equation (4) of http://people.ee.duke.edu/~lcarin/398_icmlpaper.pdf and came across your version of an SGLD optimizer. I wanted to ask how your gradient updates relate to the update proposed in eq. (4) of the original paper. For instance, there is the factor N/n (full training set size / batch size), which I cannot find in your code, and the added noise eta_t scales with the learning rate in the original paper, which it does not seem to in your implementation. Is there a reason for this?
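
For reference, the update in equation (4) of the paper is

$$\Delta \theta_t = \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla \log p(x_{t_i} \mid \theta_t) \right) + \eta_t, \qquad \eta_t \sim N(0, \epsilon_t).$$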

Moreover, in your paper you report taking the last sample from a number of chains and then computing the mean and std; however, in the original paper it seems you could also take multiple samples from the same chain and still obtain sufficient variance. Do you know if this is possible with the proposed optimizer?

Many thanks in advance, Nina

izmailovpavel commented 1 year ago

Hey Nina (@nm19000)!

I should mention that we wrote this code a while ago (I think back in 2019) and haven't really used it since. It could make sense to explore some other implementations, such as the one here.

For the N/n factor, I believe the assumption in the code is that you do the rescaling when you evaluate the loss. You should use a loss that is a stochastic estimate of the posterior log-density (not normalized by the number of datapoints).
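
Concretely, a minimal sketch of such a loss (not the repo's exact code; `model`, the mini-batch `(x, y)`, the dataset size `N`, and `prior_precision` are placeholder names):

```python
import torch
import torch.nn.functional as F

def sgld_loss(model, x, y, N, prior_precision=1.0):
    """Stochastic estimate of the negative posterior log-density (up to a constant)."""
    # Mean NLL over the mini-batch times N equals (N/n) * sum_i -log p(y_i | x_i, theta),
    # so the N/n rescaling from eq. (4) happens here rather than inside the optimizer.
    nll = F.cross_entropy(model(x), y, reduction='mean') * N
    # Negative log of a zero-mean Gaussian prior over the weights (i.e. weight decay).
    neg_log_prior = 0.5 * prior_precision * sum((p ** 2).sum() for p in model.parameters())
    return nll + neg_log_prior
```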

For the noise term, in this line we add $\eta_t$ to the gradient. Note that our `group['lr']` is $\epsilon_t / 2$ in the notation of the paper. The `noise_factor` should be 1; I believe it was used to do posterior tempering. Then `noise_factor * (2.0 * group['lr']) ** 0.5 * torch.randn_like(d_p)` is $\sqrt{\epsilon_t} \cdot \xi$, where $\xi \sim N(0, I)$. In other words, it is a sample $\eta_t \sim N(0, \epsilon_t)$, the same as in the paper.
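
As a stripped-down illustration of that update (a sketch in the spirit of the repo's optimizer, not the exact class; the default `lr` is a placeholder):

```python
import torch

class MinimalSGLD(torch.optim.Optimizer):
    """Minimal SGLD sketch: `lr` plays the role of epsilon_t / 2 in the paper's notation."""

    def __init__(self, params, lr=1e-6, noise_factor=1.0):
        super().__init__(params, dict(lr=lr, noise_factor=noise_factor))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # sqrt(2 * lr) = sqrt(epsilon_t), so the injected noise is a sample
                # eta_t ~ N(0, epsilon_t * I), matching eq. (4).
                noise = group['noise_factor'] * (2.0 * group['lr']) ** 0.5 * torch.randn_like(p)
                # theta <- theta - (epsilon_t / 2) * grad(loss) + eta_t,
                # where the loss is a stochastic estimate of the negative log-posterior.
                p.add_(-group['lr'] * p.grad + noise)
```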

In the paper we used just the last sample from each chain, but you can certainly use multiple samples from each chain, as you suggested. The optimizer we implemented works the same way as SGD, meaning that it produces a training trajectory for you. You can save as many samples as you want along the trajectory, for example along the lines of the sketch below.
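
A hypothetical loop that thins the chain and keeps several samples (all names — `model`, `optimizer`, `train_loader`, `sgld_loss`, `N`, and the burn-in/thinning settings — are placeholders, not from the repo):

```python
import copy

samples = []
burn_in_epochs = 150   # assumed burn-in before any sample is kept
save_every = 5         # assumed thinning interval, in epochs

for epoch in range(300):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = sgld_loss(model, x, y, N)
        loss.backward()
        optimizer.step()
    if epoch >= burn_in_epochs and (epoch - burn_in_epochs) % save_every == 0:
        # Keep a snapshot of the current weights as one posterior sample from this chain.
        samples.append(copy.deepcopy(model.state_dict()))

# At test time, average the predictions made with each saved sample
# (and, optionally, across multiple chains).
```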

Please let me know if you have other questions, Pavel

nm19000 commented 1 year ago

Hi Pavel,

many thanks for your answer and explanations!

Best Nina

izmailovpavel commented 1 year ago

@nm19000 you may also want to check out this implementation of SGLD: https://github.com/activatedgeek/torch-sgld