ferrine opened this issue 7 years ago
Hi Maxim,
No problem. However, the LS Sampler code currently has a small issue; I am going to fix it tonight.
You can use it tomorrow.
Thanks, Yihao
Hi Maxim,
I have updated the LS Sampler for Bayesian logistic regression. You are free to use the code.
Thanks, Yihao
@lewisKit I've implemented ASVGD (without the sampler) and tested it with a mean-field approximation. It performs much better in terms of ELBO, i.e. advi_elbo < asvgd_elbo. Why can that happen?
You can find implementation here https://github.com/pymc-devs/pymc3/pull/2183
I found that the sample size matters a lot for inference; is there any intuition for choosing it?
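For reference, this is roughly the kind of comparison I'm running (a minimal sketch, not the exact PR interface; the obj_n_mc keyword for the particle count is just how I'd expect it to look):

```python
# Minimal sketch of comparing mean-field ADVI vs ASVGD on a toy logistic model.
# The particle-count keyword (obj_n_mc) is an assumption, not necessarily the
# exact argument exposed by the PR.
import numpy as np
import pymc3 as pm

X = np.random.randn(200, 5)
true_w = np.array([1., -2., 0.5, 0., 3.])
y = (X.dot(true_w) + 0.3 * np.random.randn(200) > 0).astype(int)

with pm.Model():
    w = pm.Normal('w', mu=0., sd=1., shape=5)
    pm.Bernoulli('obs', p=pm.math.sigmoid(pm.math.dot(X, w)), observed=y)

    advi = pm.fit(10000, method='advi')                 # mean-field ADVI baseline
    asvgd = pm.fit(10000, method='asvgd', obj_n_mc=30)  # ASVGD with ~30 particles

trace_advi = advi.sample(1000)
trace_asvgd = asvgd.sample(1000)
```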
The sample size probably matters a lot because we divide by n in the Stein gradient. In the original code we "normalize" the Stein gradient by dividing by sum(kxy) instead, and this trick seems to be more stable than dividing by n. We also find that some other kernels may perform better than the current default Gaussian RBF kernel (paper: this kernel performs better in the GMM case).
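Roughly, the two normalizations look like this (an illustrative numpy sketch, not the exact code from either repo; `grad_logp` maps particles to the score and `h` is the RBF bandwidth):

```python
import numpy as np

def svgd_phi(particles, grad_logp, h=1.0, normalize_by_kxy=True):
    """Stein gradient for an RBF kernel k(x, y) = exp(-||x - y||^2 / h)."""
    n = particles.shape[0]
    sq = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    kxy = np.exp(-sq / h)                                  # kernel matrix
    # repulsive term: sum_j grad_{x_j} k(x_j, x_i)
    repulsive = 2.0 / h * (kxy.sum(1, keepdims=True) * particles - kxy @ particles)
    stein = kxy @ grad_logp(particles) + repulsive         # unnormalized Stein gradient
    if normalize_by_kxy:
        return stein / kxy.sum(1, keepdims=True)           # "divide by sum(kxy)" trick
    return stein / n                                        # standard "divide by n"
```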
Also, adding a temperature term helps (the trick is introduced here).
I think we are going to write a practical guide summarizing these tricks this summer.
I haven't done any comparison between ADVI and ASVGD, so it's hard to say.
I'm currently comparing ADVI and MeanField ASVGD and found that ASVGD fails for a middle-dimensional problem (link).
The same happens for other models, and the behavior is always the same: it finds the MAP and then learns to sample from it. More interestingly, with the simple RBF kernel the ELBO decays(!) after it has been maximized, so I get a U-shaped figure on the ELBO-trace plot (that is the reason for using the mean-field approximation in the experiments).
Here are some experiments tracking the ELBO for ASVGD on the moons classification problem.
I sample exactly 30 particles.
inference1 is MeanField ASVGD with the default RBF kernel (as here; the usual bandwidth heuristic is sketched at the end of this comment),
inference2 uses the bandwidth h from this paper.
Even if I have a mistake in the math somewhere, it performs better here.
You can see that inference1's negative ELBO is growing.
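By the default RBF kernel I mean the usual SVGD kernel with a median-heuristic bandwidth, something like this (an illustrative sketch with k(x, y) = exp(-||x - y||^2 / h); the alternative h from the linked paper is not reproduced here):

```python
import numpy as np

def rbf_median_bandwidth(particles):
    # squared pairwise distances between particles
    sq = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    n = particles.shape[0]
    # common heuristic: h = median(squared distance) / log(n + 1)
    return np.median(sq) / np.log(n + 1.0)
```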
The reason why ASVGD fails for some middle- or high-dimensional problems is that the second, repulsive term is not big enough; multiplying it by a temperature alpha may help.
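Something along these lines (an illustrative numpy sketch, not our exact code; `grad_logp` is the score function and `h` the RBF bandwidth):

```python
import numpy as np

def svgd_phi_tempered(particles, grad_logp, h=1.0, alpha=10.0):
    n = particles.shape[0]
    sq = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    kxy = np.exp(-sq / h)
    repulsive = 2.0 / h * (kxy.sum(1, keepdims=True) * particles - kxy @ particles)
    attractive = kxy @ grad_logp(particles)
    # alpha > 1 boosts the repulsion so particles don't all collapse onto the MAP
    return (attractive + alpha * repulsive) / n
```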
> The sample size probably matters a lot because we divide by n in the Stein gradient. In the original code we "normalize" the Stein gradient by dividing by sum(kxy) instead, and this trick seems to be more stable than dividing by n.
Your original code works poorly for a simple BEST model and a simple FullRank sampler.
Multiplying by 100 here led to bias in all the estimates, and dividing by sum(kxy) did not change the optimum significantly (it remained bad).
I've done some experiments comparing ADVI and ASVGD here; they might be interesting for you.
Hi ferrine, really nice work, thanks very much!
We also realized that the current RBF kernel is hard to tune: its performance varies a lot when changing the scale of the bandwidth and the temperature. Currently we are investigating more suitable kernels and other tricks to make it more stable. Hopefully we can include these in the updated paper or the blog.
Sounds good, I'm reading https://arxiv.org/pdf/1705.07107.pdf now. They seem to have interesting findings in this field
Yes, it is another new method related to this. In practice you only need to multiply the second, repulsive term by the inverse of kxy. I used it in the VAE case, but I still needed to make the temperature larger to make it work well, and I am not sure about its performance in other cases.
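My reading of that trick, as a rough numpy sketch (my own illustration, not code from the paper): precondition only the repulsive term with the inverse of the kernel matrix, keeping a temperature alpha on top.

```python
import numpy as np

def svgd_phi_inv_kxy(particles, grad_logp, h=1.0, alpha=10.0, jitter=1e-6):
    n = particles.shape[0]
    sq = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    kxy = np.exp(-sq / h)
    repulsive = 2.0 / h * (kxy.sum(1, keepdims=True) * particles - kxy @ particles)
    attractive = kxy @ grad_logp(particles)
    # multiply the repulsive term by kxy^{-1}; jitter keeps the solve stable
    repulsive = np.linalg.solve(kxy + jitter * np.eye(n), repulsive)
    return (attractive + alpha * repulsive) / n
```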
Hi! You did a great job! I'm really inspired by this repo. I've discovered ASVGD for myself from recent papers:
http://bayesiandeeplearning.org/papers/BDL_21.pdf https://arxiv.org/pdf/1611.01722.pdf
Today I started porting ASVGD to pymc3 using the OPVI framework and accidentally found your repository. Do you mind if I use some code from your repo in pymc3? The LS Sampler is the most interesting part.