lewisKit / Amortized_SVGD

Experiments with amortized Stein variational gradient

Porting to pymc3 #1

Open ferrine opened 7 years ago

ferrine commented 7 years ago

Hi! You did a great job! I'm really inspired by this repo. I discovered ASVGD for myself from recent papers:

http://bayesiandeeplearning.org/papers/BDL_21.pdf https://arxiv.org/pdf/1611.01722.pdf

Today I started porting ASVGD to pymc3 using the OPVI framework and accidentally found your repository. Do you mind if I use some code from your repo in pymc3? The LS Sampler is the most interesting part.

lewisKit commented 7 years ago

Hi Maxim,

No problem. However, the LS Sampler code currently has a small problem. I am going to fix it tonight, so you can use it tomorrow.

Thanks, Yihao

lewisKit commented 7 years ago

Hi Maxim,

I have updated the LS Sampler for Bayesian logistic regression. You are free to use the code.

Thanks, Yihao

ferrine commented 7 years ago

@lewisKit I've implemented ASVGD (without the sampler) and tested it with a mean-field approximation. It performs much better in terms of ELBO, i.e. advi_elbo < asvgd_elbo. Why can that happen?

ferrine commented 7 years ago

You can find the implementation here: https://github.com/pymc-devs/pymc3/pull/2183
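
Roughly, the comparison can be run like this (a minimal sketch: the toy model and data are made up for illustration, and keyword names may differ across pymc3 versions):

```python
import numpy as np
import pymc3 as pm

# Toy Bayesian logistic regression data (hypothetical example).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X @ rng.randn(5) > 0).astype(int)

with pm.Model():
    w = pm.Normal('w', mu=0., sd=1., shape=5)
    p = pm.math.sigmoid(pm.math.dot(X, w))
    pm.Bernoulli('obs', p=p, observed=y)

    advi = pm.fit(10000, method='advi')    # mean-field ADVI
    asvgd = pm.fit(10000, method='asvgd')  # ASVGD from the PR

# advi.hist holds the negative-ELBO trace for ADVI; the approximate
# posteriors can be compared through samples from each approximation.
advi_trace = advi.sample(1000)
asvgd_trace = asvgd.sample(1000)
```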

ferrine commented 7 years ago

I found that the sample size matters a lot for inference. Is there any intuition for choosing it?

lewisKit commented 7 years ago

The sample size may matter a lot because we divide the Stein gradient by n; that should be related. In the original code we try to "normalize" the Stein gradient by dividing by sum(kxy) instead, and this trick seems to be more stable than dividing by n. We also find that some other kernels may perform better than the current default Gaussian RBF kernel (this kernel performs better in the GMM case; see the paper).

Adding a temperature term also helps (the trick is introduced here); both tricks are sketched below.

I think we are going to write a practical guide summarizing these tricks this summer.
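
A minimal NumPy sketch of the plain SVGD update with both tricks wired in (the function names and the `normalize` switch are illustrative, not the actual repo code):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel(X, h=-1):
    # Gaussian RBF kernel with the median-heuristic bandwidth.
    sq_dist = squareform(pdist(X)) ** 2
    if h < 0:
        h = np.median(sq_dist) / np.log(X.shape[0] + 1)
    Kxy = np.exp(-sq_dist / h)
    # Sum over j of grad_{x_j} k(x_j, x_i), in closed form for the RBF.
    dxkxy = (-Kxy @ X + X * Kxy.sum(axis=1, keepdims=True)) * 2.0 / h
    return Kxy, dxkxy

def svgd_grad(X, dlogp, alpha=1.0, normalize='sum_kxy'):
    """Stein variational gradient for particles X (n x d).

    dlogp:     (n x d) grad log p evaluated at each particle.
    alpha:     temperature on the repulsive term (alpha > 1 strengthens it).
    normalize: 'n' divides by the number of particles; 'sum_kxy' divides
               each row by sum_j k(x_j, x_i), which we found more stable.
    """
    Kxy, dxkxy = rbf_kernel(X)
    phi = Kxy @ dlogp + alpha * dxkxy  # driving term + repulsive term
    if normalize == 'n':
        return phi / X.shape[0]
    return phi / Kxy.sum(axis=1, keepdims=True)
```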

lewisKit commented 7 years ago

I haven't done any comparison between ADVI and ASVGD, so it's hard to say.

ferrine commented 7 years ago

I'm currently comparing ADVI and MeanField ASVGD and found that ASVGD fails for a medium-dimensional problem (link).

The same happens for other models, and the behavior is always the same: it finds the MAP and then learns to sample from it. More interestingly, with the simple RBF kernel the ELBO decays(!) after it has been maximized, so I get a U-shaped curve on the ELBO-trace plot (which is the reason for using the mean-field approximation in these experiments).

ferrine commented 7 years ago

Here are some experiments tracking the ELBO for ASVGD on the moons classification problem.

I sample exactly 30 particles:

- inference1 is mean-field ASVGD with the default RBF kernel (as here)
- inference2 uses the h from this paper

[image]

Even if I have a mistake in the math somewhere, inference2 performs better here. You can see that inference1's negative ELBO is growing:

[image]

lewisKit commented 7 years ago

The reason ASVGD fails for some medium- or high-dimensional problems is that the second, repulsive term is not big enough; multiplying it by a temperature alpha may help.

ferrine commented 7 years ago

> The sample size may matter a lot because we divide the Stein gradient by n; that should be related. In the original code we try to "normalize" the Stein gradient by dividing by sum(kxy) instead, and this trick seems to be more stable than dividing by n.

Your original code works badly for a simple BEST model and a simple FullRank sampler.

Multiplying by 100 here led me to bias in all estimates. Dividing by sum(kxy) did not change the optimum significantly (it remained bad).

ferrine commented 7 years ago

I've done some experiments comparing ADVI and ASVGD here. They might be interesting for you.

lewisKit commented 7 years ago

Hi ferrine, really nice work! Thanks very much for the work!

We also realized the current RBF kernel is hard to tune: its performance varies a lot when changing the scale of the bandwidth and the temperature. Currently we are investigating more suitable kernels and other tricks to make it more stable. Hopefully we can include these in the updated paper or the blog.

ferrine commented 7 years ago

Sounds good. I'm reading https://arxiv.org/pdf/1705.07107.pdf now; they seem to have interesting findings in this field.

lewisKit commented 7 years ago

Yes, it is another new method related to this. In practice you only need to multiply the second, repulsive term by the inverse of kxy. I used it in the VAE case, where I still needed to make the temperature larger for it to work well, but I am not sure about its performance in other cases.
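
One possible reading of this trick, as a variant of the `svgd_grad` sketch above (interpreting "the inverse of kxy" as dividing the repulsive term by the kernel row sums per particle; the intended form may differ, e.g. a full matrix inverse):

```python
def svgd_grad_inv_kxy(X, dlogp, alpha=1.0):
    # Same driving term as before; only the repulsive term is rescaled
    # by the inverse of the kernel row sums, with a temperature alpha.
    Kxy, dxkxy = rbf_kernel(X)
    driving = (Kxy @ dlogp) / X.shape[0]
    repulsive = alpha * dxkxy / Kxy.sum(axis=1, keepdims=True)
    return driving + repulsive
```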