pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Variational inference with AD? #708

Closed: datnamer closed this issue 8 years ago

datnamer commented 9 years ago

Can the theano infrastructure handle this?

https://github.com/stan-dev/stan/pull/1421 http://andrewgelman.com/2015/02/18/vb-stan-black-box-black-box-variational-bayes/

jsalvatier commented 9 years ago

We don't currently have this implemented, but I suspect it would fit well into pymc3.


twiecki commented 9 years ago

Neat! This would be fantastic and I think theano has some features that would make this work without too much hassle.

datnamer commented 9 years ago

It would be :) Just to be clear though, this was more of an "I'll just leave this right here" kind of deal... I would have no idea where to even start.

twiecki commented 9 years ago

Understood :). Looking at the paper it actually doesn't look that bad.

fonnesbeck commented 9 years ago

I've been looking over that paper with a student of mine. Looks promising.

fonnesbeck commented 9 years ago

Relatedly, there is this. Nice to see a few worthy alternatives to MCMC showing up.

datnamer commented 9 years ago

Wow. Flexible inference on larger datasets would be a killer app, and it would also help remedy Python's dearth of statistical modeling packages.

twiecki commented 9 years ago

They are using rmsprop to maximize. There are two Python implementations: one that works with Theano, from the theanets library (http://theanets.readthedocs.org/en/latest/generated/theanets.trainer.RmsProp.html), and one from climin (http://climin.readthedocs.org/en/latest/rmsprop.html).
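
For reference, the update rule itself is small. Here is a rough sketch in plain numpy (illustrative names only, not tied to either package):

import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # Exponentially weighted moving average of the squared gradient
    cache = decay * cache + (1.0 - decay) * grad ** 2
    # Scale the step by the root of that running average
    # (written as descent; flip the sign of the step to maximize)
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache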

twiecki commented 9 years ago

The stochastic gradient estimate from the paper (shown as an equation screenshot in the original comment) is what we need to figure out how to compute.

syclik commented 9 years ago

It's the stochastic estimate of the gradient. If you're using theano, this should be cheaper than in Stan because it's already done symbolically, I think.

@akucukelbir, thoughts?

akucukelbir commented 9 years ago

@syclik thanks for alerting me to this thread. Very exciting.

The notation is really poor in that workshop paper; I threw it together too quickly. Expect a much more readable (and thorough) arxiv preprint soon.

I'm not too familiar with pymc or theano, but the primary reason our algorithm works so well in Stan is the automatic transformations. Without those, you'd have to specify the variational approximation yourself.

Do we have transformations in pymc? If so, given that we've worked out a lot of the kinks in Stan, it shouldn't be hard to implement in pymc too.

@twiecki we only use rmsprop in the context of "forgetting" about past gradients. I think it would be more accurate to say we use a windowed version of adaGrad with stochastic gradient ascent.

@datnamer the arxiv preprint will present results of using a data subsampling strategy with this algorithm. The speed improvements are dramatic.
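
(the usual subsampling correction, sketched in plain python: on a minibatch of size B drawn from N datapoints, rescale the minibatch log-likelihood by N/B so its expectation matches the full-data log-likelihood. illustrative only, not our exact implementation.)

def subsampled_logp(log_prior, log_lik_minibatch, N, B):
    # unbiased estimate of the full-data log joint from one minibatch
    return log_prior + (N / B) * log_lik_minibatch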

twiecki commented 9 years ago

@akucukelbir @syclik Thanks for chiming in here!

Regarding the question of transformations: they do exist, but they are not automatic (although making them automatic would probably not be too hard). Here is an example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/stochastic_volatility.py#L56
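
To illustrate the idea, here is a generic sketch of the change of variables a log transform implies (plain numpy, hypothetical names, not the pymc3 API):

import numpy as np

def logp_unconstrained(x, logp_positive):
    # Work with x = log(sigma) on the whole real line; map back to
    # sigma > 0 and add the log-Jacobian, log|d sigma / d x| = x.
    sigma = np.exp(x)
    return logp_positive(sigma) + x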

Your help in getting this implemented in pymc3 would be much appreciated!

akucukelbir commented 9 years ago

Great! Could we chat about this after June 5? I'd love to talk in more detail. (I'll put it in my calendar, but just to be safe could one of you also ping me?)

twiecki commented 9 years ago

Sounds good!

jsalvatier commented 9 years ago

I'd love to chat with you guys as well.


twiecki commented 9 years ago

@akucukelbir Want to try and find a time to chat about that?

akucukelbir commented 9 years ago

absolutely. the arXiv paper should be up in a day or two. i'll let you guys know on this thread, and we can schedule a time. (having the paper in front of us would help guide the discussion, i think.) sound good?

twiecki commented 9 years ago

:+1:

akucukelbir commented 9 years ago

hi folks. ok so i'm having some problems with arXiv. (should be resolved soon.) in the meantime, i'll just host the preprint on my website. here it is:

http://www.proditus.com/papers/KRGB_preprint.pdf

are we all in similar time zones? (i'm in new york.) if so, we can maybe start proposing some time-slots to chat?

looking forward!

twiecki commented 9 years ago

Sounds great. I'm in CET, @jsalvatier is in PT I think. So something like 11am EST might work well for everyone.

fonnesbeck commented 9 years ago

I'm in Central, so anything works for me.

akucukelbir commented 9 years ago

if we're shooting for 11am eastern, then the earliest i could do is friday (jun 12).

if later in day is a possibility, then 3pm eastern today (jun 10) is also an option.

twiecki commented 9 years ago

Friday would be best for me.

jsalvatier commented 9 years ago

11am EST works fine for me. I'm jsalvatier on both skype and gmail.


syclik commented 9 years ago

I can make it too.

fonnesbeck commented 9 years ago

Why don't we use appear.in? It works in the browser without any account requirements or software. Let's say:

https://appear.in/pymc3

Sound good?

akucukelbir commented 9 years ago

perfect. friday at 11am eastern on appear.in. looking forward!

akucukelbir commented 9 years ago

wow. i didn't realize you guys had HMC. i'll be sure to cite the 2010 paper for the camera-ready version of our paper. that's a big omission on my part. apologies.

talk to you in 2 hours!

fonnesbeck commented 9 years ago

Actually, the 2010 paper is pre-HMC. We are working on something now, but it is not yet published or in press.

twiecki commented 9 years ago

We can put something up on arxiv so that it's citable.

twiecki commented 9 years ago

https://github.com/JonathanRaiman/gradient_optimizers has adagrad.

akucukelbir commented 9 years ago

hey guys. great chat!

so here are a few snippets, as promised. all the files below are under stan/src/stan/variational.

this is from the function calc_combined_grad on line 237 of advi.hpp. it implements equations (5) and (6) from the paper.

// Naive Monte Carlo integration
for (int i = 0; i < n_monte_carlo_grad_; ++i) {
  // Draw from standard normal and transform to real-coordinate space
  for (int d = 0; d < dim; ++d) {
    eta(d) = stan::math::normal_rng(0, 1, rng_);
  }
  zeta = muomega.loc_scale_transform(eta);

  stan::math::check_not_nan(function, "zeta", zeta);

  // Compute gradient step in real-coordinate space
  stan::model::gradient(model_, zeta, tmp_lp, tmp_mu_grad,
                        print_stream_);

  // Update mu
  mu_grad.array() = mu_grad.array() + tmp_mu_grad.array();

  // Update omega
  omega_grad.array() = omega_grad.array()
    + tmp_mu_grad.array().cwiseProduct(eta.array());
}
mu_grad    /= static_cast<double>(n_monte_carlo_grad_);
omega_grad /= static_cast<double>(n_monte_carlo_grad_);

// Multiply by exp(omega)
omega_grad.array() =
  omega_grad.array().cwiseProduct(muomega.omega().array().exp());

// Add gradient of entropy term (just equal to element-wise 1 here)
omega_grad.array() += 1.0;

and here is a snippet of the main loop, in function robbins_monro_adagrad on line 469 of advi.hpp

while (do_more_iterations) {
  // Compute gradient using Monte Carlo integration
  calc_combined_grad(muomega, mu_grad, omega_grad);

  // Accumulate S vector for ADAgrad
  mu_s.array()     += mu_grad.array().square();
  omega_s.array()  += omega_grad.array().square();

  // RMSprop-style moving average weighting
  mu_s.array() = pre_factor * mu_s.array()
               + post_factor * mu_grad.array().square();
  omega_s.array() = pre_factor * omega_s.array()
                  + post_factor * omega_grad.array().square();

  // Take ADAgrad or rmsprop-style step
  muomega.set_mu( muomega.mu().array()
    + eta_adagrad_ * mu_grad.array()
    / (tau + mu_s.array().sqrt()) );
  muomega.set_omega( muomega.omega().array()
    + eta_adagrad_ * omega_grad.array()
    / (tau + omega_s.array().sqrt()) );

finally, advi_params_meanfield.hpp implements a class that represents the mean-field approximation.
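
for anyone who wants the gist without the C++, here is a rough numpy transliteration of that gradient step (illustrative only; the names are made up and nothing here is tied to stan or pymc3):

import numpy as np

def meanfield_grads(grad_logp, mu, omega, n_samples=10, rng=None):
    # monte carlo estimate of the gradients of the variational objective
    # w.r.t. the location mu and log-scale omega (equations (5) and (6))
    rng = np.random.default_rng() if rng is None else rng
    dim = mu.shape[0]
    mu_grad = np.zeros(dim)
    omega_grad = np.zeros(dim)
    for _ in range(n_samples):
        eta = rng.standard_normal(dim)      # draw from a standard normal
        zeta = mu + np.exp(omega) * eta     # location-scale transform
        g = grad_logp(zeta)                 # model gradient in real coordinates
        mu_grad += g
        omega_grad += g * eta
    mu_grad /= n_samples
    omega_grad /= n_samples
    omega_grad *= np.exp(omega)             # chain rule through exp(omega)
    omega_grad += 1.0                       # gradient of the entropy term
    return mu_grad, omega_grad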

datnamer commented 9 years ago

It seems like it is, or at one point was, possible to use out-of-core datasets with PyMC3 by way of numpy memmap arrays: https://groups.google.com/forum/#!topic/pymc/6HrjuqtIVjo

If feasible, on-disk datasets plus more efficient inference would be a powerful combination. Not sure if that would change the implementation, but I thought I would bring it up nonetheless.
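
A minimal sketch of the memmap side of this (the file name and shape are made up):

import numpy as np

# Memory-map a large on-disk array so that only the slices actually touched
# get pulled into RAM.
data = np.memmap("observations.dat", dtype="float64", mode="r",
                 shape=(10_000_000, 5))
minibatch = np.asarray(data[:500])  # copy one small batch into memory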

jsalvatier commented 9 years ago

I love that idea. I don't think it will change the implementation too much. Do you have any particular model/dataset combinations in mind? It would be great to have a really impressive application.

jsalvatier commented 9 years ago

I'm working on a draft for this. Do we know any appropriate SGD optimizers in python. I could write my own, of course, but it would be great to just use prewritten optimizers.

jsalvatier commented 9 years ago

Er that was supposed to be a question. Do we know any nice python SGD optimizers?

aflaxman commented 9 years ago

Use sklearn if possible; it's taking over the world: http://scikit-learn.org/stable/modules/sgd.html

datnamer commented 9 years ago

@jsalvatier It was sparked by this discussion on reddit: http://www.reddit.com/r/statistics/comments/3ab8os/best_tools_for_glmnet_gams_and_mixed_effects/

I agree this would be awesome. I have a possible dataset that I will try to dig up.

@mrocklin has been building some out-of-core stuff in dask and blaze. It might be more memory/speed efficient to pass in an OOC dask array using the numpy buffer protocol, but I don't know whether that would work with theano or whether there is enough linalg support. Maybe Matt can comment on the potential utility, if any, of the blaze family here.

I second the sklearn suggestion for SGD. IIRC, it has an implementation for out-of-core learning using incremental partial_fit. Most likely any PyMC3 installation would be coexisting with scikit-learn anyway.

jsalvatier commented 9 years ago

@aflaxman Thanks Abie. Do you know if there's a good way to use sklearn's internal optimizer directly?

twiecki commented 9 years ago

scikit-learn only has SGD, but the paper is using adagrad, I think. Lasagne seems to offer all of those for theano: https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py
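
A hedged sketch of what using those updates could look like on a toy objective (just lasagne + theano, not wired into pymc3; assumes lasagne's documented adagrad signature):

import numpy as np
import theano
import theano.tensor as T
import lasagne

# Toy objective: pull a shared parameter vector towards a target;
# lasagne.updates.adagrad builds the theano update dictionary for us.
w = theano.shared(np.zeros(3), name="w")
target = T.dvector("target")
loss = T.sum((w - target) ** 2)

updates = lasagne.updates.adagrad(loss, [w], learning_rate=0.1)
step = theano.function([target], loss, updates=updates)

for _ in range(200):
    step(np.array([1.0, -2.0, 3.0]))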

jsalvatier commented 9 years ago

Thanks Thomas, exactly what I was looking for.


aflaxman commented 9 years ago

Looks like you found something even more suited than sklearn. If it seems like the sklearn route would still be helpful, let me know more what you are after, and I can take a look.

jsalvatier commented 9 years ago

Basically, I'm looking for stochastic optimizers with a general interface, similar to what scipy.optimize provides.


jsalvatier commented 9 years ago

@datnamer were you ever able to dig up the dataset?

mjwillson commented 8 years ago

Hi all

First thing that occurred to me on reading this paper was "this would be great to add into PyMC". Glad to see it's already being worked on!

@jsalvatier Re AdaGrad implementations, there are a fair few theano-based "deep learning" packages around which should implement the common variants on SGD. The one I'm using is Blocks, which has implemented AdaGrad, AdaDelta, etc. here in a modular way that can be compiled as updates to theano shared variables:

https://blocks.readthedocs.org/en/latest/api/algorithms.html https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L722

Not sure quite how coupled those are to the rest of the library, but they might be a useful starting point, and they aren't actually that much code anyway. You probably do want to implement this in a way that uses theano compiled functions to do the updates, though, rather than a completely generic numpy-based optimiser routine. With theano the updates can run entirely on the GPU without having to transfer parameters between GPU and CPU (something that can kill the performance advantage).

If there's any way this could support streaming training batches (e.g. from a Python iterator) rather than requiring the whole dataset to be loaded into memory, that would also be great. @datnamer numpy mmapped arrays might be one way around this, but they might restrict the way you use the library a bit.

I imagine most people working with seriously large datasets are happy to roll a bit of their own code if necessary to deliver minibatches to an algorithm here, or even to implement their own training loop, provided you expose the right APIs. To my mind that might even be preferable to an "all or nothing" black-box approach where you don't have any control over how the dataset is iterated over.
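
On the user side, that could be as simple as a plain generator (a sketch, nothing PyMC3-specific; assumes each file holds one whitespace-separated row of features per line):

import numpy as np

def stream_minibatches(paths, batch_size):
    # Yield fixed-size minibatches from an iterator over data files,
    # so the full dataset never has to be resident in memory.
    batch = []
    for path in paths:
        with open(path) as f:
            for line in f:
                batch.append(np.array(line.split(), dtype=float))
                if len(batch) == batch_size:
                    yield np.asarray(batch)
                    batch = []
    if batch:
        yield np.asarray(batch)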

@akucukelbir I have a question actually about the data subsampling strategy mentioned in the paper. It describes the updates as being O(BMK) per minibatch, where K is the number of parameters, implying you update all the parameters for every batch.

How do you handle the case where you have IID data with latent variable(s) for each datapoint? Do you only update the variational parameters for the datapoints in the current batch, or do you need to maintain parameters for every point in the dataset and update them all on every batch? The latter seems impractical in a big-data situation.

I can think of ways you might be able to implement the former approach with ADVI, but it might require analysing the model's dependency structure a bit to find which latent variables it can treat this way, and then doing a few inner iterations of inference for the per-batch variational parameters before updating the shared variational parameters. IIRC, Hoffman's online inference for LDA works this way, for example.

akucukelbir commented 8 years ago

@mjwillson at the moment, we only support models with "global" latent variables, e.g. we marginalize out the "local" latent variables in a mixture model.

what you describe is indeed what SVI does. it's not immediately clear, however, how to implement that in ADVI.

mjwillson commented 8 years ago

@akucukelbir that makes sense, thanks for the response :) I can imagine it would be fiddly to get that working in a general setting.

akucukelbir commented 8 years ago

@mjwillson indeed. nevertheless, it's a problem well worth tackling. let me know if at any point you become interested in working on it.

datnamer commented 8 years ago

@mjwillson @akucukelbir @jsalvatier @twiecki Have you seen this BBVI in numpy?! https://github.com/HIPS/autograd/blob/b305f211a0db3f73ee2bed2cd6fb5ff16fe7c8df/examples/black_box_svi.py

fonnesbeck commented 8 years ago

That's really nice. I wonder how it compares to Theano performance-wise? It would be great to be able to someday free PyMC from the constraints that Theano currently places on it.