pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

multichain/DREAM samplers in pymc3 #585

Closed maahn closed 6 years ago

maahn commented 10 years ago

Is there any chance that the multichain DREAM sampler could be re-implemented in pymc? There was once a version for pymc2, but it's not working anymore.

twiecki commented 10 years ago

These samplers haven't really taken off in the literature; I assume their performance is not that great.

maahn commented 10 years ago

I think DREAM is quite popular in geoscience. We usually use MCMC for inverting models, i.e. finding their input parameters given some observations, but most of the models are rather complex, so you cannot use gradient-based approaches. At the same time, the models are computationally expensive, so parallelization is important. So I think there are few alternatives to DREAM, at least for this application. Or did I overlook something?

twiecki commented 10 years ago

Interesting. Looking at the papers that cite the original DREAM paper (http://scholar.google.de/scholar?start=0&hl=en&as_sdt=0,5&sciodt=0,5&cites=9591032068347712586&scipsc=), there are indeed a lot of geoscience papers referencing it. Astrophysicists seem to like emcee (http://dan.iel.fm/emcee/current/), which you might want to take a look at. Probably @jsalvatier would be best suited to answer how difficult it would be to get these samplers going in pymc3.

fonnesbeck commented 10 years ago

I may have someone working on this in the near future. Will keep you updated.

indiajoe commented 9 years ago

I think multichain samplers will be really good to have when we have to walk over an n-dimensional banana-shaped posterior to converge. On how to combine emcee and pymc, googling took me to the following post: https://twiecki.github.io/blog/2013/09/23/emcee-pymc/

fonnesbeck commented 9 years ago

We have a graduate student here at Vanderbilt, Erin Shockley, who is working on a DREAM extension for PyMC 3. I will post a link here once it is ready to be shared.

gitporst commented 8 years ago

Are there any recent updates on including DREAM or comparable ensemble samplers in PyMC 3?

twiecki commented 8 years ago

There is the ATMCMC sampler. See here for an example: https://github.com/pymc-devs/pymc3/issues/1163#issuecomment-231286275

dchristle commented 7 years ago

I also wondered whether it would be possible to implement one of the DREAM samplers in PyMC3. I have worked with variants of DREAM in MATLAB and used emcee in Python for my physics research problems, because my deterministic functions are such that there is no way to compute gradients analytically. One example specific to my research is numerical diagonalization of a matrix to find its eigenvalues -- I know of no way to compute a gradient for this operation efficiently. Without gradients, samplers like NUTS and HMC are a no-go for me, and I use ensemble-based samplers to solve my models. This issue of not having gradients in a model is also what @maahn mentioned. @twiecki mentioned ATMCMC, which sounds like a good step up from a hand-tuned normal-proposal Metropolis-Hastings sampler, but it is an old approach -- its drawbacks are what DREAM was designed to address, as discussed in the DREAM paper:

The covariance adaptation strategy presented in [15] and used in the AM, AP, DRAM and SCAM schemes works well for relatively simple inference problems, but is inefficient and unreliable when confronted with posteriors with very heavy tails and, possibly, infinite first and second moments, and with complex posterior surfaces that contain multiple regions of attraction and numerous local optima. A single chain is unable to efficiently cope with the latter difficulties, and traverse well in pursuit of sampling the target distribution. This will be demonstrated in this paper.

I think ensemble-based samplers still have a role for those of us without efficient gradients available, but I also recognize that one of PyMC3's strengths is that it uses Theano precisely to compute these gradients automatically. In light of that, I wonder whether it makes sense to force a non-gradient-based sampler into a package that's primarily geared towards gradient-based techniques. Any thoughts?
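
(For readers hitting the same no-gradient wall: a common PyMC3 pattern is to wrap the black-box computation as a Theano op without a gradient and sample with a gradient-free step method. A minimal sketch follows; the matrix, the observed value, and the function name are hypothetical illustrations, not from this thread.)

```python
import numpy as np
import theano.tensor as tt
from theano.compile.ops import as_op
import pymc3 as pm

# Wrap a black-box deterministic (no analytic gradient) as a Theano op.
@as_op(itypes=[tt.dvector], otypes=[tt.dscalar])
def smallest_eigenvalue(params):
    # Hypothetical example: smallest eigenvalue of a parameter-dependent matrix.
    H = np.diag(params) + 0.1 * np.ones((len(params), len(params)))
    return np.array(np.linalg.eigvalsh(H)[0])

with pm.Model():
    theta = pm.Normal('theta', mu=0., sd=1., shape=3)
    mu = smallest_eigenvalue(theta)
    pm.Normal('obs', mu=mu, sd=0.05, observed=0.3)  # hypothetical datum
    # as_op provides no gradient, so NUTS/HMC are unavailable;
    # pass a gradient-free step method explicitly.
    trace = pm.sample(2000, step=pm.Metropolis())
```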

jsalvatier commented 7 years ago

I think DREAM would be a good fit for PyMC3. I think it should be straightforward to add. You will have to do the ensemble management yourself, but I don't think that should be too hard.

twiecki commented 7 years ago

There's a PR for ATMCMC: https://github.com/pymc-devs/pymc3/pull/1569

fonnesbeck commented 7 years ago

I have a student I am advising who was going to go ahead and do DREAM in PyMC3, but since there was no advantage to be gained from gradients, she chose not to. In any case, she does have a Python implementation which could serve as a starting point.

hvasbath commented 7 years ago

Yes, the lack of available gradients is why I went through the pain of implementing it. I come from the geoscience community as well. However, I don't get the argument of @dchristle: ATMCMC does not work on a single chain -- you simply specify how many chains you want to sample.

dchristle commented 7 years ago

@hvasbath I made a mistake in characterizing ATMCMC relative to DREAM and its variants. When I looked around for information on ATMCMC (Automatically Tuned MCMC) earlier, it was difficult to find reports of people who had actually used the algorithm, or comparisons of it to others. I only found the CRAN repository entry for it, which gave a few paragraphs of background and a high-level description of the algorithm. It explains, "The algorithm automatically tunes the covariance matrix of the proposal distribution N(Xn, Σp) of a symmetric random walk Metropolis algorithm." As I am sure you know, that basic idea is older, going back to "An adaptive Metropolis algorithm" by Haario et al. in 2001, and DREAM was created to deal with the deficiencies of that approach, which is why I quoted it above. My mistake was thinking that ATMCMC was an implementation of the original Haario et al. work, maybe with some slight tweaks, for R and now PyMC3, when it isn't, so I apologize.

The publication I found introducing ATMCMC is relatively recent [1], and it describes the algorithm in more detail. There are a few adaptation phases. The first gains information about the rough scale of the parameters using the Roberts and Rosenthal (2009) adaptive Metropolis-within-Gibbs algorithm. Then comes a transient phase that, I think, tries to discard the burn-in samples that would otherwise skew a proper estimate of the real covariance of the posterior: it periodically regresses a linear function on the chain output and repeatedly tests the p-value for each coordinate to check whether the chain still has a definitive trend. Once these p-values are all above a threshold (suggesting there is no longer a trend from chains that are still converging), it uses these samples in a second adaptive phase, where the Adaptive Metropolis of Haario et al. runs with some modification -- it runs for a while, checks whether the squared jumping distance is still increasing (which indicates the chains are still improving their mixing), and eventually stops. Finally, it fixes the proposal covariance based on these samples, scales it, and runs to obtain the final samples. The advantage of the last step is that no adaptation takes place anymore, so the chain generated is straightforwardly ergodic.
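
(For concreteness, the Haario et al. covariance-adaptation step that this discussion builds on can be sketched in a few lines. This is a minimal illustration under stated assumptions -- a multivariate target, a fixed initial proposal, and the full-chain covariance recomputed every step -- not the ATMCMC algorithm itself.)

```python
import numpy as np

def adaptive_metropolis(logp, x0, n_steps, eps=1e-6):
    # Haario et al. (2001)-style Adaptive Metropolis: the proposal
    # covariance is the scaled empirical covariance of the chain so far,
    # plus a small jitter to keep it nonsingular.
    d = len(x0)
    s_d = 2.4 ** 2 / d                       # standard AM scaling
    chain = [np.asarray(x0, dtype=float)]
    lp = logp(chain[-1])
    for n in range(n_steps):
        if n < 10 * d:                       # fixed proposal until enough history
            cov = 0.1 * np.eye(d)
        else:
            cov = s_d * (np.cov(np.asarray(chain).T) + eps * np.eye(d))
        prop = np.random.multivariate_normal(chain[-1], cov)
        lp_prop = logp(prop)
        if np.log(np.random.rand()) < lp_prop - lp:  # Metropolis accept/reject
            chain.append(prop)
            lp = lp_prop
        else:
            chain.append(chain[-1].copy())
    return np.asarray(chain)

# Usage on a correlated 2-D Gaussian target.
prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
samples = adaptive_metropolis(lambda x: -0.5 * x @ prec @ x,
                              np.zeros(2), 5000)
```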

The authors have clearly thought a lot about the issues that plague MCMC convergence and the tricky aspects of making a practical adaptive algorithm. I was a little dismayed that there weren't "apples to apples" comparisons to some other popular algorithms out there (emcee or DREAM come to mind because I personally use them, but I'm sure people actually in the field know some others that would be worth a comparison). Specifically, an effective sample size measurement per unit time, or even a wall-clock race to convergence would be nice to see relative to other algorithms. They did show in the Discussion section that the later adaptive phases improve convergence, and that their algorithm performs better than RWM with a covariance fixed a priori to a scaled identity matrix. But I'm not sure how often RWM with an identity matrix is used in practice on "real world" problems with different variable scalings, correlations, or other tricky issues, so it left me still wanting.

I did try to find other papers that had cited that they used this algorithm, but I did not find any. Is there a particular reason you chose to implement this one versus other adaptive algorithms? I haven't used it but would be interested in comparing it to emcee/DREAM just to see how it works on my own problems.

[1] Yang, J. & Rosenthal, J.S. Comput Stat (2016). doi:10.1007/s00180-016-0682-2

hvasbath commented 7 years ago

Thanks @dchristle for your detailed explanation! However, I guess all these abbreviations are not unique and you have another algorithm in mind ;) . The "AT" in the algorithm I implemented stands for "adaptive transitional". From the help string of the function: Samples the solution space with n_chains (adaptive) Metropolis chains, where each chain has n_steps iterations. Once finished, the sampled traces are evaluated:

(1) Based on the likelihoods of the final samples, chains are weighted.
(2) The weighted covariance of the ensemble is calculated and set as the
    new proposal distribution.
(3) The variation in the ensemble is calculated and the next tempering
    parameter (beta) is calculated.
(4) New n_chains Metropolis chains are seeded on the traces with high
    weight for n_steps iterations.
(5) Repeat until beta > 1. (A rough sketch of this loop follows the
    references below.)

References:
Ching, J. and Chen, Y. (2007). Transitional Markov Chain Monte Carlo Method for Bayesian Model Updating, Model Class Selection, and Model Averaging. J. Eng. Mech., 133(7), 816-832. doi:10.1061/(ASCE)0733-9399(2007)133:7(816), http://ascelibrary.org/doi/abs/10.1061/%28ASCE%290733-9399%282007%29133:7%28816%29
Minson, S. E., Simons, M. and Beck, J. L. (2013). Bayesian inversion for finite fault earthquake source models I: Theory and algorithm. Geophysical Journal International, 194(3), 1701-1726.
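
(A minimal sketch of the numbered loop above, under stated assumptions: user-supplied loglike, logprior, and sample_prior functions, a coefficient-of-variation rule for choosing beta, and plain Metropolis chains. The real pymc3 implementation differs in many details.)

```python
import numpy as np

def next_beta(ll, beta, target_cv=1.0):
    # Step (3): choose the tempering increment by bisection so the
    # coefficient of variation of the incremental weights hits a target.
    lo, hi = 0.0, 1.0 - beta
    w = np.exp(hi * (ll - ll.max()))
    if w.std() / w.mean() <= target_cv:
        return 1.0                        # full step keeps the weights healthy
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        w = np.exp(mid * (ll - ll.max()))
        if w.std() / w.mean() > target_cv:
            hi = mid
        else:
            lo = mid
    return beta + 0.5 * (lo + hi)

def metropolis_chain(x, cov, loglike, logprior, beta, n_steps):
    # Step (4): a short Metropolis run targeting prior * likelihood**beta.
    lp = beta * loglike(x) + logprior(x)
    for _ in range(n_steps):
        prop = np.random.multivariate_normal(x, cov)
        lp_prop = beta * loglike(prop) + logprior(prop)
        if np.log(np.random.rand()) < lp_prop - lp:
            x, lp = prop, lp_prop
    return x

def transitional_mcmc(loglike, logprior, sample_prior,
                      n_chains=500, n_steps=25):
    X = np.array([sample_prior() for _ in range(n_chains)])
    beta = 0.0
    while beta < 1.0:                                  # step (5)
        ll = np.array([loglike(x) for x in X])
        beta_new = next_beta(ll, beta)
        # Step (1): weight chains by the tempered likelihood increment.
        w = np.exp((beta_new - beta) * (ll - ll.max()))
        w /= w.sum()
        # Step (2): weighted ensemble covariance -> new proposal.
        mean = w @ X
        cov = (w[:, None] * (X - mean)).T @ (X - mean)
        # Step (4): reseed chains proportionally to weight.
        seeds = X[np.random.choice(n_chains, n_chains, p=w)]
        X = np.array([metropolis_chain(s, cov, loglike, logprior,
                                       beta_new, n_steps) for s in seeds])
        beta = beta_new
    return X

# Usage: standard-normal prior, Gaussian likelihood centered at 1.
posterior = transitional_mcmc(
    loglike=lambda x: -0.5 * np.sum((x - 1.0) ** 2) / 0.25,
    logprior=lambda x: -0.5 * np.sum(x ** 2),
    sample_prior=lambda: np.random.randn(2))
```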

dchristle commented 7 years ago

@hvasbath Ah! :D Okay, this explains why I couldn't quite find literature on it -- I think the only ATMCMC in the literature is the one I linked and the R package 'atmcmc', and in the papers you linked the algorithm is called TMCMC and CATMIP. It seems like a neat approach -- it looks closely related to sequential Monte Carlo/particle filter/path sampling ideas, and I've wondered in the past how well those might work in practice for offline analysis (versus the on-line analysis or Bayesian model-evidence computations they are often seen in). I hadn't seen a freely available MATLAB or Python implementation of them for the offline setting. Neat!

fonnesbeck commented 6 years ago

Can we let the DREAM die and close this, or is there ongoing interest? Seems like we have alternatives in place for doing similar things.

hvasbath commented 6 years ago

I guess so. In case someone else stumbles upon this later: in pymc3 there is the SMC sampler, a sequential Monte Carlo algorithm, which works very similarly to the DREAM sampler.
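
(For anyone landing here: a minimal usage sketch, assuming a PyMC3 version that ships pm.sample_smc; the toy model is hypothetical.)

```python
import pymc3 as pm

# Toy model sampled with the built-in Sequential Monte Carlo sampler,
# which, like DREAM, needs no gradients.
with pm.Model():
    x = pm.Normal('x', mu=0., sd=1., shape=2)
    pm.Normal('obs', mu=x.sum(), sd=0.1, observed=1.0)
    trace = pm.sample_smc(2000)
```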