Parallel Tempering in multiple threads?

I left a comment on Gelman's blog about this but I realized this is the best place to keep track of feature requests. Here's a copy of that request:

Speaking of mixing, Geyer’s parallel tempering seems like a pretty useful technique. Any chance of including it into Stan so that Stan will run NUTS on 2 or 3 additional tempered posterior densities, with the tempering coefficients given as an array or something? If you do switching with lowish probability then the chains may be mostly able to run in parallel threads on multiple cores without a lot of synchronization overhead and this could be a fantastic way to improve exploration of the space at basically no wall-clock-time cost.

Specifically, I imagine something like the following method:

You set up your N temperatures, maybe something like

stan_parallel_temps <- [1.25,1.5,3]; ## you always need 1 to be first, so perhaps it's better to just declare the additional temps and let stan put 1 at the front of this list on its own.

during warmup Stan estimates the average length between U turns in the untempered distribution internally. It then chooses a number N at random exponentially distributed (or maybe gamma distributed or with a distribution the user specifies in the model file) with mean equal to some constant times this average inter-u-turn length (ceiling to the nearest int).

All the chains then run N steps and synchronize on a thread semaphore so that when all the high temp threads are done with N steps the temp 1 thread can proceed. It proceeds by choosing a random adjacent pair of temperatures and attempting an exchange between those states, and then generating a new N and setting all the threads back to work to do N HMC timesteps.

in this scheme you most of the time complete several HMC trajectories before trying to exchange, so you don't disrupt the NUTS sampler too much, but then you have several tempered distributions running in parallel, and feeding your untempered simulation new regions of space so you may be able to explore space more readily. And as I said, with several cores your wall clock overhead is only due to the synchronization, which shouldn't be too bad.

I had an implementation of parallel tempering in Stan which worked in serial a few months back. I had even started implementing a parallel version using MPI, but that never really got finished. Anyway, there are some major API changes happening with the samplers right now. Once those have stabilized I'm planning to get back to the implementation.

As a side note, I'm not certain threads are the best way to do it. The way memory works with autodiff in stan means there's a relatively large performance hit when threads are enabled. This was one of the reasons I was looking at using MPI, which would allow for easy distribution to multiple machines if needed.

Whatever way makes it relatively easy and low overhead to implement, I'm for it! In general, I am very surprised that JAGS and others haven't used either threads, or MPI or some kind of parallelization right from the start with tempering a very obvious choice for improving things using multiple cores.

as an aside, there is some fanciness required for parallel RNGs i guess the Lecouyer RNG is designed for this, but in general it would be nice for Stan to be smart about parallel runs with appropriate RNG initialization etc.

On 5/9/13 2:21 PM, dlakelan wrote:

Whatever way makes it relatively easy and low overhead to implement, I'm for it! In general, I am very surprised that JAGS and others haven't used either threads, or MPI or some kind of parallelization right from the start with tempering a very obvious choice for improving things using multiple cores.

One of the big obstacles to threading is the lack of threading in the language itself. We've been considering using the Boost threads, but haven't even gotten as far as figuring out how to install them (unlike most of Boost, they're not header only).

Stan's also designed so that models are thread safe. The tricky part is the reverse-mode automatic differentiation, which uses static data arenas. These need to be made thread local. We have that memory model implemented, but haven't extensively tested the contention overhead under heavy threading. Even with one thread there's some overhead from thread-locality.

as an aside, there is some fanciness required for parallel RNGs

We use this even for chains run in quick succession, which we don't want to seed with consecutive clock-based initializations.

Stan implements jump-ahead of the RNGs based on the chain ID passed to the command. The Boost API for RNGs supports this function.

i guess the Lecouyer RNG is designed for this, but in general it would be nice for Stan to be smart about parallel runs with appropriate RNG initialization etc.

We're using an older L'Ecuyer RNG that implements the jump efficiently

Ben's been urging us to use the later L'Ecuyer RNG, which Boost does not implement (though other packages do implement it).

AES encryption is an extremely high quality random number generator (choose a random 256 bit key, then encrypt an infinite string of zero bytes in cipher block chain mode). Generally this type of random number generator is a lot slower than the kinds used for monte carlo, but the Intel i3/i5/i7 chips supposedly have specialized instructions to do this AES encryption providing up to 700 million random bytes per second in a single thread (according to http://en.wikipedia.org/wiki/Advanced_Encryption_Standard)

Again, if you have threads, one thread can fill up a buffer of random numbers in parallel. when you finish using the first buffer, you swap with the second buffer, tell the filler to fill the old buffer in parallel, and hand out numbers from the new buffer.

all this is to say that with multiple core computers there are definitely opportunities to do lightweight threading that can be advantageous for this kind of big scientific computation, and so I'm glad to hear you guys are being thread conscious even if you're not currently using threads.

(Note: i just looked around a bit and it turns out the recent Ivy Bridge chips have a special random instruction RdRand so that's even nicer, though not reproducible by design. http://en.wikipedia.org/wiki/RdRand )

We usually run four chains in parallel processes, so at least on the hardware I have, that uses up all the cores. We want to be careful not to add threading within chains that may slow the overall runs down. Of course, we could just run the chains sequentially.

Another issue is running on multiple machines in a cluster. I don't know how easy it would be to run parallel termpering in that kind of environment. You'd want exchanges of information to be even less frequent.

We still haven't chosen a threading library. Any suggestions?

RNG hasn't been a measurable component of the time a chain takes, so parallelizing it is unlikely to help much with overall speed.

Does AES encryption encoding skip ahead appropriately? I'm still not sure skipping is such a big deal, but right now, we use a linear congruential RNG that lets us skip ahead efficiently. There's been talk of going to a more modern L'Ecuyer RNG, but there's no implementation built into Boost, and we've been reluctant to add new dependencies.

AES encryption is not a standard type of random number generator, I'm not sure skipping ahead in the sequence is something you need to do, rather you could use separate keys to generate separate streams. Although this is unattractive when you have something like congruential or other simple mathematical generators because the statistical properties can be compromised, encryption algorithms are designed to a much higher level of statistical independence, namely you shouldn't be able to deduce the key from knowing pretty much any practical number of encrypted cyphertexts with known plain text. In addition, without the key you shouldn't be able to deduce anything about the plaintext, even if you know almost all of it. In other words, parallel streams should be very much independent. In particular imagine generating a stream of random numbers by taking a 256 bit random key and a 1024 byte seed, and encrypting the message {seed 00000...} with the key in cipher block chaining (CBC) mode. There are 2^8448 possible streams. Don't like that? Make a 2048 byte seed, there are now 2^16640 possible streams. The algorithm is designed so that even if you use the same key on an enormous number of streams and you know they all start with 1024 random bytes and are followed by an infinite number of zeros, you shouldn't be able to deduce the first 1024 bytes.

Because of this, encryption algorithms produce very high quality random numbers, but are usually slower because such a high requirement for independence of streams requires a lot of steps to mix up the bits. Having hardware accelerated instructions though could make this less of an issue. OpenSSL would probably be the obvious dependency, and it works on linux, windows, and mac and includes hardware acceleration supposedly but I haven't actually used it myself.

I don't know much about threading libraries in C++ these days. I started using threads under BeOS back in the day, and then when it fizzled I used to simply use pthreads in raw form. These days I do data analysis with R and such so I haven't had much need for C++ level threading.

Typically when building a threaded application I would think about what the tasks were and try to break out into separate threads things that were fairly independent. A typical "pattern" is to have a queue that holds work to be done, and a queue that holds results. N threads wait on the to-be-done queue, when something enters it a thread pulls it out, does it, and shoves the results on the finished work queue.

Skimming the paper on NUTS on ArXive I can see that it uses a step in which recursion is used to build trees of transitions or something like that. I can imagine parallelizing this by having BuildTree parallellize the left and right subtrees by pushing the associated work into a queue and having a worker thread pull from that queue and push the results into a result queue... something like that. The advantage would be that you'd be carrying out N leapfrogs in parallel (where N is the number of worker threads), and most of your time is probably spent in the gradient calculations inside leapfrog right?

The construction of the tree in NUTS can be parallelized by, at most, a factor of two as the simulation is done forwards and backwards. The tree is not really a computational construct but rather a theoretical one.

On Thu, Jul 11, 2013 at 1:44 PM, dlakelan notifications@github.com wrote:

AES encryption is not a standard type of random number generator, I'm not sure skipping ahead in the sequence is something you need to do, rather you could use separate keys to generate separate streams. Although this is unattractive when you have something like congruential or other simple mathematical generators because the statistical properties can be compromised, encryption algorithms are designed to a much higher level of statistical independence, namely you shouldn't be able to deduce the key from knowing pretty much any practical number of encrypted cyphertexts with known plain text. In addition, without the key you shouldn't be able to deduce anything about the plaintext, even if you know almost all of it. In other words, parallel streams should be very much independent. In particular imagine generating a stream of random numbers by taking a 256 bit random key and a 1024 byte seed, and encrypting the message {seed 00000...} with the key in cipher block chaining (CBC) mode. There are 2^8448 possible streams. Don't like that? Make a 2048 byte seed, there are now 2^16640 possible streams. The algorithm is designed so that even if you use the same key on an enormous number of streams and you know they all start with 1024 random bytes and are followed by an infinite number of zeros, you shouldn't be able to deduce the first 1024 bytes.

Because of this, encryption algorithms produce very high quality random numbers, but are usually slower because such a high requirement for independence of streams requires a lot of steps to mix up the bits. Having hardware accelerated instructions though could make this less of an issue. OpenSSL would probably be the obvious dependency, and it works on linux, windows, and mac and includes hardware acceleration supposedly but I haven't actually used it myself.

I don't know much about threading libraries in C++ these days. I started using threads under BeOS back in the day, and then when it fizzled I used to simply use pthreads in raw form. These days I do data analysis with R and such so I haven't had much need for C++ level threading.

Typically when building a threaded application I would think about what the tasks were and try to break out into separate threads things that were fairly independent. A typical "pattern" is to have a queue that holds work to be done, and a queue that holds results. N threads wait on the to-be-done queue, when something enters it a thread pulls it out, does it, and shoves the results on the finished work queue.

Skimming the paper on NUTS on ArXive I can see that it uses a step in which recursion is used to build trees of transitions or something like that. I can imagine parallelizing this by having BuildTree parallellize the left and right subtrees by pushing the associated work into a queue and having a worker thread pull from that queue and push the results into a result queue... something like that. The advantage would be that you'd be carrying out N leapfrogs in parallel (where N is the number of worker threads), and most of your time is probably spent in the gradient calculations inside leapfrog right?

— Reply to this email directly or view it on GitHubhttps://github.com/stan-dev/stan/issues/39#issuecomment-20828719 .

I'm not familiar with the internals of stan enough to give specifics of how to make it more parallel via threading. Perhaps running parallel chains in separate processes is as good a plan as any from a practical perspective. The only problem with that is it essentially gives you N times as many samples in 1 fixed time unit. In a few years perhaps we have 8 or 16 cores in typical workstations, we don't necessarily want to run 16 chains, we might prefer say 4 chains but in 1/2 or 1/4 the time.

Perhaps a "better" way to do thread level parallelization is in the log probability gradient calculation to use the linearity of the derivative to differentiate term by term in parallel? I don't know if that's possible in your code, but it seems likely to be a reasonable strategy again using a queue to push differentiation operations to be carried out by workers who push results onto a finished work queue. The main thread would wait until no more terms were in the work queue, and then pop all the finished terms and add them up.

Thanks for the detail on encryption-based RNGs.

As to parallelizing HMC/NUTS, the time is indeed all spent in gradient calcs. At least with reverse-mode auto-diff, it's difficult to parallelize, though probably not impossible with a radical rethink of how the memory gets allocated and deallocated. Right now it's a custom arena and the topological sort of the expression graph is a byproduct of its being constructed serially in order and the reverse sweep to propagate the chain rule backwards through the expression graph is serial. All that's technically required is that every expression's superexpression is evaluated first. So graph-coloring algorithms could be used to both find independent modules in building it and in evaluating it. Shared expressions in the graph need to be synch points.

For NUTS tree building, a typical order would be 1-step left, 2 steps left, 4 steps right, 8 steps right, 16 steps right, 32 steps left. Stop. You don't know how far to go before stopping ahead of time and you don't know which direction the next doubling should be in. So I don't see how to go both ways in parallel without wasting a ton of work on the last step.

We do want to look at things like ensemble methods, which are very easy to parallelize, but also very compute intensive to start with. And they don't involve gradients, which is good.

There's the obvious parallelization of chains, and the less obvious approach we've been looking at of controlling by time. So we would use something like a queue to do iterations of NUTS on multiple chains, but each chain would evolve independently and they wouldn't necessarily all run to the same time. Right now, what can (and often does) happen is that a single chain will be slow. If we can do earlier stopping, perhaps with cross-chain communication for adaptation, etc., it'd be a big win.

On 7/11/13 4:36 PM, Marcus Brubaker wrote:

The construction of the tree in NUTS can be parallelized by, at most, a factor of two as the simulation is done forwards and backwards. The tree is not really a computational construct but rather a theoretical one.

On Thu, Jul 11, 2013 at 1:44 PM, dlakelan notifications@github.com wrote:

AES encryption is not a standard type of random number generator, I'm not sure skipping ahead in the sequence is something you need to do, rather you could use separate keys to generate separate streams. Although this is unattractive when you have something like congruential or other simple mathematical generators because the statistical properties can be compromised, encryption algorithms are designed to a much higher level of statistical independence, namely you shouldn't be able to deduce the key from knowing pretty much any practical number of encrypted cyphertexts with known plain text. In addition, without the key you shouldn't be able to deduce anything about the plaintext, even if you know almost all of it. In other words, parallel streams should be very much independent. In particular imagine generating a stream of random numbers by taking a 256 bit random key and a 1024 byte seed, and encrypting the message {seed 00000...} with the key in cipher block chaining (CBC) mode. There are 2^8448 possible streams. Don't like that? Make a 2048 byte seed, there are now 2^16640 possible streams. The algorithm is designed so that even if you use the same key on an enormous number of streams and you know they all start with 1024 random bytes and are followed by an infinite number of zeros, you shouldn't be able to deduce the first 1024 bytes.

Because of this, encryption algorithms produce very high quality random numbers, but are usually slower because such a high requirement for independence of streams requires a lot of steps to mix up the bits. Having hardware accelerated instructions though could make this less of an issue. OpenSSL would probably be the obvious dependency, and it works on linux, windows, and mac and includes hardware acceleration supposedly but I haven't actually used it myself.

I don't know much about threading libraries in C++ these days. I started using threads under BeOS back in the day, and then when it fizzled I used to simply use pthreads in raw form. These days I do data analysis with R and such so I haven't had much need for C++ level threading.

Typically when building a threaded application I would think about what the tasks were and try to break out into separate threads things that were fairly independent. A typical "pattern" is to have a queue that holds work to be done, and a queue that holds results. N threads wait on the to-be-done queue, when something enters it a thread pulls it out, does it, and shoves the results on the finished work queue.

Skimming the paper on NUTS on ArXive I can see that it uses a step in which recursion is used to build trees of transitions or something like that. I can imagine parallelizing this by having BuildTree parallellize the left and right subtrees by pushing the associated work into a queue and having a worker thread pull from that queue and push the results into a result queue... something like that. The advantage would be that you'd be carrying out N leapfrogs in parallel (where N is the number of worker threads), and most of your time is probably spent in the gradient calculations inside leapfrog right?

— Reply to this email directly or view it on GitHubhttps://github.com/stan-dev/stan/issues/39#issuecomment-20828719 .

— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/39#issuecomment-20840422.

On 7/11/13 8:40 PM, dlakelan wrote:

I'm not familiar with the internals of stan enough to give specifics of how to make it more parallel via threading. Perhaps running parallel chains in separate processes is as good a plan as any from a practical perspective. The only problem with that is it essentially gives you N times as many samples in 1 fixed time unit. In a few years perhaps we have 8 or 16 cores in typical workstations, we don't necessarily want to run 16 chains, we might prefer say 4 chains but in 1/2 or 1/4 the time.

Absolutely. And we're already there on decent desktop workstations.

Adaptation is one area where working in parallel may help, but overall, this won't lead to a huge speedup because adaptation is already pretty fast.

Everyone also keeps urging us to look at cluster computing, which is a challenge to do with anything other than multiple chains.

At the other extreme, using GPUs is also hard given the way auto-diff works, but we're working on breaking it down into more component ops to at least get the advantages of SSE loop unfolding in some of our deeper loops. Right now, it's a matrix of structs that interferes with basic compiler optimization.

Going back the other way, we may also be wanting to run multiple models at once, which affords more opportunities for parallelism. That's easy in theory, but we'd like to be able to do model exploration somehow.

I think the operation here's fast enough that a synched queue would be a major bottleneck. Instead, we could do some kind of static analysis of the expression graph to figure out what can be done in parallel.

With reverse-mode autodiff, the result with respect to an expression needs to be calculated before its subexpressions are considered. With forward mode, it's the other way around (i.e., the usual way, where you finish all subexpressions before a superexpression).

On 7/11/13 8:48 PM, dlakelan wrote:

Perhaps a "better" way to do thread level parallelization is in the log probability gradient calculation to use the linearity of the derivative to differentiate term by term in parallel? I don't know if that's possible in your code, but it seems likely to be a reasonable strategy again using a queue to push differentiation operations to be carried out by workers who push results onto a finished work queue. The main thread would wait until no more terms were in the work queue, and then pop all the finished terms and add them up.

— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/39#issuecomment-20852409.

I'm going to close this for now as we have no immediate plans for multithreading, let alone trying to figure out a robust way of doing parallel tempering (especially when we can adiabatically flow, instead).

stan-dev / stan

Parallel Tempering in multiple threads? #39