@hvasbath I really like that direction and it's pretty obvious that GPU sampling won't be fast without it. Seems like it would potentially speed up CPU sampling as well.
I suppose this would only work for purely theano-based samplers like HMC, right?
Any way to prototype this?
So that would mean rewriting the random method in theano? I think there is a discussion about this somewhere...
Not necessarily, @junpenglao - at least up to some chain length and number of variables, you would produce all your random samples beforehand, at once! The repeated calling is what makes things slow. @twiecki and I had a discussion about this one year ago, here: https://github.com/pymc-devs/pymc3/issues/1034 - somehow this got out of sight again. I did it that way in SMC. That's exactly the point, @twiecki! All the step methods of the samplers would need to be reimplemented in theano, which shouldn't be too difficult. The step objects could get step.astep_theano methods or something, if you still want a choice of which one to use.
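Just to illustrate the pre-generation part (toy numbers, nothing PyMC3-specific): one vectorized RNG call up front instead of a call inside every step.

```python
import numpy as np

n_steps, n_params = 5000, 50
rng = np.random.RandomState(42)

# one vectorized call up front ...
deltas = rng.standard_normal((n_steps, n_params))

point = np.zeros(n_params)
for i in range(n_steps):
    # ... instead of rng.standard_normal(n_params) inside the loop
    proposal = point + 0.1 * deltas[i]
    # accept/reject logic of the step method would go here
```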
@hvasbath I went back and read the issue and PRs - so the idea is to refactor the samplers so that the proposal mechanism is performed in Theano instead of being repeatedly called from numpy/scipy? And does the originally proposed solution of generating a fixed-size random number array (as in SMC) still work?
@junpenglao that's only one part of the problem that would need to be fixed. It could already be fixed easily in numpy, as shown in the other PR - I think it got closed because no one had time to figure out why that one model was not working. The main thing is transforming the astep methods of the samplers to theano, in order to create a complete graph for the sampling from the start until a defined number of draws is reached. This would allow moving everything to the GPU once at the beginning and then crunching the numbers on the GPU without communication overhead, which would result in a huge speedup for many models.
@hvasbath I've been thinking a bit about this too, mainly in the context of NUTS. If I understand correctly what you are proposing, then implementing NUTS in theano is at least a major undertaking. I think all we can realistically hope for is to avoid transfers by keeping the values and gradients in GPU memory during each tree extension, while still using a theano function only for each leapfrog step. Maybe it would help us see what is going on if we converted the tree extension to an iterative algorithm that preallocates the necessary memory.
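Roughly what I have in mind, as a toy sketch (a quadratic logp standing in for a real model, all names invented): position and momentum live in shared variables, so with device=gpu they stay in GPU memory, and one compiled function advances them in place.

```python
import numpy as np
import theano
import theano.tensor as tt

floatX = theano.config.floatX
q = theano.shared(np.ones(10, dtype=floatX), name='q')   # position
p = theano.shared(np.zeros(10, dtype=floatX), name='p')  # momentum
eps = np.asarray(0.1, dtype=floatX)                      # step size

logp = -0.5 * tt.sum(q ** 2)        # toy standard-normal logp
dlogp = tt.grad(logp, q)

# one leapfrog step; the gradient at the new position is obtained by
# substituting q -> q_new in the gradient graph
p_half = p + 0.5 * eps * dlogp
q_new = q + eps * p_half
p_new = p_half + 0.5 * eps * theano.clone(dlogp, replace={q: q_new})

# no inputs, no outputs: the state transition happens on the device
leapfrog = theano.function([], [], updates=[(q, q_new), (p, p_new)])

for _ in range(20):   # a tree extension would call this repeatedly
    leapfrog()
print(q.get_value())  # transfer back only when we actually need the value
```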
Yes, for sure it would be a major undertaking, as it would also require replacing _sample and _iter_sample. They would all become part of the graph. I was thinking in terms of SMC, as I have no clue about NUTS ;) . But the main point is of course the same. Also, because of the drastic restructuring/additional effort that would be needed, I started this issue, as it requires a lot of thinking and proper structuring. But in my opinion the benefit would be huge, at least for models that involve a lot of matrix multiplication etc., not to mention starting to think about multiple GPUs ... Likely we need a completely new module like gpu_sampling ...
> Yes, for sure it would be a major undertaking, as it would also require replacing _sample and _iter_sample. They would all become part of the graph.

Does that mean the for-loop in _iter_sample would become a theano.scan?
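I.e. something along these lines? (A toy random-walk Metropolis with a made-up logp, just to make the question concrete.)

```python
import numpy as np
import theano
import theano.tensor as tt
from theano.sandbox.rng_mrg import MRG_RandomStreams

floatX = theano.config.floatX
srng = MRG_RandomStreams(seed=42)
n_params = 2

def logp(x):                        # toy standard-normal model
    return -0.5 * tt.sum(x ** 2)

def metropolis(x):
    proposal = x + 0.5 * srng.normal((n_params,), dtype=floatX)
    accept = tt.log(srng.uniform((1,), dtype=floatX)[0]) < logp(proposal) - logp(x)
    return tt.switch(accept, proposal, x)

x0 = tt.vector('x0')
# the whole chain becomes one symbolic loop ...
draws, updates = theano.scan(metropolis, outputs_info=x0, n_steps=1000)
# ... compiled into a single function call
sample = theano.function([x0], draws, updates=updates)
trace = sample(np.zeros(n_params, dtype=floatX))
```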
As much as I like the thought, unless someone has a lot of very smart ideas this isn't going to work for NUTS.
@junpenglao exactly! Where is the problem in NUTS? Isn't it half theano already? And we don't have to implement everything right away, right ;) . Rome wasn't built in a day either ;) . Then we may stick to doing this only for the Metropolis methods first and think about NUTS later ...
It's basically doing a recursive version of a breadth-first search through a tree built from the trajectory, plus some trickery so that we don't need to store the whole trajectory. It stores O(depth) gradients and values, too. Getting all that logic into a theano.scan is, I think, not an option.
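Schematically, the control flow looks like this (plain Python, not the actual PyMC3 code; the helpers are stubs) - it's the data-dependent recursion depth and early termination that don't map onto scan:

```python
def build_tree(leapfrog_fn, state, depth, diverged, u_turn):
    # base case: a single leapfrog step (one compiled theano call)
    if depth == 0:
        new = leapfrog_fn(state)
        return new, new, not diverged(new)
    # recursive doubling: two subtrees of half the depth
    left, right, ok = build_tree(leapfrog_fn, state, depth - 1, diverged, u_turn)
    if not ok:                       # early termination, data-dependent
        return left, right, False
    _, right, ok = build_tree(leapfrog_fn, right, depth - 1, diverged, u_turn)
    # only the O(depth) boundary states are kept, never the full trajectory
    return left, right, ok and not u_turn(left, right)

# stub helpers just to make the skeleton runnable
leapfrog_fn = lambda s: s + 1
diverged = lambda s: False
u_turn = lambda a, b: False
print(build_tree(leapfrog_fn, 0, 3, diverged, u_turn))   # traverses 2**3 states
```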
We can have a look at the Edward implementation of HMC and MCMC.
Ok yeah, I just had a quick look into it. Might indeed be tricky to do ;) . Anyway, we could still continue talking about it for the "simple" samplers.
I tried an iterative approach to NUTS last year, and it went poorly. Would love to think about it again, but there is a lot of bookkeeping to do. My current thinking is to refactor HMC into modular steps (expand_trajectory, check_trajectory, etc.), so that HMC, NUTS, XHMC, and maybe iterative NUTS are just different combinations of steps.
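Very roughly something like this (a pure-Python skeleton; the signatures are invented, only the step names follow the idea above):

```python
def hmc_step(state, expand_trajectory, check_trajectory, n_steps):
    # fixed-length trajectory, single accept/reject at the end
    trajectory = [state]
    for _ in range(n_steps):
        trajectory.append(expand_trajectory(trajectory[-1]))
    new_state, _ = check_trajectory(trajectory)
    return new_state

def nuts_step(state, expand_trajectory, check_trajectory, max_depth):
    # trajectory that keeps doubling, checked after every extension
    trajectory = [state]
    for depth in range(max_depth):
        for _ in range(2 ** depth):
            trajectory.append(expand_trajectory(trajectory[-1]))
        state, done = check_trajectory(trajectory)
        if done:                     # u-turn / divergence criterion fired
            break
    return state
```

Each building block could then get a theano implementation independently.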
This is more to start a discussion about optimizing pymc3 for GPU usage.
The current state is as follows: a logp function is compiled that is called at each step with non-symbolic inputs. The RVs are returned and recorded in the trace, depending on the backend. This requires a tremendous number of host-to-GPU transfers (and back) during sampling. With so much transfer to and from the GPU, I cannot get my GPU usage above roughly 30%.
However, this could be improved, although it may require significant restructuring - that's what I want to discuss here. theano.function offers the "givens" and "updates" parameters, which could be used to compile the whole sampling into one single function call. The "points" would need to be updated throughout the theano graph, as the next sample often depends on the result of the previous one. This way, every necessary element could be moved to the GPU once, prior to sampling, which would significantly reduce the transfers back and forth.
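As a minimal sketch of the "updates" part (toy graph, invented names): the point lives in a shared variable, so with device=gpu it stays in GPU memory and the compiled function rewrites it in place, instead of shipping it back to the host on every call.

```python
import numpy as np
import theano
import theano.tensor as tt

floatX = theano.config.floatX
point = theano.shared(np.zeros(3, dtype=floatX), name='point')

delta = tt.vector('delta')          # e.g. pre-generated proposal noise
new_point = point + delta           # stand-in for a full proposal/accept step
logp = -0.5 * tt.sum(new_point ** 2)

# the state transition happens on the device; only the scalar logp comes back
step = theano.function([delta], logp, updates=[(point, new_point)])

step(np.ones(3, dtype=floatX))
print(point.get_value())            # read back only when recording the trace
```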
How the recording of the draws to the trace backends would work needs to be discussed as well.
What are your thoughts about that? @twiecki @fonnesbeck @ColCarroll @aseyboldt etc ...