Yes, most of the sampling code isn't written in Theano. Some places, like NUTS, have parts in Theano, but most don't. That's what would be required to run it on the GPU. The model itself, however, should be able to run on the GPU; do you find that it does?
@twiecki, I was just wondering about that. I noticed that Theano isn't used throughout the Distribution code (I wanted to make a Distribution.shape parameter symbolic), nor in the Distribution.random logic. Are there some fundamental limitations that prevent more Theano use?
I can understand that using samplers from other libraries might be too much work to Theano-ize, but Theano seems to cover a few basic ones itself, and some in PyMC3 are manually implemented with a uniform sampler (e.g. Wald).
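For context, here is a sketch of the kind of manual implementation meant above: the standard inverse-Gaussian (Wald) sampler of Michael, Schucany & Haas (1976), built from one normal and one uniform draw. This is illustrative, not PyMC3's exact code; mu and lam are the distribution's parameters.

import numpy as np

def rwald(mu, lam, size=None):
    # One chi-square(1) draw via a squared standard normal
    y = np.random.normal(size=size) ** 2
    x = (mu + mu ** 2 * y / (2.0 * lam)
         - mu / (2.0 * lam) * np.sqrt(4.0 * mu * lam * y + mu ** 2 * y ** 2))
    # One uniform draw then decides between the two candidate roots
    u = np.random.uniform(size=size)
    return np.where(u <= mu / (mu + x), x, mu ** 2 / x)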
Also, the current Distribution.random framework has to manually work out the arguments to, and compile, potentially numerous Theano functions (when collapsing the conditional dependency tree) and cache those. It looks like it must, because it keeps jumping in and out of the Theano world (e.g. sample through non-Theano means, compile a Theano function for connecting Deterministics, evaluate at the sample, repeat). It seems like most, if not all, of this could be handled by a single pure Theano sampling function, possibly with clever use of the givens parameter. No?
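A rough sketch of that idea on a toy two-node chain (the names here are illustrative, not PyMC3 internals): keep the parent's draw symbolic and let givens splice it into the dependent expression, so a single compiled function covers the whole pass.

import theano
import theano.tensor as tt
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=42)

mu = srng.normal(avg=0.0, std=1.0)  # symbolic draw for the parent node
x = tt.scalar('x')                  # placeholder standing in for the parent
y = 2.0 * x + 1.0                   # a downstream Deterministic-style node

# givens substitutes the symbolic draw for the placeholder at compile time,
# so each call draws a fresh parent value and propagates it in one pass
sample_y = theano.function([], y, givens={x: mu})
print([sample_y() for _ in range(3)])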
You're right. The random sampling isn't part of the inference though so I don't think it's a big problem for it to not live inside theano. Unless the underlying code could be much simpler when done in Theano.
@sivasubbub You need to set device=gpu in the .theanorc file in your home directory; more about that here: http://deeplearning.net/software/theano/install.html#install
You also have to make sure that all the numbers you are pushing around are float32 or below; otherwise it will fall back on the CPU.
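A minimal ~/.theanorc along those lines (a sketch; the exact values depend on your setup):

[global]
device = gpu
floatX = float32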
I would love to see a successful example of this; specifically, running the Bayesian neural network on the GPU would be cool: https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/bayesian_neural_network_advi.ipynb
As I am going to write a proposal on using Bayesian neural networks in my field, I will work on that in the near future, from September on. If you don't need it fast, I can cover this. Thanks for providing this example, by the way. It is great!
Just checking in on the status of GPU support in PyMC3. I am using the current dev branches of Theano and PyMC3.
I took one of the examples listed under the PyMC3 documentation and ran it while monitoring GPU utilization using watch -n 0.1 nvidia-smi. At no point did the GPU utilization jump above 0%. When I ran a different script for testing Theano on the GPU, I saw the utilization jump to ~30% or so. This suggests that PyMC3 is not using the GPU at all (either in NUTS or in the model evaluation).
PyMC3 model:
import theano
import numpy as np
from pymc3 import *

theano.config.floatX = 'float32'
theano.config.compute_test_value = 'raise'
theano.config.exception_verbosity = 'high'

# Simulated data: a line with Gaussian noise, cast to float32 for the GPU
size = 200
true_intercept = 1
true_slope = 2
x = np.linspace(0, 1, size)
true_regression_line = true_intercept + true_slope * x
y = true_regression_line + np.random.normal(scale=.5, size=size)
x = x.astype('float32')
y = y.astype('float32')
data = dict(x=x, y=y)

with Model() as model:
    sigma = HalfCauchy('sigma', beta=10, testval=1.)
    intercept = Normal('Intercept', 0, sd=20)
    x_coeff = Normal('x', 0, sd=20)
    likelihood = Normal('y', mu=intercept + x_coeff * x,
                        sd=sigma, observed=y)
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(20000, step, start=start, progressbar=True)
Theano test code:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
I started a branch here: https://github.com/pymc-devs/pymc3/pull/1338
The reason is that PyMC3 forces the dtype (which the branch relaxes). Unfortunately, that caused some other problems. However, it would be helpful if you could confirm that you can utilize the GPU when using that branch:
pip install -U --no-deps git+https://github.com/pymc-devs/pymc3@use_theano_float_type2
I confirmed that it uses the GPU. However, the example actually runs much more slowly on the GPU than on the CPU. I tried switching to Metropolis and got an error (I assume that's one of the other problems you were referring to). I assume the slowdown has to do with switching between the CPU for sampling and the GPU for model evaluation?
@bburan Interesting, thanks for sharing. Which model did you use? My hunch is that multivariate models with e.g. dot products should have the most potential for speed-ups. Maybe try the neural network example?
Making progress on this front and figuring out why it's slow would be a major contribution.
For sure; as the step methods are still run on the CPU, most of the slowdown will be caused by copying data to and from the GPU. One would need to change that in order to have everything running on the GPU. That would require rewriting the step methods with Theano operators, or one could add separate step_GPU methods and select between the two based on the theano.config.device flag, as sketched below. Once the float32/64 variable issue is solved, the old step methods would become obsolete.
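A sketch of that dispatch idea (NUTS_GPU does not exist in PyMC3; it stands in for a hypothetical step method built entirely from Theano operators):

import theano
from pymc3 import NUTS

def make_step(start, gpu_step_cls=None):
    # Pick a GPU-aware step method when Theano is configured for the GPU;
    # gpu_step_cls would be, e.g., the hypothetical NUTS_GPU class
    if gpu_step_cls is not None and theano.config.device.startswith('gpu'):
        return gpu_step_cls(scaling=start)
    return NUTS(scaling=start)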
NUTS is already partly moved to Theano; specifically, the leapfrog function is. Moving buildtree should be possible as well.
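For illustration, a minimal leapfrog step as a single compiled Theano function; this is a sketch, not PyMC3's actual implementation, and the standard-normal logp stands in for a real model:

import theano
import theano.tensor as tt

q = tt.vector('q')      # position
p = tt.vector('p')      # momentum
eps = tt.scalar('eps')  # step size

logp = -0.5 * tt.sum(q ** 2)  # stand-in log-density
dlogp = tt.grad(logp, q)

p_half = p + 0.5 * eps * dlogp                       # half momentum step
q_new = q + eps * p_half                             # full position step
dlogp_new = theano.clone(dlogp, replace={q: q_new})  # gradient at new position
p_new = p_half + 0.5 * eps * dlogp_new               # second half momentum step

leapfrog = theano.function([q, p, eps], [q_new, p_new],
                           allow_input_downcast=True)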
Intuitively, it's not clear to me why having the sampler on the CPU and the model on the GPU is expected to cause such a slowdown. The model logp, logp gradient, and data should all be on the GPU. The only thing the sampler does is provide a new parameter point to evaluate, and that's a very small thing to copy. Perhaps it's worth double-checking that the data in the observeds is indeed stored on the GPU and not copied. We should also check whether ADVI runs slower or faster.
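One quick way to double-check, sketched under the assumption that model is the pymc3.Model from the example above: compile the model's logp and look for Gpu ops in the optimized graph.

import theano

logp_fn = theano.function(model.vars, model.logpt)
op_names = [type(node.op).__name__ for node in logp_fn.maker.fgraph.toposort()]
print('GPU ops present:', any(name.startswith('Gpu') for name in op_names))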
In any case, having the step methods (at least nuts and advi) in theano would be great. I think @brandonwillard also mentioned this.
@taku-y any thoughts on having ADVI run in theano? My feeling is that a lot of it is already moved to theano.
I've got a branch on my fork where I'm working to implement some of the ideas from the Exhaustive Hamiltonian Monte Carlo paper. Largely that means factoring out different methods for building trajectories, validating trajectories, and terminating trajectories, and then implementing HMC, NUTS and XHMC as special cases.
There hasn't been tons of progress, but get_theano_hamiltonian_functions might be worth looking at.
@ColCarroll Looks like a promising start!
@bburan Can you run your model on the GPU using ADVI and compare the speed? ADVI is in theano so potential for speed-ups is higher.
I am not sure if you are still interested in the run-time of the Bayesian neural network (using ADVI) on a GPU. I installed the dev version using the above command and ran the net on a GPU. The run-time (for the "Average ELBO" step) is pretty much the same as you report on your laptop, around 24-25 seconds. I also noticed that the GPU was not being used. I am not sure if I am missing some parameter settings for PyMC3.
@Spaak apparently got it to work; not sure if he had to change anything. Also, can you try this branch: https://github.com/pymc-devs/pymc3/pull/1690
Thanks. I installed the newer unify_float_type branch, but no change.
Interestingly, when I run the sample code listed by @bburan above (using NUTS), I see that the GPU is used (up to ~55%). However, when I run the Bayesian neural network code (using ADVI), the GPU is unused.
Are you sure the model is float32?
Thanks for the tip. Initially, I did use float32 and got an error. After that, I made changes to the installation per your suggestion, and along with that unintentionally used another version of the code with float64.
In any case, when I use float32, I get the following error. Any insights? It seems to point to Theano and then PyMC3 files. I am wondering if this has anything to do with how the underlying Theano function is called (http://deeplearning.net/software/theano/library/compile/function.html). BTW, I still use Python 2.7 even though it is recommended to use 3.4.
Traceback (most recent call last):
  File "bayesian_neural_network.py", line 47, in <module>
    testval=init_1)
  File "deeplearn/local/lib/python2.7/site-packages/pymc3/distributions/distribution.py", line 35, in __new__
    return model.Var(name, dist, data, total_size)
  File "deeplearn/local/lib/python2.7/site-packages/pymc3/model.py", line 497, in Var
    total_size=total_size, model=self)
  File "deeplearn/local/lib/python2.7/site-packages/pymc3/model.py", line 762, in __init__
    self.logp_elemwiset = distribution.logp(self)
  File "deeplearn/local/lib/python2.7/site-packages/pymc3/distributions/continuous.py", line 231, in logp
    return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,
  File "deeplearn/local/lib/python2.7/site-packages/theano/tensor/var.py", line 154, in __sub__
    return theano.tensor.basic.sub(self, other)
  File "deeplearn/local/lib/python2.7/site-packages/theano/gof/op.py", line 621, in __call__
    storage_map[ins] = [self._get_test_value(ins)]
  File "deeplearn/local/lib/python2.7/site-packages/theano/gof/op.py", line 549, in _get_test_value
    ret = v.type.filter(v.tag.test_value)
  File "deeplearn/local/lib/python2.7/site-packages/theano/tensor/type.py", line 139, in filter
    raise TypeError(err_msg, data)
TypeError: For compute_test_value, one input test value does not have the requested type. The error when converting the test value to that variable type: TensorType(float32, matrix) cannot store a value of dtype float64 without risking loss of precision. If you do not mind this loss, you can: 1) explicitly cast your data to float32, or 2) set "allow_input_downcast=True" when calling "function".
Are you up to date with the most recent master (I merged the branch), and did you set theano.config.floatX to float32? Also, make sure that the test values are float32.
I had the same problem recently. Using float32 resulted in overflow and thus NaNs. What about using floatX everywhere to avoid such problems?
I set both floatX=float32 in the ~/.theanorc file and theano.config.floatX = 'float32' in the code file. FYI, I have not seen this error with any of my other Theano code, nor with code using high-level deep learning libraries.
I will check version of the installation and get back.
@ferrine I tried to use floatX where necessary.
@ramanujasimha Setting warn_float64 = warn in .theanorc can tell you where the problems are.
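A runtime equivalent, sketched (warn_float64 also accepts 'raise' and 'pdb' for more aggressive debugging):

import theano
theano.config.warn_float64 = 'warn'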
Thank you for all the input. I got the Bayesian neural network to work on the GPU.
The GPU run-time (21-22 seconds) is very close to the CPU run-time (24-25 seconds). I expected a higher speed-up. The max GPU utilization I noticed is ~80%. BTW, the compute capability of my GPU is 5.2.
Using numpy, I explicitly set types of all variables to be 32-bit.
Can you post the code here?
I used code from this link:
http://pymc-devs.github.io/pymc3/notebooks/bayesian_neural_network_advi.html
The only changes I made were:
In [2]:

X = np.array(X, dtype='float32')
Y = np.array(Y, dtype='int32')

In [4]:

init_1 = np.random.randn(X.shape[1], n_hidden).astype(np.float32)
init_2 = np.random.randn(n_hidden, n_hidden).astype(np.float32)
init_out = np.random.randn(n_hidden).astype(np.float32)
Thanks, we should change that example to use theano.config.floatX.
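The suggested change, sketched against the notebook's variables (X, Y, n_hidden are defined there): cast with theano.config.floatX instead of hard-coding float32.

import numpy as np
import theano

X = np.asarray(X, dtype=theano.config.floatX)
init_1 = np.random.randn(X.shape[1], n_hidden).astype(theano.config.floatX)
init_2 = np.random.randn(n_hidden, n_hidden).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden).astype(theano.config.floatX)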
Closing this as sampling seems to utilize GPU cores now.
@twiecki this is pretty exciting. Are there posted GPU/CPU benchmarks in PyMC3? Definitely curious about the performance boosts!
Me too :). No posted benchmarks yet but let me know if you want to help with that.
Hi,
I am trying to implement MCMC using PyMC3. While executing the program, the sampling part takes a long time to complete, and I don't see any GPU utilization (0%) via the nvidia-smi command.
trace = sample(5000, start=start, njobs=4)
In the code above, I configured the number of chains to be 4, but it utilizes 4 CPU cores instead of the GPU while running the sampling part.
How can I make PyMC3 sampling utilize the GPU effectively?