odysszis / AML

Repository for kaggle projects for the module Applied Machine Learning

Investigate Theano GPU requirements #5

Open PeadarOhAodha opened 8 years ago

PeadarOhAodha commented 8 years ago

Document for the group what's required (CUDA etc.?) for running Theano with a GPU.

MaxHoefl commented 8 years ago

CUDA backend for Theano: If your graphics card is not listed on this site: https://developer.nvidia.com/cuda-gpus, you will not be able to use CUDA. This is the case for me (Intel HD 6000), so I have to use:

GPUarray (OpenCL) backend for Theano: We first need the libgpuarray library. This site provides very simple instructions on how to install it: http://deeplearning.net/software/libgpuarray/installation.html
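For reference, here is a minimal sketch of pointing Theano at the gpuarray/OpenCL backend from Python. The device name opencl0:0 is an assumption; it depends on your OpenCL platform and device ordering.

import os

# THEANO_FLAGS must be set before theano is imported.
# 'opencl0:0' is an assumed device name; adjust it to your platform/device.
os.environ['THEANO_FLAGS'] = 'device=opencl0:0,floatX=float32'

import theano
print(theano.config.device)  # should echo the selected device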

Once that's done, you can check whether your GPU is being used by running this code in Python:

from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
# config.floatX as dtype so the data can live on the GPU (typically float32)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):  # use xrange under Python 2
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# If any element-wise op in the compiled graph is not a Gpu op, the CPU was used
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

The crucial line is x = shared(numpy.asarray(rng.rand(vlen), config.floatX)). The first argument of the shared variable constructor is the value (here, a numpy array of random numbers), and the second, config.floatX, is needed so that the code can run on the GPU. This was not obvious to me, because the config.floatX documentation only says that it "sets the default theano bit width for arguments passed as Python floating-point numbers"; the likely reason is that the GPU backends only support float32, so casting the data to config.floatX (typically 'float32') makes it GPU-compatible.
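A small sketch of what config.floatX does; the printed values assume floatX has been set to 'float32', e.g. via THEANO_FLAGS=floatX=float32:

import numpy, theano

print(theano.config.floatX)  # 'float32' if configured as above, otherwise 'float64'

# Passing config.floatX as the dtype casts the data to that width,
# which is what the GPU backends expect.
a = numpy.asarray(numpy.random.rand(4), dtype=theano.config.floatX)
print(a.dtype)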

Memory aliasing in Theano

Here are some cool facts about how Theano allocates memory (as far as I have understood from reading http://deeplearning.net/software/theano/tutorial/aliasing.html#borrowing-when-creating-shared-variables): Using CPU: whenever a theano.shared variable is constructed, it gets a copy of the value argument. Example:

import numpy, theano
np_array = numpy.ones(2, dtype='float32')

s_default = theano.shared(np_array) # the constructor gets a copy of np_array
s_false   = theano.shared(np_array, borrow=False) #  the constructor gets a copy of np_array
s_true    = theano.shared(np_array, borrow=True) # the constructor gets a pointer to np_array

Any changes to np_array will affect neither s_default nor s_false, but they will affect s_true. Construction can be sped up considerably with the borrow=True flag when the arrays passed to shared variables are very large. However, the borrow=True flag only has this aliasing effect when the device is a CPU.
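A quick sketch of the difference on a CPU device (get_value() returns a copy by default, so the printed values show whether the shared variable aliased np_array at construction):

import numpy, theano

np_array = numpy.ones(2, dtype='float32')
s_default = theano.shared(np_array)               # constructor copies np_array
s_true    = theano.shared(np_array, borrow=True)  # constructor keeps a pointer (CPU only)

np_array += 1  # modify the original array in place

print(s_default.get_value())  # [ 1.  1.] -- unaffected by the in-place change
print(s_true.get_value())     # [ 2.  2.] -- reflects the change, it shares the buffer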

Using GPU: this aliasing between s_true and np_array cannot occur on a GPU, because Theano manages its own memory space there and hence does not hold a pointer to np_array's buffer (please check the validity of this argument). Where we can harvest speed on a GPU is with the borrow=True flag in theano.function:

x = theano.tensor.fvector('x')
y = theano.tensor.exp(x)
f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))

Here, Theano does not create new temporary storage for x but reuses the input as a buffer. That is, x may be overwritten if the function modifies the input before returning the output. The same is true for the output: Theano reuses the output buffer whenever the function is called. The speed advantage becomes immediately apparent when a long for loop recalculates f for different inputs (it makes a huge difference).
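A minimal, runnable sketch of the output-buffer reuse described above; the tensor type (fvector) and variable names are my own choices, not from the Theano docs:

import numpy, theano
import theano.tensor as T

x = T.fvector('x')
y = T.exp(x)
# borrow=True lets Theano hand back its internal output buffer instead of copying it
f = theano.function([theano.In(x, borrow=True)],
                    theano.Out(y, borrow=True))

a = numpy.zeros(3, dtype='float32')
b = numpy.ones(3, dtype='float32')

out1 = f(a)   # may alias Theano's internal output buffer
print(out1)   # exp(0) -> [ 1.  1.  1.]
out2 = f(b)   # the second call may reuse (overwrite) that same buffer
print(out1)   # may now show exp(1); call out1.copy() if you need to keep the first result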

That's it for now. Please correct me if I misunderstood something, and read http://deeplearning.net/software/theano/tutorial/aliasing.html#borrowing-when-creating-shared-variables