hidasib / GRU4Rec

GRU4Rec is the original Theano implementation of the algorithm described in the paper "Session-based Recommendations with Recurrent Neural Networks", published at ICLR 2016, and its follow-up "Recurrent Neural Networks with Top-k Gains for Session-based Recommendations". The code is optimized for execution on the GPU.

code not faster on GPU #7

Closed: ghost closed this issue 6 years ago

ghost commented 7 years ago

Hi,

Great article and code! I find that training is not faster on GPU (Titan X) than on CPU (MacBook Pro). The train_function takes, for example, 0.003 s on GPU and 0.0015 s on CPU. Did you ever encounter this problem?

Thanks, Massimo
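
For reference, a single-call measurement like the one described could look like this (a sketch only; the toy compiled function below is a hypothetical stand-in for GRU4Rec's real train_function):

import time

import numpy
from theano import function, tensor

# Hypothetical stand-in for the compiled training step; in GRU4Rec the real
# train_function is produced by theano.function(...) inside the model code.
x = tensor.fmatrix('x')
train_function = function([x], (x ** 2).sum())

batch = numpy.random.rand(50, 100).astype('float32')
t0 = time.time()
cost = train_function(batch)
print("one call took %.4f s" % (time.time() - t0))

Note that timings in the millisecond range mostly measure per-call Python and kernel-launch overhead rather than throughput, which is one reason single-call numbers can make a CPU look faster than a GPU.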

hidasib commented 7 years ago

I would be suspicious of those results. Those training times seem extremely low. How much data do you use for training? Do you get any errors?

In practice, training is much faster on GPU than on CPU. There are two bottlenecks on GPU at the moment, but neither hinders execution so much that it would drop below the training speed of a CPU.

frederickayala commented 7 years ago

Have you verified that Theano is configured properly? Check your .theanorc; you can also validate whether Theano is using the GPU with theano.config.device.

http://deeplearning.net/software/theano/library/config.html
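
A minimal version of that check (assuming Theano is installed and picks up your .theanorc):

import theano

print(theano.config.device)  # 'cpu' by default; 'cuda0' (new backend) or 'gpu0' (old backend) when a GPU is configured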

loretoparisi commented 7 years ago

@hidasib Do you have specific training times for different configurations/GPUs?

The paper only states that

The running time depends on the parameters and the dataset. Generally speaking the difference in runtime between the smaller and the larger variant is not too high on a GeForce GTX Titan X GPU and the training of the network can be done in a few hours. On CPU, the smaller network can be trained in a practically acceptable timeframe.

and

The GRU-based approach has substantial gain over the item-KNN in both evaluation metrics on both datasets, even if the number of units is 100. Increasing the number of units further improves the results for pairwise losses, but the accuracy decreases for cross-entropy... Although, increasing the number of units increases the training times, we found that it was not too expensive to move from 100 units to 1000 on GPU.

A note on Theano in the README specifies that

Using Theano with fixes for the subtensor operators on GPU

I'm running via nvidia-docker on a

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   37C    P8    17W / 125W |      2MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The training process, as seen in top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                   
  131 root      20   0 32.726g 3.766g  39908 R 399.3 25.6   4357:04 python                                                                                                    

I bet that I'm running on CPU (note the ~400% CPU usage above). Printing the Theano configuration attributes will reveal it:

python -c 'import theano; print(theano.config)' | less

loretoparisi commented 7 years ago

[UPDATE]

OK, I have figured it out. First, create a .theanorc file in $HOME with these minimal configuration attributes:

[global]
floatX = float32
device = cuda0

[lib]
cnmem = 1

[nvcc]
fastmath = True
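
As an aside (not from the original thread, just a convenience worth noting), the same settings can also be supplied per-process through the THEANO_FLAGS environment variable, which Theano reads at import time; a minimal sketch:

import os

# THEANO_FLAGS must be set before theano is imported; the flag names mirror .theanorc
os.environ['THEANO_FLAGS'] = 'floatX=float32,device=cuda0'

import theano
print(theano.config.device)  # should report cuda0 if the GPU backend initialized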

Then, to check that the GPU is actually used, run this Python script (the standard GPU test from the Theano documentation):

from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
# the float32 shared variable is placed on the GPU if one is configured
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())  # lists the compiled ops; GPU ops carry 'Gpu' in their name
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# if any elementwise op in the compiled graph is a plain (non-Gpu) Elemwise, the CPU was used
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

and test whether the device is detected:

root@d842fc00a358:~/GRU4Rec/examples/rsc15# /root/yes/lib/python3.5/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6021 on context None
Mapped name None to device cuda0: GRID K520 (0000:00:03.0)

In my case I see a warning about cuDNN, but this depends on its version. If the GPU device has been detected (your configuration states device = cuda0), you can restart the training and see what happens. I get a segmentation fault within a few minutes, so it is possibly due to the previous warning...
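
To see exactly which cuDNN version Theano picked up, something like the following should work (a sketch; I am assuming theano.gpuarray.dnn.version() is available, since the warning above originates from that module):

# assumption: the gpuarray backend exposes dnn.version(); adjust if your Theano differs
import theano  # initializes the device according to .theanorc
from theano.gpuarray import dnn

print(dnn.version())  # e.g. 6021, matching the "Using cuDNN version 6021" line above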

loretoparisi commented 6 years ago

I have reported the segmentation fault to Theano; since training on CPU works, it may be due to something else.