Closed rakeshvar closed 8 years ago
According to my experiments, even when the training data is stored in shared variables, training is still slower on GPUs than on CPUs! Do you have time to optimize it for GPUs?
I too tried moving the variables to the GPU, but the bug is in Theano itself [https://github.com/Theano/Theano/issues/1168], so the current setup seems to be the best as is. There is no point running RNNs on GPUs with Theano.
Btw. could you get rnn_ctc to work with anything other than the Hindu numerals?
I just found the reason why it is so slow on GPUs: the model is not complicated enough. Once I use 2000 BLSTM hidden units, the GPU load rises to 98%, and it is 10x faster on the GPU than on the CPU, as follows.
@aaron-xichen, wow! I can't believe you can run 2000 BLSTM units. Can you fork this project and push your changes to a branch, so that I can see how you are profiling? I am having difficulty profiling. Or you could just send a gist link to the files you changed. Thanks a lot.
Do you mean 2000 BLSTM units would be too slow for you to run? In fact I did not change much, only the 90 to 2000 in config number 10 in configurations.py. As for the profiling, I have not cleaned up my code yet, which is a mess, sorry about that. But I can tell you the idea: compile multiple Theano functions that output the intermediate results of the different stages, and time each call, for example cst1 for stage 1, cst2 for stages 1 and 2. Then cst2 - cst1 gives you the running time of stage 2, roughly. Does that make sense?
It would be much easier for me to just take a look at the profiling code and see what you are doing.
iter_begin = time.time()
ntwk.stage1()   # forward pass up to layer 1
counter1 = time.time()
st1 = counter1 - iter_begin
ntwk.stage2()   # recomputes layer 1, then layer 2
counter2 = time.time()
st2 = counter2 - counter1
ntwk.stage3()   # full forward pass up to the cost
counter3 = time.time()
st3 = counter3 - counter2
ntwk.trainer()  # forward pass plus parameter updates
counter4 = time.time()
st4 = counter4 - counter3
# Each measurement above is cumulative (stage N recomputes stages 1..N),
# so the differences below isolate the time spent in each stage.
print("stage1:{0:0.3f}, stage2:{1:0.3f}, stage3:{2:0.3f}, bp:{3:0.3f}".
      format(st1, st2 - st1, st3 - st2, st4 - st3))
Is it clear enough?
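The cumulative-timing trick above can be sketched as a small standalone helper (plain Python; the placeholder stage functions below stand in for the compiled Theano functions, and each one redoes the work of the stages before it, just as the compiled graphs do):

```python
import time

def time_stages(stages):
    """Time a list of stage functions, where each stage recomputes all
    previous work; return the isolated per-stage durations."""
    cumulative = []
    for stage in stages:
        start = time.time()
        stage()
        cumulative.append(time.time() - start)
    # Subtract adjacent cumulative timings to isolate each stage.
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

# Placeholder stages: each redoes the work of those before it.
def stage1(): sum(range(100_000))
def stage2(): stage1(); sum(range(200_000))
def stage3(): stage2(); sum(range(300_000))

durations = time_stages([stage1, stage2, stage3])
print(", ".join("stage{}:{:0.3f}".format(i + 1, d)
                for i, d in enumerate(durations)))
```

Note the differences can be slightly noisy for fast stages, since each call pays the full recomputation cost again.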
So you are compiling three functions? stage1, stage2 and stage3?
In fact that is running the three functions, not compiling them. The compiling code looks like this:
self.stage1 = th.function(
    inputs=[],
    outputs=[layer1.output],
    givens={
        image: x,
    })
self.stage2 = th.function(
    inputs=[],
    outputs=[layer2.output],
    givens={
        image: x,
    })
self.stage3 = th.function(
    inputs=[],
    outputs=[layer3.cost],
    givens={
        image: x,
        labels: y,
    })
self.trainer = th.function(
    inputs=[],
    outputs=[layer3.cost],
    updates=updates,
    givens={
        image: x,
        labels: y,
    })
Got it. Thanks for the profiling. My other question. Did you get any other example to work (other than Hindu numerals)?
In fact I haven't tried yet; this result is based on my own dataset, which is similar to the ASCII dataset.
Great to hear the GPU could run faster. However, on a K20 GPU it seems that compilation is a lot slower than on the CPU. I have to wait a long time while the running stage still says "building model". How long does it take to start training a 2000-unit BLSTM on the GPU?
The first time may cost a lot, maybe 1.5 min on a TITAN X. However, Theano has a cache mechanism that accelerates the compiling stage on subsequent runs, which for me cost less than 1 min. Hope it helps.
Currently training is slower on GPUs than on CPUs, because the training data is not a shared variable.