It may be, and it is not only a matter of the GPU: in these cases most of the overhead is inside the scan loop of the recurrent neural networks. Try to profile your run and you will see where the overhead is: set the number of training epochs to 1 and the Theano flag profile=1. It will print a profile indicating where the overhead of your model is; most of the time is probably spent inside the loops.
Unfortunately there is no easy solution for that in Keras at the moment, because Theano's scan is really slow.
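For reference, enabling the profiler could look roughly like the sketch below; the tiny Theano graph is only there to make the snippet self-contained, and with a real Keras model you would instead compile the model and call model.fit(..., nb_epoch=1):

```python
import os
# The Theano profiler flag must be set before Theano is imported
# (Keras imports Theano, so set it before importing Keras as well).
os.environ["THEANO_FLAGS"] = "profile=True"

import numpy as np
import theano
import theano.tensor as T

# Tiny stand-in graph; with a real model you would build/compile it here
# and run model.fit(X, y, nb_epoch=1) instead.
x = T.matrix("x")
w = theano.shared(np.random.rand(16, 8).astype(theano.config.floatX))
f = theano.function([x], T.dot(x, w))
f(np.random.rand(4, 16).astype(theano.config.floatX))

# Theano prints the per-op / per-apply profile when the process exits.
```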
We merged a speed up in scan last week, so maybe updating Theano can speed things up. It was giving up to a 10% speed up, depending on the use case.
Thank you @nouiz. That's good news.
It does seem from the output profile that, as @dbonadiman said, most of the time is indeed spent in the scan. Does this mean (thanks @nouiz) that there is nothing I can do to speed things up except updating Theano and hoping for a ~10% speed up?
I don't know how old your Theano version is; there have been other optimisations in the recent past, so maybe you can gain something more.
In my model, which uses 2 LSTMs, I obtained a speedup from 38-40s per epoch to 35s per epoch, i.e. the ~10% speed up pointed out by @nouiz.
If you cannot live with these times, you can try to unroll the scan as is done in the Lasagne library (see the sketch below), but it only works in some cases and you need to partially modify Keras.
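To give an idea of what that unrolling means, here is a rough sketch (the names unrolled_recurrence, step_fn and n_steps are illustrative, not Keras or Lasagne code), assuming the sequence length is fixed and known when the graph is built:

```python
import theano.tensor as T

def unrolled_recurrence(x, h0, step_fn, n_steps):
    """Build the recurrence as an explicit Python loop instead of theano.scan.

    x       : symbolic (time, batch, features) tensor, with time == n_steps
    h0      : symbolic initial hidden state of shape (batch, hidden)
    step_fn : function (x_t, h_prev) -> h_t building one timestep of the graph
    n_steps : plain Python int, fixed at graph-construction time
    """
    h = h0
    hs = []
    for t in range(n_steps):   # Python loop: the step graph is replicated n_steps times
        h = step_fn(x[t], h)
        hs.append(h)
    # stack the per-step outputs back into a (time, batch, hidden) tensor
    return T.concatenate([h_t.dimshuffle('x', 0, 1) for h_t in hs], axis=0)
```

The unrolled graph avoids the scan overhead but is much larger (slower to compile) and ties you to a fixed sequence length, which is why it only helps in some cases.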
If your Theano is a few months old, you could get more speed up from an update. We made very big speed ups in scan in the last ~6 months.
But the problem isn't scan itself: you spend 93% of your time in scan, yet scan adds only a 2% overhead. Inside scan, you spend 80% of your time in gemm, so the gemms are your real bottleneck, not scan.
Still, there is a trick that can be done. In the Deep Learning Tutorials LSTM example [1], the gate weights are bundled into a single shared variable so that one big gemm is done instead of a few smaller ones (roughly as sketched below). I don't remember the exact numbers, but it made a significant difference.
There is also another thread/PR to Keras that, from my quick reading of Keras-related emails, gave a 4x speed up to a model using scan. Check it out; maybe you can reuse the same tricks.
[1]http://deeplearning.net/tutorial/lstm.html#lstm
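For what it's worth, the bundling trick from [1] looks roughly like the sketch below (the sizes and the variable names W, U, b are illustrative, not the tutorial's exact code):

```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
n_in, n_hid = 128, 256
rng = np.random.RandomState(0)

# All four gates (input, forget, output, candidate) share one weight matrix,
# so each step does one recurrent gemm instead of four smaller ones.
W = theano.shared(rng.normal(scale=0.01, size=(n_in, 4 * n_hid)).astype(floatX))
U = theano.shared(rng.normal(scale=0.01, size=(n_hid, 4 * n_hid)).astype(floatX))
b = theano.shared(np.zeros(4 * n_hid, dtype=floatX))

def _slice(p, i):
    # pick the i-th gate block out of the bundled pre-activation
    return p[:, i * n_hid:(i + 1) * n_hid]

def step(xw_t, h_prev, c_prev):
    # xw_t is the precomputed input projection for this timestep
    preact = xw_t + T.dot(h_prev, U)
    i = T.nnet.sigmoid(_slice(preact, 0))
    f = T.nnet.sigmoid(_slice(preact, 1))
    o = T.nnet.sigmoid(_slice(preact, 2))
    g = T.tanh(_slice(preact, 3))
    c = f * c_prev + i * g
    h = o * T.tanh(c)
    return h, c

x = T.tensor3("x")                      # (time, batch, n_in)
xw = T.dot(x, W) + b                    # input gemm hoisted out of the loop entirely
h0 = T.zeros((x.shape[1], n_hid), dtype=floatX)
c0 = T.zeros((x.shape[1], n_hid), dtype=floatX)
(h_seq, c_seq), updates = theano.scan(step, sequences=xw, outputs_info=[h0, c0])
```

Note that the input projection is also hoisted out of scan entirely, since it does not depend on the recurrent state.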
@pranv mentioned a 4x speedup in the PR with Neural Turing Machines.
Following up on @elanmart's comment: @pranv's suggestion was the same as the one proposed by @nouiz, which is to concatenate the weights into one big tensor and run a single big gemm. @fchollet seemed interested in that approach in the discussion about generalized backends as well. At some point we will have to go back to our RNN models (LSTM, GRU, and Neural Turing Machines) and unify the multiplications for speed. This will make the code less readable, but people usually feel happier with performance, and good comments in the code can mitigate the problem.
But for now, @Vict0rSch, please try updating your Theano and let us know what happens.
It appears my version of Theano was not so old, since I didn't see any improvement after updating it... Thank you very much anyway; I'll keep you posted.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Hello, I am training a single LSTM layer with these parameters: number of examples: 27,000; sample size: (2500, 1), dtype=float64; target size: (250, 1), dtype=float64; batch size: 64; activation: linear.
I use a GeForce GTX 970 GPU and training is really slow: a single epoch takes 678 seconds. Any idea of what could be happening / how I can speed things up?
Thanks!