Closed Froskekongen closed 7 years ago
Looking at this now. At first glance it looks a little tricky, because part of the cuDNN optimizations needs to know exactly how many RNN layers there are in the model, and Keras doesn't (for example, you can append `LSTM()` layers to a `Sequential()` model as you go, or you could mix and match different RNN layer types, etc.).
`tf.contrib.cudnn_rnn.CudnnLSTM(num_layers, num_units, input_size)`
Perhaps a simple way of exposing TensorFlow's cuDNN RNNs would be to just add new layers to recurrent.py called CudnnLSTM, etc. @fchollet, thoughts?
I also assume these TF ops don't make much sense on a CPU, so it may not be too bad to make the `Cudnn` prefix visible in Keras' API, to make the GPU assumption extra obvious.
EDIT: Alternatively, during `model.compile()` one could go through the layers, automatically bucket any stacked RNNs of the same type, number of units, and input size, and just swap them with a corresponding `CudnnRNN`.
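For illustration, such a bucketing pass might look roughly like this. This is a minimal sketch over hypothetical `(type, units)` layer descriptors, not real Keras layer objects; a real implementation would inspect the actual layer instances and also match input sizes:

```python
from itertools import groupby

def bucket_stacked_rnns(layers):
    """Group consecutive runs of identical layer specs.

    Each run of identical RNN specs is a candidate for being swapped
    with a single multi-layer CudnnRNN op (hypothetical; sketch only).
    """
    buckets = []
    for spec, run in groupby(layers):
        buckets.append((spec, len(list(run))))
    return buckets

layers = [("LSTM", 256), ("LSTM", 256), ("LSTM", 256), ("GRU", 128), ("Dense", 10)]
print(bucket_stacked_rnns(layers))
# Three identical LSTMs collapse into one bucket with num_layers=3.
```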
Does it make any sense to start with an implementation that doesn't use the stacking optimization and just pass num_layers=1 into CudnnLSTM?
No. Stacking at least two RNNs is extremely common and one of the largest performance gains comes from utilizing that. See "Step 3" at https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/
Right, I saw step 3; but that accounts for a 1.7x improvement vs. the other 6.5x, unless I'm misreading something.
I think you should read the table again. :wink:
So, my understanding of this table:
Optimization | GFLOPS | Speedup |
---|---|---|
Baseline | 349 | (1.0x) |
Combined GEMMs | 724 | 2.1x |
GEMM Streaming | 994 | 2.8x |
Fused point-wise operations | 1942 | 5.5x |
Matrix pre-transposition | 2199 | 6.3x |
Combining Inputs | 2290 | 6.5x |
Four layers | 3898 | 11.1x |
Is it that the speedup column is cumulative, i.e. doing everything up to "Combining Inputs" gets you a 6.5x speedup compared to baseline, and then processing four layers at a time gets an additional ~1.7x, for an overall 11.1x speedup compared to baseline?
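Under the cumulative reading, the ~1.7x figure for layer stacking follows directly from the ratio of the last two cumulative speedups in the table:

```python
# Cumulative speedups over baseline, taken from the table above.
cumulative_speedups = {
    "Combining Inputs": 6.5,  # all single-layer optimizations applied
    "Four layers": 11.1,      # plus processing four layers at a time
}

# The incremental gain from the four-layer optimization is the ratio
# of the final cumulative speedup to the previous row's.
incremental = cumulative_speedups["Four layers"] / cumulative_speedups["Combining Inputs"]
print(round(incremental, 1))  # 1.7
```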
It's a 70% increase (or 4.6x faster than the baseline with the layer optimizations, although the optimizations might depend on each other), no?
Right, combining layers is a 70% increase; I'm not super clear on where you get the 4.6x from though.
I'm just suggesting we might get a good speedup from stacking CudnnLSTMs even if they can't do cross-layer speedups. Maybe it would be best to just benchmark that and see how it goes rather than trying to guess the interactions. Getting it to work on 1 layer is probably step 1 in any progress on the larger goal anyway.
Btw, do you know how representative the baseline is?
I'd prefer to wait on Theano (this week, hopefully) and think a Keras implementation through properly after having seen the mutual aspects of both backends.
How representative? I'd wager NVIDIA's native CUDA baseline is faster than Keras because it doesn't have to go through Python, but who knows?
The cuDNN speedup isn't related to Python or not. Theano and TensorFlow don't have Python code in the execution engine most of the time.
The speedup comes from many low-level optimizations. See the tables. In an ideal world, Theano/TensorFlow could do all of them automatically, but it is easier to do them once for a specific algorithm (or a few similar ones). This is what cuDNN RNN did.
It seems impossible to make this 100% compatible with our own `LSTM` implementation, so I would suggest adding a new Keras layer instead. In fact there are several TF ops that I think would benefit from a shallow Keras wrapping (including these), so we may want to start a `keras.tensorflow` module.
When Theano has its own version of it we can consider adding a cross-backend layer in `keras.layers.recurrent`.
Theano has now merged cuDNN v5 bindings: https://github.com/Theano/Theano/pull/4915
@fchollet, I propose adding a `StackedLSTM` layer that takes the number of layers as an additional parameter, in order to fully utilize the optimizations. Would such a PR be of interest?
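To make the intended equivalence concrete (the `StackedLSTM(units, num_layers=n)` API here is hypothetical, not an existing layer): a stack of n LSTMs with the same unit count should carry exactly the same weights as n separate chained `LSTM` layers, using the standard per-layer LSTM parameter count 4 * (units * input_dim + units * units + units):

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * input_dim + units * units + units)

def stacked_lstm_params(input_dim, units, num_layers):
    # Hypothetical StackedLSTM(units, num_layers=n): the first layer sees
    # input_dim, every subsequent layer sees the previous layer's units.
    total = lstm_params(input_dim, units)
    total += (num_layers - 1) * lstm_params(units, units)
    return total

separate = lstm_params(128, 256) + lstm_params(256, 256) + lstm_params(256, 256)
stacked = stacked_lstm_params(128, 256, num_layers=3)
assert separate == stacked
print(stacked)  # 1444864
```

The point is that such a layer changes only how the computation is scheduled (one fused cuDNN call instead of n sequential ones), not the number or shape of the weights.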
First of all it would be necessary to quantify what there is to gain in terms of performance.
Some benchmarks on the Theano side with and without cuDNN RNN:
https://github.com/MaximumEntropy/cudnn_rnn_theano_benchmarks/blob/master/README.md
The answer: yes, there is a speedup from using it.
Recently an implementation of accelerated RNNs using cuDNN v5 was exposed in TensorFlow: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cudnn_rnn
If these implementations are faster and good enough for use, they should be exposed in Keras.