keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Exposing Cudnn v5 RNN models (in tensorflow) from within keras #3692

Closed Froskekongen closed 7 years ago

Froskekongen commented 8 years ago

Recently an implementation of accelerated RNNs using cudnn v5 was exposed in tensorflow: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cudnn_rnn

If these implementations are faster and good enough for general use, they should be exposed in Keras.

carlthome commented 8 years ago

Looking at this now. At first glance it looks a little tricky because part of the cuDNN optimizations needs to know exactly how many RNN layers there are in the model, and Keras doesn't (for example, you can append LSTM() layers to Sequential() as you go, or mix and match different RNN layers, etc.).

tf.contrib.cudnn_rnn.CudnnLSTM(num_layers, num_units, input_size)

Perhaps a simple way of exposing TensorFlow's cuDNN RNNs would be to just add new layers to recurrent.py called CudnnLSTM, etc. @fchollet, thoughts?

I also assume these tf ops don't make much sense on a CPU, so perhaps it's not too bad to make the Cudnn prefix visible in Keras' API to make the GPU assumption extra obvious.
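
For reference, a rough usage sketch of the contrib op above (the constructor arguments come from the signature quoted earlier; the params buffer and call convention are my reading of the contrib module at the time and may well differ between TF versions):

```python
import tensorflow as tf

num_layers, num_units, input_size = 2, 256, 128
time_len, batch_size = 50, 32

# The whole stack is one fused op, so layer count, unit count and
# input size all have to be known when the op is constructed.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers, num_units, input_size)

# Every layer's weights live in a single opaque parameter buffer.
params = tf.Variable(tf.random_uniform([lstm.params_size()], -0.1, 0.1),
                     validate_shape=False)

inputs = tf.placeholder(tf.float32, [time_len, batch_size, input_size])
init_h = tf.zeros([num_layers, batch_size, num_units])
init_c = tf.zeros([num_layers, batch_size, num_units])

# Runs all layers over all timesteps in one call.
outputs, final_h, final_c = lstm(inputs, init_h, init_c, params)
```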

EDIT: Alternatively, during model.compile() one could go through the layers and bucket any stacked RNNs of the same type, number of units and input size automatically and just swap them with a corresponding CudnnRNN.
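
A minimal sketch of that compile-time pass, assuming Keras 1.x layer attributes (output_dim on LSTM) and leaving the actual swap out:

```python
from keras.layers import LSTM

def bucket_stacked_lstms(model):
    """Group runs of consecutive LSTM layers that could be fused
    into a single multi-layer cuDNN call."""
    buckets, current = [], []
    for layer in model.layers:
        if isinstance(layer, LSTM) and (
                not current or layer.output_dim == current[-1].output_dim):
            # For interior layers, matching unit counts also guarantees
            # matching input sizes, since each layer feeds the next.
            current.append(layer)
        else:
            if len(current) > 1:
                buckets.append(current)
            current = [layer] if isinstance(layer, LSTM) else []
    if len(current) > 1:
        buckets.append(current)
    # Each bucket could be replaced by one CudnnLSTM(num_layers=len(bucket), ...).
    return buckets
```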

kuza55 commented 8 years ago

Does it make any sense to start with an implementation that doesn't use the stacking optimization and just pass num_layers=1 into CudnnLSTM?

carlthome commented 8 years ago

No. Stacking at least two RNNs is extremely common and one of the largest performance gains comes from utilizing that. See "Step 3" at https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/

kuza55 commented 8 years ago

Right, I saw step 3, but that accounts for a 1.7x improvement vs. the other 6.5x, unless I'm misreading something.

carlthome commented 8 years ago

I think you should read the table again. :wink:

kuza55 commented 8 years ago

So, my understanding of this table:

| Optimization | GFLOPS | Speedup |
| --- | --- | --- |
| Baseline | 349 | (1.0x) |
| Combined GEMMs | 724 | 2.1x |
| GEMM Streaming | 994 | 2.8x |
| Fused point-wise operations | 1942 | 5.5x |
| Matrix pre-transposition | 2199 | 6.3x |
| Combining Inputs | 2290 | 6.5x |
| Four layers | 3898 | 11.1x |

That is, the speedup column is cumulative: doing everything up to "Combining Inputs" gets you a 6.5x speedup over baseline, and then processing four layers at a time gets you an additional ~1.7x, for an overall 11.1x speedup over baseline.
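
A quick sanity check of that reading against the GFLOPS column:

```python
# Ratios computed from the GFLOPS numbers in the table above.
print(2290 / 349.0)   # ~6.6x  -> "Combining Inputs" vs. baseline
print(3898 / 2290.0)  # ~1.7x  -> extra gain from processing four layers at once
print(3898 / 349.0)   # ~11.2x -> "Four layers" vs. baseline
```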

carlthome commented 8 years ago

It's a 70% increase (or 4.6x faster than the baseline with the layer optimizations, although the optimizations might depend on each other), no?

kuza55 commented 8 years ago

Right, combining layers is a 70% increase; I'm not super clear on where you get the 4.6x from though.

I'm just suggesting we might get a good speedup from stacking CudnnLSTMs even if they can't do cross-layer speedups. Maybe it would be best to just benchmark that and see how it goes rather than trying to guess the interactions. Getting it to work on 1 layer is probably step 1 in any progress on the larger goal anyway.
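
For a concrete baseline in such a benchmark, here is a minimal sketch that times Keras' existing stacked LSTM with made-up shapes (the cuDNN variant would be timed the same way once it is callable from Keras):

```python
import time
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

timesteps, features, units, batch = 100, 64, 256, 32

model = Sequential()
model.add(LSTM(units, return_sequences=True, input_shape=(timesteps, features)))
model.add(LSTM(units, return_sequences=True))
model.compile(optimizer='rmsprop', loss='mse')

x = np.random.rand(batch, timesteps, features).astype('float32')
y = np.random.rand(batch, timesteps, units).astype('float32')

model.train_on_batch(x, y)  # warm up (graph construction, cuDNN autotuning)
start = time.time()
for _ in range(20):
    model.train_on_batch(x, y)
print('seconds per batch:', (time.time() - start) / 20)
```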

Btw, do you know how representative the baseline is?

carlthome commented 8 years ago

I'd prefer to wait on Theano (this week, hopefully) and think a Keras implementation through properly after having seen the mutual aspects of both backends.

How representative? I'd wager NVIDIA's native CUDA baseline is faster than Keras because it doesn't have to go through Python, but who knows?

nouiz commented 8 years ago

The cuDNN speedup isn't related to Python or not: Theano and TensorFlow don't have Python code in the execution engine most of the time.

The speedup comes from many low-level optimizations. See the tables. In an ideal world, Theano/TensorFlow could do all of them automatically, but it is easier to do them once for a specific algorithm (or a few similar ones). This is what cuDNN RNN did.

fchollet commented 8 years ago

It seems impossible to make this 100% compatible with our own LSTM implementation, so I would suggest adding a new Keras layer instead. In fact there are several TF ops that I think would benefit from a shallow Keras wrapping (including these), so we may want to start a keras.tensorflow module.
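
To illustrate how shallow such a wrapping could be, here is a rough sketch of a Keras layer delegating to the contrib op, written against the Keras 1.x custom-layer API (build/call/get_output_shape_for); the parameter-buffer handling, initial states and cuDNN's time-major layout are simplified, and the contrib call convention is an assumption:

```python
import tensorflow as tf
from keras.engine.topology import Layer


class CudnnLSTM(Layer):
    """Thin Keras wrapper around tf.contrib.cudnn_rnn.CudnnLSTM (sketch only)."""

    def __init__(self, num_layers, num_units, **kwargs):
        self.num_layers = num_layers
        self.num_units = num_units
        super(CudnnLSTM, self).__init__(**kwargs)

    def build(self, input_shape):
        input_size = input_shape[-1]
        self.cudnn_lstm = tf.contrib.cudnn_rnn.CudnnLSTM(
            self.num_layers, self.num_units, input_size)
        # One opaque buffer holds the weights of every layer in the stack.
        self.params = tf.Variable(
            tf.random_uniform([self.cudnn_lstm.params_size()], -0.1, 0.1),
            validate_shape=False)
        self.trainable_weights = [self.params]
        super(CudnnLSTM, self).build(input_shape)

    def call(self, x, mask=None):
        # cuDNN expects time-major input: (time, batch, features).
        x = tf.transpose(x, [1, 0, 2])
        batch_size = tf.shape(x)[1]
        zeros = tf.zeros(tf.stack([self.num_layers, batch_size, self.num_units]))
        outputs, _, _ = self.cudnn_lstm(x, zeros, zeros, self.params)
        return tf.transpose(outputs, [1, 0, 2])  # back to batch-major

    def get_output_shape_for(self, input_shape):  # compute_output_shape in Keras 2
        return (input_shape[0], input_shape[1], self.num_units)
```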

When Theano has its own version of it we can consider adding a cross-backend layer in keras.layer.recurrent.

zpardos commented 8 years ago

Theano has now merged Cudnn v5 bindings https://github.com/Theano/Theano/pull/4915

carlthome commented 8 years ago

@fchollet, I propose adding a StackedLSTM layer that takes the number of layers as an additional parameter in order to fully utilize the optimizations. Would such a PR be of interest?
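
For concreteness, the proposed layer might be used something like this (StackedLSTM and its num_layers argument are hypothetical, they do not exist in Keras; this is just the interface being suggested):

```python
from keras.models import Sequential
# from keras.layers import StackedLSTM  # hypothetical, proposed above

model = Sequential()
# One layer object describes the whole fused stack, so the TensorFlow
# backend can hand all of it to cuDNN in a single CudnnLSTM call.
model.add(StackedLSTM(output_dim=256, num_layers=3,
                      return_sequences=True, input_shape=(100, 64)))
```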

fchollet commented 7 years ago

First of all it would be necessary to quantify what there is to gain in terms of performance.

nouiz commented 7 years ago

Some benchmarks on the Theano side with and without cuDNN RNN:

https://github.com/MaximumEntropy/cudnn_rnn_theano_benchmarks/blob/master/README.md

The answer: yes, there is a speedup from using it.
