
Add dropout and recurrent_dropout to CuDNNLSTM and CuDNNGRU #8935

Closed. bzamecnik closed this issue 6 years ago.

bzamecnik commented 6 years ago

Native Keras GRU and LSTM layers support dropout and recurrent_dropout, but their CuDNN-accelerated counterparts, CuDNNLSTM and CuDNNGRU, do not. It would be good to add these features. Although CuDNN RNNs do not support dropout natively, it seems possible to implement it outside of CuDNN; at least TensorFlow manages to do so. In Keras, dropout can be applied either to the inputs (dropout), which should be straightforward, or to the previous hidden state (recurrent_dropout); I'm not sure whether the latter is possible, though.

The motivation is to use the CuDNN RNN implementation for fast training while still allowing dropout regularization.

Please comment on whether this makes sense and is wanted. I'd be happy to try implementing it. Thanks.
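
For reference, a minimal sketch of the current asymmetry between the two layer families (layer sizes here are arbitrary, purely for illustration):

from keras.layers import LSTM, CuDNNLSTM

# The native layer already exposes both kinds of dropout:
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)

# The cuDNN-backed layer currently accepts neither argument:
fast_lstm = CuDNNLSTM(128, return_sequences=True)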

whatever1983 commented 6 years ago

This is desperately needed. Also, in layer_CuDNN_LSTM, dtype = float16 is not enabled for NVIDIA's CUDA 9.1 FP16 training.

tRosenflanz commented 6 years ago

I guess applying Dropout(x)(inputs) before the LSTM layer will do the same as the dropout argument, right? Or do you think it could cause a slowdown?

ml-pickle commented 6 years ago

This would be tremendously helpful to many, many people. Not being able to use dropout often renders the CuDNN layers virtually useless for training on smaller datasets.

tRosenflanz commented 6 years ago

Recurrent dropout is still not supported by TensorFlow; if you would like to see it, please submit the request there. Input dropout can easily be achieved by manually adding a Dropout layer before the CuDNN RNN layer.
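
For example, a minimal sketch of that workaround (sizes are placeholders; note that, unlike the built-in dropout argument, a plain Dropout layer draws an independent mask per timestep, see the noise_shape discussion further down):

from keras.layers import Input, Dropout, CuDNNLSTM
from keras.models import Model

inputs = Input(shape=(None, 64))               # (timesteps, features)
x = Dropout(0.2)(inputs)                       # input dropout, applied outside the RNN
x = CuDNNLSTM(128, return_sequences=True)(x)
model = Model(inputs, x)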

bitmanlger commented 6 years ago

@tRosenflanz are you sure? https://github.com/tensorflow/tensorflow/issues/6466#issuecomment-339517889

tRosenflanz commented 6 years ago

If I am reading the tensorflow thread right, it says that the dropout they support is applied only between layers, not to the hidden states that get passed from step to step within one layer. As far as I understand, the dropout they support is equivalent to adding a Dropout layer yourself.

bitmanlger commented 6 years ago

@tRosenflanz Sorry, you're right. http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnDropoutDescriptor_t states that "dropout will be applied between layers", and this applies to the latest cuDNN. :( ref: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/cudnn_rnn/kernels/cudnn_rnn_ops.cc#L549

fchollet commented 6 years ago

Recurrent dropout is not implemented in the cuDNN RNN ops at the cuDNN level, so we can't have it in Keras.

The dropout option in the cuDNN API is not recurrent dropout (unlike what is in Keras), so it is basically useless (regular dropout doesn't work with RNNs).

Actually using such dropout in a stacked RNN will wreck training.

smyskoff commented 6 years ago

Will time-distributed dropout solve the problem? Something like this:

...
for idx in range(num_layers):
    top_layer = idx == num_layers - 1
    # All layers return full sequences; in particular the non-top layers must,
    # so that the time-distributed dropout and the next CuDNNLSTM receive 3D input.
    layer = CuDNNLSTM(..., return_sequences=True)(layer)
    if not top_layer:
        layer = TimeDistributed(Dropout(dropout))(layer)

...
smyskoff commented 6 years ago

Oh, it looks like it's even simpler, according to the documentation: https://keras.io/layers/core/#dropout

noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features).
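
So, assuming the goal is to mimic the built-in dropout argument (one mask reused for all timesteps), a sketch might look like this (sizes are placeholders; newer Keras versions accept None for the batch axis in noise_shape, and SpatialDropout1D would achieve the same effect):

from keras.layers import Input, Dropout, CuDNNLSTM
from keras.models import Model

inputs = Input(shape=(100, 64))                       # (timesteps, features)
# One dropout mask per sample, shared across all timesteps:
x = Dropout(0.2, noise_shape=(None, 1, 64))(inputs)
x = CuDNNLSTM(128)(x)
model = Model(inputs, x)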

tRosenflanz commented 6 years ago

This will not produce recurrent dropout. It applies dropout between layers of the network, while recurrent dropout works on the states that are passed within the same layer. Since the CuDNN layer works by calling the cuDNN kernel directly and doesn't rely on the cell implementation, the Keras team cannot do anything about it.

evictor commented 6 years ago

@fchollet can you elaborate on these comments:

regular dropout doesn't work with RNNs

Actually using such dropout in a stacked RNN will wreck training.

changbinlu commented 6 years ago

Any updates on this problem? The built-in dropout truly wrecks training.

rsmith49 commented 6 years ago

This paper mentions using DropConnect (dropout applied to the weights instead of the state vector) on the recurrent weights of an LSTM in order to get some dropout without changing the cuDNN implementation. They say that for each training batch they apply dropout to the weights before the forward and backward propagation, and repeat for the next batch. From the paper:

We propose the use of DropConnect (Wan et al., 2013) on the recurrent hidden to hidden weight matrices which does not require any modifications to an RNN’s formulation. As the dropout operation is applied once to the weight matrices, before the forward and backward pass, the impact on training speed is minimal and any standard RNN implementation can be used, including inflexible but highly optimized black box LSTM implementations such as NVIDIA’s cuDNN LSTM.

By performing DropConnect on the hidden-to-hidden weight matrices [Ui,Uf,Uo,Uc] within the LSTM, we can prevent overfitting from occurring on the recurrent connections of the LSTM. This regularization technique would also be applicable to preventing overfitting on the recurrent weight matrices of other RNN cells.

Is there any interest in implementing this as an option? I am not totally familiar with how dropout is applied in the Model and Sequential classes, but hopefully this would not be too hard to implement.
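
If anyone wants to experiment before such an option exists, here is a rough sketch of the idea as a Keras callback. The class name and all details are hypothetical and not part of Keras, and the per-batch get_weights/set_weights round trip through host memory is slow; this is only meant to illustrate the mechanics, not to be a faithful or efficient implementation of the paper:

import numpy as np
from keras.callbacks import Callback

class RecurrentDropConnect(Callback):
    """Hypothetical sketch: DropConnect on the recurrent kernel of a CuDNNLSTM.

    Before each training batch, a random subset of the hidden-to-hidden
    weights is zeroed; after the batch, the dropped entries are restored so
    their stored values are not destroyed by spurious gradient updates.
    """

    def __init__(self, layer, drop_rate=0.5):
        super(RecurrentDropConnect, self).__init__()
        self.layer = layer          # a built CuDNNLSTM layer instance
        self.drop_rate = drop_rate

    def on_batch_begin(self, batch, logs=None):
        # CuDNNLSTM weights are [kernel, recurrent_kernel, bias].
        kernel, recurrent, bias = self.layer.get_weights()
        self._saved_recurrent = recurrent.copy()
        self._mask = np.random.binomial(
            1, 1.0 - self.drop_rate, size=recurrent.shape).astype(recurrent.dtype)
        self.layer.set_weights([kernel, recurrent * self._mask, bias])

    def on_batch_end(self, batch, logs=None):
        kernel, recurrent, bias = self.layer.get_weights()
        # Keep the optimizer updates for surviving connections and restore
        # the previous values of the dropped ones.
        merged = np.where(self._mask > 0, recurrent, self._saved_recurrent)
        self.layer.set_weights([kernel, merged, bias])

Usage would be something like model.fit(x, y, callbacks=[RecurrentDropConnect(my_cudnn_lstm_layer, drop_rate=0.5)]), where my_cudnn_lstm_layer is the layer instance to regularize.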

brunoalano commented 6 years ago

@rsmith49 You can use the TensorLayer implementation [1] of DropConnect directly in Keras. There's an example of using Keras & TensorLayer together [2].

[1] http://tensorlayer.readthedocs.io/en/latest/modules/layers.html#dropconnect-dense-layer [2] https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_keras.py

scotthuang1989 commented 6 years ago

@fchollet when you say that:

Actually using such dropout in a stacked RNN will wreck training.

Do you refer to this paper?

rsmith49 commented 6 years ago

@brunoalano Do you know of any implementations of DropConnect applied to an LSTM layer? The link you provided only has DropconnectDenseLayer, and I did not find any in TensorLayer's recurrent.py.

moritzaugustin commented 6 years ago

untested implementation: https://github.com/andry9454/KerasDropconnect

rohankshir commented 5 years ago

^^ That implementation does not seem right.

OverLordGoldDragon commented 5 years ago

+1, would find useful

elisim commented 5 years ago

any progress?

DomnulCiotlausi commented 5 years ago

+1

solalatus commented 5 years ago

+1

zapaishchykova commented 5 years ago

+1

vb690 commented 4 years ago

+1

S-Abdelnabi commented 4 years ago

(Quoting @rsmith49's DropConnect comment above.)

I would like to ask whether there are further updates regarding this (DropConnect on the recurrent connections)? I tried to implement a custom recurrent_regularizer that calls tf.nn.dropout on the hidden-to-hidden weights, but I don't think it is working properly: the returned loss, for some reason, is an array of size (sequence_length, sequence_length*4).

icoxfog417 commented 4 years ago

Haste will be helpful for implementing this.

https://github.com/lmnt-com/haste

OverLordGoldDragon commented 4 years ago

@icoxfog417 Thanks for linking, they appear way ahead of TensorFlow on this.

doubleapple123 commented 4 years ago

Is it possible to use https://github.com/lmnt-com/haste on Windows 10 with tensorflow-gpu 2.0?

oren0e commented 4 years ago

Any progress or updates regarding implementing recurrent_dropout in tensorflow.keras? I have a performance problem as a result of this as well (detailed here: https://github.com/tensorflow/tensorflow/issues/40944)