keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Decouple LSTM input and recurrent parts #2698

Closed naure closed 7 years ago

naure commented 8 years ago

The RNN layers include a matrix multiplication of their input. However, when the input is a trainable embedding, this linear transformation is redundant in terms of both parameters and computation, and it actually creates an ill-defined optimization problem, since a trainable embedding followed by the input projection is a composition of two trainable linear maps. I suggest trimming the RNN implementations down to the recurrent part and reusing the Dense layer for the input transformation.

Example current stack:

- Input: N-vector
- LSTM input part: 4 gates at NxN
- LSTM recurrent part: NxN

The decoupling would allow the following use-case:

- Trainable Embedding: 4N-vector
- LSTM recurrent part: 4 gates at NxN

The existing full LSTM would be re-implemented as:

- Input: M-vector
- TimeDistributed(Dense()): 4NxM
- LSTM recurrent part: NxN

This moves all of the input-related code (weights, biases, regularizers, …) out of each RNN implementation, reusing the code of the Dense layer. It also allows experimenting with any other combination of layers.
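For concreteness, here is a minimal sketch of the three stacks above in Keras 1.x Sequential syntax. RecurrentOnlyLSTM is a hypothetical layer name (it does not exist in Keras): it would take a 4N-wide input that already holds the four gate pre-activations and keep only the recurrent NxN weights, so the two decoupled variants are left commented out.

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense, TimeDistributed

    N, M, vocab, maxlen = 128, 300, 20000, 40  # arbitrary example sizes

    # Current stack: the LSTM owns both the input projection (4 gates)
    # and the recurrent part.
    current = Sequential([
        Embedding(vocab, N, input_length=maxlen),
        LSTM(N),
    ])

    # Proposed use-case: the trainable embedding directly produces the 4N-vector
    # of gate pre-activations; the hypothetical RecurrentOnlyLSTM keeps only the
    # recurrent weights and slices its input into the four gates.
    # proposed = Sequential([
    #     Embedding(vocab, 4 * N, input_length=maxlen),
    #     RecurrentOnlyLSTM(N),
    # ])

    # The existing full LSTM, re-expressed with an explicit 4NxM input projection:
    # reimplemented = Sequential([
    #     Embedding(vocab, M, input_length=maxlen),
    #     TimeDistributed(Dense(4 * N)),
    #     RecurrentOnlyLSTM(N),
    # ])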

I implemented it for myself, and I heard others had the same need, so I can contribute it if you are interested in this change.

braingineer commented 8 years ago

So, the simple RNN is h = activation(K.dot(x, W) + K.dot(h_tm1, U)). You're suggesting dropping the K.dot(x, W) and keeping just the recurrent connection?

How would that work with LSTMs/GRUs which use the input to calculate the various gates?

aka, in the LSTM, x is used as

    # Keras 1.x LSTM: the per-gate input projections (B_W are the input dropout masks)
    x_i = K.dot(x * B_W[0], self.W_i) + self.b_i
    x_f = K.dot(x * B_W[1], self.W_f) + self.b_f
    x_c = K.dot(x * B_W[2], self.W_c) + self.b_c
    x_o = K.dot(x * B_W[3], self.W_o) + self.b_o

naure commented 8 years ago

The input is split into 4 parts for the LSTM (3 for the GRU):

    # x already has size 4 * output_dim: one slice per gate, no input weights needed
    x_i = x[:, :self.output_dim]
    x_f = x[:, self.output_dim: 2 * self.output_dim]
    x_c = x[:, 2 * self.output_dim: 3 * self.output_dim]
    x_o = x[:, 3 * self.output_dim:]

Corrected the post above.
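For reference, slicing a single 4N-wide projection is numerically identical to keeping four separate per-gate projections, because the wide weight matrix is just the four gate matrices concatenated. A small NumPy sketch with illustrative sizes:

    import numpy as np

    rng = np.random.RandomState(0)
    M, N = 6, 4                      # input dim, output dim per gate

    x = rng.randn(2, M)              # a batch of 2 input vectors
    W_i, W_f, W_c, W_o = (rng.randn(M, N) for _ in range(4))

    # Four separate per-gate projections (what the LSTM input part does today).
    separate = [x.dot(W) for W in (W_i, W_f, W_c, W_o)]

    # One wide projection followed by slicing (what TimeDistributed(Dense(4N))
    # feeding a recurrent-only LSTM would compute).
    W_all = np.concatenate([W_i, W_f, W_c, W_o], axis=1)   # shape (M, 4N)
    wide = x.dot(W_all)
    sliced = [wide[:, k * N:(k + 1) * N] for k in range(4)]

    for a, b in zip(separate, sliced):
        assert np.allclose(a, b)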

naure commented 8 years ago

There is no activation on the input, so it is indeed equivalent to TimeDistributed(Dense()). It is in fact currently implemented with time_distributed_dense().
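To make that concrete, here is a self-contained NumPy paraphrase (not a copy of the Keras source) of an LSTM step: the x_* inputs enter the gates linearly, and the sigmoid/tanh nonlinearities are only applied after the recurrent terms are added.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_i, x_f, x_c, x_o, h_tm1, c_tm1, U_i, U_f, U_c, U_o):
        # x_* are the (purely linear) input projections discussed above.
        i = sigmoid(x_i + h_tm1.dot(U_i))                    # input gate
        f = sigmoid(x_f + h_tm1.dot(U_f))                    # forget gate
        c = f * c_tm1 + i * np.tanh(x_c + h_tm1.dot(U_c))    # new cell state
        o = sigmoid(x_o + h_tm1.dot(U_o))                    # output gate
        h = o * np.tanh(c)                                   # new hidden state
        return h, c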

In the architecture that I actually use there is no input transformation; all inputs are trained embeddings.

On Thu, 12 May 2016 at 18:24, Carl Thomé notifications@github.com wrote:

@naure https://github.com/naure, I think this is a cool suggestion, but I don't see how this could be implemented. How would the LSTM without Dense() work then? Could you provide some illustrations? As usual, I'm a bit confused: #2673 https://github.com/fchollet/keras/issues/2673

By the way, is it really a linear transformation? The LSTM's default activation is tanh. Do you mean that TimeDistributed(Dense()) should apply the tanh instead?


braingineer commented 8 years ago

A couple more questions:

braingineer commented 8 years ago

also: I don't really know if I'm for or against it. I see no reason why it wouldn't work, but no reason why I'd personally use it.

You say

this linear transformation is redundant in terms of both parameters and computation, and actually creates an ill-defined optimization problem.

Is it really redundant? Is it really ill-defined?

Under your suggestion, Embedding -> TimeDistributed(Dense) -> RNN is the same thing as Embedding -> RNN in the current implementation. That doesn't seem redundant, just an extra layer of dimension transformation. So the only reason I would see to use it is if I didn't want that middle layer, for some computational, methodological, or theoretical reason.

Additionally, don't we lose the Bayesian dropout by doing it your way? It is applied to the inputs before the input projection. We could just add it to the Dense, but then, from an API standpoint, you're pushing extra bookkeeping into the TimeDistributed layer, correct?

    # (from time_distributed_dense() in Keras 1.x)
    if dropout is not None and 0. < dropout < 1.:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x = K.in_train_phase(x * expanded_dropout_matrix, x)

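One way to keep that behaviour in the decoupled setup would be to apply the same per-sample mask at every timestep before the TimeDistributed(Dense). This is only a sketch built around a hypothetical helper wrapped in a Lambda layer, not an existing Keras API:

    import keras.backend as K
    from keras.layers import Lambda

    def same_mask_dropout(x, p, input_dim, timesteps):
        # Reuse one dropout mask for all timesteps, mirroring the snippet above;
        # only active at training time.
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        mask = K.repeat(K.dropout(ones, p), timesteps)
        return K.in_train_phase(x * mask, x)

    # Hypothetical usage on an embedded sequence, before TimeDistributed(Dense(4 * N)):
    # h = Lambda(same_mask_dropout,
    #            arguments={'p': 0.5, 'input_dim': 128, 'timesteps': 40})(embedded)
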
stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.