naure closed this issue 7 years ago
So, the simple RNN is h = activation(K.dot(x, W) + K.dot(h_prev, U)).
You're suggesting dropping the K.dot(x, W) input projection
and keeping just the recurrent connection?
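For reference, that recurrence can be sketched in plain NumPy (names, shapes, and the bias term are mine, not lifted from the Keras source):

```python
import numpy as np

def rnn_step(x, h_prev, W, U, b, activation=np.tanh):
    """One simple-RNN step: input projection plus recurrent projection,
    then the activation."""
    return activation(x.dot(W) + h_prev.dot(U) + b)

rng = np.random.RandomState(0)
x = rng.randn(2, 5)        # batch of 2, input dim 5
h_prev = np.zeros((2, 3))  # hidden dim 3
W, U, b = rng.randn(5, 3), rng.randn(3, 3), np.zeros(3)
h = rnn_step(x, h_prev, W, U, b)
```

The proposal amounts to deleting the `x.dot(W)` term from this step and computing that projection (or not) outside the layer.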
How would that work with LSTMs/GRUs which use the input to calculate the various gates?
i.e., in the LSTM, x is used as:
x_i = K.dot(x * B_W[0], self.W_i) + self.b_i
x_f = K.dot(x * B_W[1], self.W_f) + self.b_f
x_c = K.dot(x * B_W[2], self.W_c) + self.b_c
x_o = K.dot(x * B_W[3], self.W_o) + self.b_o
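With those four projections in hand, the rest of an LSTM step only combines them with the recurrent terms; a minimal NumPy sketch (recurrent weights and names assumed, dropout and biases on the recurrence omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, x_f, x_c, x_o, h_prev, c_prev, U_i, U_f, U_c, U_o):
    """Combine precomputed input projections with the recurrent projections."""
    i = sigmoid(x_i + h_prev.dot(U_i))                    # input gate
    f = sigmoid(x_f + h_prev.dot(U_f))                    # forget gate
    c = f * c_prev + i * np.tanh(x_c + h_prev.dot(U_c))   # new cell state
    o = sigmoid(x_o + h_prev.dot(U_o))                    # output gate
    h = o * np.tanh(c)
    return h, c

rng = np.random.RandomState(1)
dim = 4
x_parts = [rng.randn(2, dim) for _ in range(4)]   # x_i, x_f, x_c, x_o
U = [rng.randn(dim, dim) for _ in range(4)]
h0, c0 = np.zeros((2, dim)), np.zeros((2, dim))
h1, c1 = lstm_step(*x_parts, h0, c0, *U)
```

Note that x itself only ever enters through the four precomputed projections, which is what makes them candidates for an external Dense layer.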
The input is split into 4 parts for the LSTM and 3 for the GRU:
x_i = x[:, :self.output_dim]
x_f = x[:, self.output_dim: 2 * self.output_dim]
x_c = x[:, 2 * self.output_dim: 3 * self.output_dim]
x_o = x[:, 3 * self.output_dim:]
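That splitting presumes the four input projections were fused into one wide matrix product beforehand; a NumPy sketch of the fused version (names assumed):

```python
import numpy as np

rng = np.random.RandomState(2)
output_dim, input_dim = 3, 5
x = rng.randn(2, input_dim)

# separate per-gate weight matrices, as in the unfused version
W_i, W_f, W_c, W_o = (rng.randn(input_dim, output_dim) for _ in range(4))

# fused: concatenate along the output axis, do one matmul, then slice
W_all = np.concatenate([W_i, W_f, W_c, W_o], axis=1)
z = x.dot(W_all)
x_i = z[:, :output_dim]
x_f = z[:, output_dim:2 * output_dim]
x_c = z[:, 2 * output_dim:3 * output_dim]
x_o = z[:, 3 * output_dim:]
```

Each slice is exactly the corresponding per-gate projection, so a single Dense (or embedding) of width 4 * output_dim can stand in for all four input matrices.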
Corrected the post above.
There is no activation on the input, so it is indeed equivalent to TimeDistributed(Dense()). It is in fact currently implemented with time_distributed_dense().
In the architecture that I actually use, there is no input transformation, all inputs are embeddings that are trained.
On Thu, May 12, 2016 at 18:24, Carl Thomé notifications@github.com wrote:
@naure https://github.com/naure, I think this is a cool suggestion, but I don't see how this could be implemented. How would the LSTM without Dense() work then? Could you provide some illustrations? As usual, I'm a bit confused: #2673 https://github.com/fchollet/keras/issues/2673
By the way, is it really a linear transformation? The LSTM activations default is a tanh. Do you mean that TimeDistributed(Dense()) should apply the tanh instead?
A couple more questions:
Also: I don't really know whether I'm for or against it. I see no reason why it wouldn't work, but also no reason why I'd personally use it.
You say
this linear transformation is redundant in terms of both parameters and computation, and actually creates an ill-defined optimization problem.
Is it really redundant? Is it really ill-defined?
Under your suggestion, Embedding -> TimeDistributed(Dense) -> RNN
is the same thing as Embedding -> RNN
in the current implementation. That doesn't seem redundant, just an extra layer of dimension transformation. So the only reason I can see to use it is if I didn't want that middle layer, for some computational, methodological, or theoretical reason.
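The redundancy claim is worth making concrete: when the input is a trainable embedding lookup, any linear input transformation can be folded into the embedding table itself, so the two parameter sets are not separately identifiable. A NumPy illustration (toy sizes, names mine):

```python
import numpy as np

rng = np.random.RandomState(3)
vocab, emb_dim, units = 10, 6, 4
E = rng.randn(vocab, emb_dim)    # trainable embedding table
W = rng.randn(emb_dim, units)    # the RNN's input transformation

tokens = np.array([1, 4, 7])
# embedding lookup followed by the linear input transformation ...
out1 = E[tokens].dot(W)
# ... equals a lookup in a pre-multiplied table: same function family,
# one matmul fewer and fewer free parameters
E_folded = E.dot(W)
out2 = E_folded[tokens]
```

Since infinitely many (E, W) pairs yield the same E_folded, the optimizer has a flat direction, which is the sense in which the problem is ill-defined.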
Additionally, don't we lose the Bayesian dropout by doing it your way? It's applied to the inputs prior to the weight multiplication. We could just add it into the Dense, but then from an API standpoint you're pushing extra bookkeeping into the TimeDistributed layer, correct?
if dropout is not None and 0. < dropout < 1.:
# apply the same dropout pattern at every timestep
ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
dropout_matrix = K.dropout(ones, dropout)
expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
x = K.in_train_phase(x * expanded_dropout_matrix, x)
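The effect of that snippet is that one dropout mask per sample is reused at every timestep; in NumPy terms (scaling follows the usual inverted-dropout convention, names mine):

```python
import numpy as np

rng = np.random.RandomState(4)
batch, timesteps, input_dim = 2, 5, 8
x = rng.randn(batch, timesteps, input_dim)

p = 0.5  # drop probability
# one mask per sample, broadcast over the time axis so every
# timestep sees the same dropped features
mask = (rng.rand(batch, 1, input_dim) >= p) / (1.0 - p)
x_dropped = x * mask
```

A plain Dropout layer inside a TimeDistributed wrapper would instead resample the mask per timestep, which is the bookkeeping concern raised above.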
The RNN layers include a matrix multiplication of their input. However when the input is a trainable embedding, this linear transformation is redundant in terms of both parameters and computation, and actually creates an ill-defined optimization problem. I suggest trimming down the RNN implementations to the recurrent part, and reusing the Dense layer for the input transformation.
Example current stack:
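In the Keras-1.x-era API, such a stack might look like the following (sizes assumed; the LSTM layer owns the input weights W_i/W_f/W_c/W_o in addition to the recurrence):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128))  # trainable embeddings
model.add(LSTM(output_dim=128))  # includes its own input transformation
```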
The decoupling would allow the following use-case:
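For instance, trained embeddings could feed the gates directly, with no input matrices at all. A hypothetical sketch (no recurrent-only layer exists in Keras; the `RecurrentLSTM` name here is illustrative):

```python
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
# the embedding is wide enough to supply all four gate pre-activations
model.add(Embedding(input_dim=10000, output_dim=4 * 128))
model.add(RecurrentLSTM(output_dim=128))  # hypothetical: recurrence only, no input weights
```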
The existing full LSTM would be re-implemented as:
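That is, a per-timestep dense projection feeding the recurrent-only layer; again a hypothetical sketch with the illustrative `RecurrentLSTM` name:

```python
from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128))
# one projection producing all four gate pre-activations,
# matching the 4-way split of the input shown earlier
model.add(TimeDistributed(Dense(4 * 128)))
model.add(RecurrentLSTM(output_dim=128))  # hypothetical recurrent-only layer
```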
This moves all input-related code (weights, biases, regularizers, …) out of each RNN implementation, reusing the code of the Dense layer. It also allows experimenting with any other combination of layers.
I implemented it for myself, and I heard others had the same need, so I can contribute it if you are interested in this change.