Closed pranv closed 7 years ago
I think this is a great idea! I posted about this in the Google group, too. I think a set of layers that support recurrent outputs would be awesome (or even a Recurrent model). For instance, I believe the current LSTM implementation backpropagates error derived from the last output on some classification task. While Keras does support outputting sequences at each recurrent timestep, to my knowledge, it doesn't support much other than simply feeding that into another RNN.
In practice, a lot of algorithms use the output at every timestep for the loss function directly. The way to train language models in Keras seems a bit strange, as a result. The examples I've seen are trained to predict the last character/word in a sequence, rather than at every timestep. What a lot of researchers do is train RNNs to predict the next output at each timestep, and the final loss function is a sum over all timesteps.
The LSTM in http://arxiv.org/pdf/1411.4555.pdf is a good example of this. At each timestep, a softmax is applied to the output, and the next word is predicted. The model's total loss function is designed to be a sum over all timesteps. I don't think something like this is currently possible, though I am fairly new to using Keras, so I could be wrong.
The model's total loss function is designed to be a sum over all timesteps. I don't think something like this is currently possible, though I am fairly new to using Keras, so I could be wrong.
It is definitely possible, you just have to return the full sequences and set your targets to be sequences as well.
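To make the difference between the two training setups concrete, here is a minimal, framework-free sketch of a loss summed over all timesteps versus a loss on the last timestep only. The function names (`softmax`, `sequence_loss`) and the toy logits are illustrative assumptions, not Keras API; in Keras the same effect comes from returning full sequences and supplying sequence-shaped targets, as described above.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one timestep's scores.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_loss(logits_per_step, targets):
    # Cross-entropy summed over every timestep (hypothetical helper).
    loss = 0.0
    for logits, target in zip(logits_per_step, targets):
        probs = softmax(logits)
        loss += -math.log(probs[target])
    return loss

# Toy example: 3 timesteps, vocabulary of 3 symbols.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]]
targets = [0, 1, 2]

total = sequence_loss(logits, targets)                # loss at every timestep
last_only = sequence_loss(logits[-1:], targets[-1:])  # loss at the last timestep only
```

With the summed loss, every timestep contributes a gradient signal, whereas the last-output setup only scores the final prediction.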
The recurrent model idea sounds interesting, but I'm not sure I get it. Here are a few questions:
what are the use cases?
Recurrent Convolution Network, or a really deep LSTM (with more complex gating functions). Maybe, in future, if we have some layer like AttentionWindow (I'm looking into it as well), it can be added to create a common encoder-decoder pair with attention (which has given state-of-the-art performance in a lot of tasks).
what would it make possible that would be impossible without (keep in mind that we will introduce the generic TimeDistributed layer soon)?
The aim is to have normal layers play a role in recurrent environment, without any considerations of a Recurrent environment
what is the role of our existing recurrent layers in this system?
The existing layers can be added to this model as well. But having gating mechanisms as activations could make them a bit out of order.
Edit: accidentally hit comment without completing. Fixed it.
The aim is to have normal layers play a role in recurrent environment, without any considerations of a Recurrent environment
Wouldn't that be handled by the TimeDistributed layer?
Of course, my mistake -- apologies.
As far as I understand, you're talking about stacking recurrent layers. I'm suggesting that we make each recurrent step a combination of computation of multiple layers.
On Mon, Jul 27, 2015 at 10:39 AM, jmhessel notifications@github.com wrote:
Of course, my mistake -- apologies.
Reply to this email directly or view it on GitHub: https://github.com/fchollet/keras/issues/443#issuecomment-125087639
See models like these for more reference:
@pranv Does that first link use a deep network for gating? It's hard to tell from skimming exactly what's getting built in https://github.com/kyunghyuncho/dl4mt-material/blob/master/session2/nmt.py#L464.
Anyway, I think the idea of having arbitrary user-defined networks for computing gating values is interesting. Has that come up in any papers?
@iskandr I think all new RNNs can be implemented this way. A quick search on arxiv led to these papers by DeepMind:
See these other papers as well:
Basically any new RNN type that doesn't simply use stacked RNNs can be implemented with this approach.
I think I can come up with an implementation with a few hours of coding, based on some knowledge I've gained seeing the implementation of Graph models.
Would you like me to try?
I think it's a great idea. I thought about something similar in the context of sequence generators. Many new RNNs (like attention-enhanced multi-layer decoders) cannot be easily implemented in Keras, which is quite unfortunate.
@pranv are You looking into this? Maybe You could use some help?
RNNs in 2015 are Conv Nets in 2013.
Totally agree.
@elanmart I was hoping to get a 'go' from someone. Now that you have, I'll start work on it.
You looking into this? Maybe You could use some help?
Definitely
Do you have any concrete ideas about how the model should work and what should be implemented?
Also, I think that if Keras is to support continuous inputs (e.g. for language models), the main Model class has to be modified to allow shuffling the training data so that row i of batch k-1 is the predecessor of row i of batch k in the original input space.
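The batch layout described here can be sketched in a few lines. This is only an illustration of the arrangement, assuming a flat token list; `contiguous_batches` is a hypothetical helper, not something Keras provides.

```python
def contiguous_batches(tokens, batch_size, seq_len):
    # Split one long sequence into `batch_size` parallel streams, then cut
    # each stream into consecutive windows of `seq_len` items. Row i of
    # batch k continues exactly where row i of batch k-1 stopped, so a
    # stateful RNN can carry its hidden state from one batch to the next.
    n_per_row = len(tokens) // batch_size
    rows = [tokens[i * n_per_row:(i + 1) * n_per_row] for i in range(batch_size)]
    n_batches = n_per_row // seq_len
    return [[row[k * seq_len:(k + 1) * seq_len] for row in rows]
            for k in range(n_batches)]

# Toy corpus of 24 "tokens", 2 rows per batch, 3 timesteps per row.
batches = contiguous_batches(list(range(24)), batch_size=2, seq_len=3)
```

Note that ordinary per-epoch shuffling of samples would break this property, which is presumably why a change to the training loop would be needed.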
I think we should have a model that accepts a graph. Rather than just one feed-forward pass over the data, it's run over all time steps. It's very similar to other RNN models in Keras, except that the step function is a container of sorts, defined by the user.
I think LSTMs and GRU would have to be reduced to activations.
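As a rough illustration of the idea (not the proposed Keras API), the sketch below shows a container whose step function is simply the chain of layers the user added, run over all time steps the way a scan would. `RecurrentSketch`, `dense` and `tanh` are hypothetical names invented for this example, and plain Python lists stand in for tensors.

```python
import math

def dense(weights, bias):
    # Returns a layer computing W.v + b for a list-based vector v.
    def apply(vec):
        return [sum(w * v for w, v in zip(row, vec)) + b
                for row, b in zip(weights, bias)]
    return apply

def tanh(vec):
    return [math.tanh(v) for v in vec]

class RecurrentSketch:
    # Hypothetical container: the step function is defined by added layers.
    def __init__(self, init_state):
        self.init_state = list(init_state)
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def step(self, x_t, h_tm1):
        # One time step: feed [x_t, h_tm1] through every added layer in order.
        out = list(x_t) + list(h_tm1)
        for layer in self.layers:
            out = layer(out)
        return out

    def run(self, sequence):
        # Plays the role of scan: loop the step over all time steps.
        h = self.init_state
        outputs = []
        for x_t in sequence:
            h = self.step(x_t, h)
            outputs.append(h)
        return outputs

model = RecurrentSketch(init_state=[0.0])
model.add(dense([[1.0, 0.5]], [0.0]))  # maps [x_t, h_tm1] to one unit
model.add(tanh)                        # the "activation" of the step
outputs = model.run([[1.0], [0.0], [0.0]])
```

Here the output of the last layer becomes the next hidden state; a fuller design would need named states and inputs, as discussed later in the thread.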
@elanmart I didn't get the shuffling part though
@fchollet I need some advice to go ahead with this. I'll explain what I'm trying to achieve with this kind of model.
Consider you want to build a DRAW like model. The DRAW model has 4 unique parts:
It also has a sampler and other things, but let's just skip it for now.
Reader and Writer are attention windows, and that can be implemented as layers fairly easily. Encoder and Decoder are themselves Recurrent, and the whole model generated by these 4 components is also recurrent. There are MLPs before each RNN as well.
And other encoder-decoder architectures are similar - to my knowledge.
My idea to enable implementation of these kinds of models was to have a Recurrent model and a corresponding container. This particular implementation would be defined by:
Dense layer and an LSTM as an activation which will act as encoder and decoder
Recurrent model, along with attention windows.
Now, I have 2 simple questions:
How would you implement this? I keep getting back to a design wherein the model or container is just a wrapper for theano.scan.
What would the model/container accept? Graph containers? Layers?
@pranv I've been thinking about this as well.
I think we should have a Recurrent model+container. It could be trained on its own or used as a Layer, which is crucial for example in caption generation.
The Recurrent model should be as flexible as possible, therefore it should allow:
input and internal_state (for example attention)
Also, about statefulness: I think every model (container) should have a list of shared variables: internal_states. Of course the updates would have to be passed to the model by each layer with statefull == True and then added to the updates returned by the optimizer.
I think a Recurrent model is essential for Keras, since the currently implemented recurrent layers are not really useful for new architectures.
@elanmart Agreed!
I just need to wrap my head around what the user actually sees and how it works - as a coherent system, much like Sequential and Graphs.
Also, I don't think that there are other libraries that do this well. Thus, this is really, really important.
@pranv I agree, other libraries are also deficient in this matter.
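Let's see. I think that:
Recurrent model will accept single layers, just as Graph does,
Every time a layer is added, user specifies: how is it computed given input and how is it computed given all previous internal_states.
Is the output of this layer stateful and returned to the parent Model.
Then we need to somehow plug this into scan.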
Edit: we would also somehow have to handle the prediction, where output is fed back as input step by step.
@elanmart
Recurrent model will accept single layers, just as Graph does
Won't this make the two models' code very redundant?
Every time a layer is added, user specifies: how is it computed given input and how is it computed given all previous internal_states.
Do we need internal states, or is the state taken out as output and fed back in as input? Also, a way to split inputs will be necessary if we go with the latter choice.
Is the output of this layer stateful and returned to the parent Model.
This is a good idea. Stateful is better than not.
Do you see this as a valuable thing?
I'm still not sure what this container/model could do that the upcoming TimeDistributed layer wouldn't do. TimeDistributed would be a layer-level implementation of what you're describing (a wrapper for scan). It allows using any Keras layer in a recurrent architecture (but layers that make use of past states will still need to be implemented as recurrent layers, like today).
Adding only a layer to the codebase is much simpler and has a much lower risk of issues than adding an entire new container. What is the justification for the model?
I think the fundamental issue here is the notion of computational containment. In Keras all computation happens in layers, which makes it easy to write, understand and debug. This model seems to be proposing to distribute computation outside of layers as well, in the container (which is supposed to be a dumb data router). Specifically the following:
Every time a layer is added, user specifies: how is it computed given input and how is it computed given all previous internal_states.
How would you implement this? I keep getting back to a design wherein the model or container is just a wrapper for theano.scan.
Why not a wrapper for theano.scan? What is this supposed to do that scan doesn't?
What would the model/container accept? Graph containers? Layers?
Containers and layers have the same API. New containers can be built using either layers or Sequential/Graph containers (with Graph containers being limited to 1 input and 1 output in this use case).
I'm still not sure what this container/model could do that the upcoming TimeDistributed layer wouldn't do. TimeDistributed would be a layer-level implementation of what you're describing (a wrapper for scan).
This is not aimed at making each layer TimeDistributed. This is much more like making the Graph container time distributed. Take the above example of the DRAW architecture: how would you implement such a thing using TimeDistributed layers?
Point: The exact computation at each step of the RNN is a combination of multiple layers' worth of computation.
If the Graph container is being made Time Distributed, similar things can be achieved with it
If the Graph container is being made Time Distributed, similar things can be achieved with it
Using the TimeDistributed layer, any Graph (1-input and 1-output) or Sequential container can be made time-distributed. It would also be possible to extend the API to generalize the application to multi-output, multi-input Graphs as well. Does this answer your concerns?
Does this answer your concerns?
Yes. This idea will be unnecessary and redundant then.
Would it? To be honest, I'm still not sure I get it, so I can't be sure that the TimeDistributed layer (again, it would just be a wrapper for scan) is really equivalent to what you propose. So, are you sure about it?
In any case, it's important that you understand why the TimeDistributed layer could also apply to any container (as long as the 1-input 1-output rule is respected): because layers and containers implement the same API. They are completely interchangeable. In the future we might get rid of the 1-input 1-output rule by extending the layer API to multiple inputs and outputs (like for Graph in the general case).
@fchollet, I'll post some code for an architecture based on a hypothetical recurrent model. Maybe it'll help you understand what we're trying to do with this.
I'll keep the issue open so that maybe someone else has a take on this. @wxs, @soumith please do have a look at this.
@fchollet This in Torch does something similar - https://github.com/Element-Research/rnn/blob/master/README.md#recurrent
A DRAW implementation in theano at https://github.com/jbornschein/draw.
@recurrent(sequences=['u'], contexts=['x'],
           states=['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'],
           outputs=['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'])
def apply(self, u, c, h_enc, c_enc, z, kl, h_dec, c_dec, x):
    x_hat = x-T.nnet.sigmoid(c)
    r = self.reader.apply(x, x_hat, h_dec)
    i_enc = self.encoder_mlp.apply(T.concatenate([r, h_dec], axis=1))
    h_enc, c_enc = self.encoder_rnn.apply(states=h_enc, cells=c_enc, inputs=i_enc, iterate=False)
    z, kl = self.sampler.sample(h_enc, u)
    i_dec = self.decoder_mlp.apply(z)
    h_dec, c_dec = self.decoder_rnn.apply(states=h_dec, cells=c_dec, inputs=i_dec, iterate=False)
    c = c + self.writer.apply(h_dec)
    return c, h_enc, c_enc, z, kl, h_dec, c_dec
The whole thing is recurrent. The hypothetical Keras equivalent of the encoder segment:
model = Recurrent()
model.add_input(name="x")
model.add_states(['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'])  # states are passed to each recursive call
model.add(Activation('sigmoid'), input='c', name='sig_c')
model.add(ErrorLayer(), inputs=['x', 'sig_c'], name='x_hat')
model.add(AttentionReader(), inputs=['x', 'x_hat', 'h_dec'], name='r')
model.add(Dense(num_inputs, num_outputs), inputs=['r', 'h_dec'], merge_mode='concat')
model.add(LSTM(), name='h_enc')  # some modification to LSTM layer could be needed to return hidden state..
model.add(Sampler(), inputs=['h_enc', 'u'], name='z')
This is really crude. But I hope I could convey something
@fchollet considering a basic example of soft attention*, I can't really see how TimeDistributedDense would be useful.
When computing hidden state h for our RNN, we'd like to rely on a context vector c, which itself depends on h_tm1. I don't think this can be done with current Layers + proposed TimeDistributedDense?
I'd love to hear your thoughts on this, I might not fully understand how the TimeDistributedDense would work.
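The dependency being described (the context c is computed from h_tm1, and the new h is computed from c) can be sketched as follows. This is a deliberately simplified illustration: it scores encoder states with a plain dot product rather than a learned MLP scorer, and `attention_context` and `decoder_step` are hypothetical names, not Keras functions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(h_tm1, encoded):
    # Score each encoder state against the previous hidden state
    # (dot-product scoring is an assumption made for brevity).
    scores = [sum(h * e for h, e in zip(h_tm1, enc)) for enc in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [sum(w * enc[d] for w, enc in zip(weights, encoded))
               for d in range(dim)]
    return context, weights

def decoder_step(h_tm1, encoded):
    # The new state depends on the context, which depends on h_tm1.
    # This loop through the previous state is the recurrence that a
    # per-timestep wrapper alone cannot express.
    context, _ = attention_context(h_tm1, encoded)
    return [math.tanh(h + c) for h, c in zip(h_tm1, context)]

encoded = [[1.0, 0.0], [0.0, 1.0]]  # two encoder states of size 2
context, weights = attention_context([0.0, 0.0], encoded)
h = decoder_step([0.0, 0.0], encoded)
```

Because the attention weights must be recomputed at every step from the state produced at the previous step, the whole computation has to sit inside the recurrent loop.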
This is really crude. But I hope I could convey something
Actually this attempt at sketching out an API is 10x more informative than the previous discussion. I think I get it now. Thanks!
I do believe this can be implemented via Graph models and TimeDistributedDense (as could any other recurrent model). The general mechanism is that all processing that is synchronous (every timestep is processed in the same way, stochastically) will be wrapped into TimeDistributedDense, and all processing that is non-synchronous (dependency on past time steps, i.e. recurrence) would have to happen inside a recurrent layer. In this example we will have to develop custom recurrent layers to make DRAW work.
However, I do believe that what you propose has value, essentially UX value. It introduces the notion of state, which is potentially useful. Also it would allow creating advanced recurrent models without having to develop as many custom recurrent layers (however the complexity will simply be passed on to the model architecture, so it's not clear yet if it's a gain).
When computing hidden state h for our RNN, we'd like to rely on a context vector c, which itself depends on h_tm1. I don't think this can be done with current Layers + proposed TimeDistributedDense?
You would have to develop a Graph-like recurrent layer. It's no big deal.
TL;DR: I think this is interesting, I'll think about it. Feel free to develop mock-APIs or actual code if you feel like it (though there is no guarantee that we will use it...).
@pranv Did you try to implement DRAW attention, or the soft attention that @elanmart mentioned, using the Keras API?
I checked Cho's soft-attention code ( https://github.com/kyunghyuncho/dl4mt-material ). I am trying to convert it to the Keras API, but it is not trivial to do.
Any help and advice?
Hi. I was thinking about the Recurrent model last weekend.
I think it would be nice to have it in Keras, since it could be backend-agnostic and ideally should be more convenient than implementing custom layers.
Regardless of whether @fchollet agrees or not, as an exercise (and for fun) I decided to implement it.
Before implementing the Recurrent model itself, I'd like to first get working multi-io layers though, since I think they would be extremely nice to have.
The actual code is at a very early stage, but the initial API I came up with goes something like this:
# --------------------- Create the encoder using standard Keras' functionality
encoder = Sequential()
encoder.add(Embedding(VOCAB_SIZE, 512))
encoder.add(GRU(input_dim=512, output_dim=512,
                bidirectional=True, return_sequences=True, mode='concat'))
encoder.add(GRU(input_dim=512, output_dim=512,
                bidirectional=True, return_sequences=True, mode='concat'))
# --------------------- Initialize the decoder
decoder_core = Recurrent()
decoder_core.add_input('encoded_sentence', sequence=False)
decoder_core.add_input('x', sequence=True)
decoder_core.add_state(name='h1', size=512, init='orthogonal',
                       learnable=True, statefull=False)
decoder_core.add_state(name='h2', size=512, init='orthogonal',
                       learnable=True, statefull=False)
# --------------------- Compute attention vectors
attention = Graph()
attention.add_input('conditioner')
attention.add_input('attended')
attention.add_node(RepeatConcat(),
                   inputs=['conditioner', 'attended'],
                   merge_mode='mapping',
                   input_map={'2d':'conditioner', '3d':'attended'},
                   name='concatter')
attention.add_node(TimeDistributedDense(1024, 256, activation='tanh'),
                   input='concatter',
                   name='att_hidden_layer')
attention.add_node(TimeDistributedDense(256, 1, activation='tanh'),
                   input='att_hidden_layer',
                   name='att_word_scorer')
attention.add_node(Permute(drop_axes=2, dimshuff=(0, 1,)),
                   input='att_word_scorer',
                   name='att_dimshuffle')
attention.add_node(Activation('softmax'),
                   input='att_dimshuffle',
                   name='att_softmax')
attention.add_node(WeightedSum(),
                   inputs=['att_softmax', 'attended'],
                   merge_mode='mapping',
                   input_map={'weights':'att_softmax', 'objects':'attended'},
                   name='context_creator')
attention.add_output(name='contexts', input='context_creator')
# --------------------- Compute the first hidden layer's state
h1 = Graph()
h1.add_input('x')
h1.add_input('h_tm1')
h1.add_input('contexts')
h1.add_node(Dense(input_dim=(EMB_SZ + HID_SZ + ATT_SZ),
                  output_dim=256,
                  activation='GRU'),
            inputs=['x','h_tm1','contexts'],
            merge_mode='concat',
            name='h2h')
h1.add_output(input='h2h',
              name='hid_update')
# --------------------- Compute the second hidden layer's state
h2 = Graph()
h2.add_input('x')
h2.add_input('h_tm1')
h2.add_node(Dense(input_dim=(2 * HID_SZ),
                  output_dim=256,
                  activation='GRU'),
            inputs=['x','h_tm1'],
            merge_mode='concat',
            name='h2h')
h2.add_output(input='h2h',
              name='hid_update')
# --------------------- Compose the core decoder
decoder_core.add_node(attention,
                      input_map={'conditioner':'h2_tm1',
                                 'attended':'encoded_sentence'},
                      name='attention')
decoder_core.add_node(h1,
                      input_map={'x':'encoded_sentence',
                                 'h_tm1':'h1_tm1',
                                 'contexts':'attention'},
                      name='h1')
decoder_core.add_node(h2,
                      input_map={'x':'h1',
                                 'h_tm1':'h2_tm1'},
                      name='h2')
# --------------------- Build the model used for training
decoder_learner = Sequential()
decoder_learner.add(decoder_core(return_sequences=True))
decoder_learner.add(TimeDistributedDense(input_dim=512,
                                         output_dim=TARGET_VOCAB_SZ,
                                         activation='softmax'))
decoder_learner.compile(optimizer=RMSprop(), loss='categorical_crossentropy', class_mode='categorical')
# --------------------- Build the model used for sampling
# Connect recurrent architectures vs build this shit later by copying weights from decoder_core?
decoder_sampler = Recurrent()
@elanmart Is there any update about your keras based nmt model code? looking forward to it.
@elanmart Can you suggest how to implement sequence classification in Keras, where each sequence has a different length, without padding?
@Sri-Harsha
I think padding is unavoidable in Keras, if I'm not mistaken.
But if you are using an Embedding layer, there is a mask_zero argument you can set to True to mask out the padding values.
It should help you deal with sequences of varying lengths.
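To illustrate what padding plus masking amounts to, here is a small framework-free sketch. `pad_with_mask` and `masked_mean` are hypothetical helpers written for this example; in Keras the mask is derived by the Embedding layer when mask_zero is set, which also means index 0 must be reserved for padding and cannot be a real vocabulary entry.

```python
def pad_with_mask(sequences, pad_value=0):
    # Right-pad every sequence to the longest length and record which
    # positions hold real data (1) versus padding (0).
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def masked_mean(values, mask_row):
    # Average only over the unmasked timesteps, so padding has no effect.
    kept = [v for v, m in zip(values, mask_row) if m]
    return sum(kept) / len(kept)

# Token ids start at 1 because 0 is reserved for padding.
padded, mask = pad_with_mask([[3, 7, 2], [5]])
mean_short = masked_mean(padded[1], mask[1])
```

The same idea applies to recurrent layers: masked timesteps are skipped so the padded tail does not change the final state.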
@jadore801120 I think you could have a separate model for each possible sequence length that share embedding/LSTM weights. Each batch could be formed from sequences of only a given length, and then a train_on_batch could be called for the corresponding model of the appropriate length. Masking might be more reasonable, though.
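The per-length batching suggested here could look roughly like the sketch below, which only groups the data; training each group (for example with one train_on_batch call per bucket) is left out. `bucket_by_length` is a hypothetical helper, not part of Keras.

```python
from collections import defaultdict

def bucket_by_length(sequences, labels):
    # Group samples so that every bucket holds sequences of one length
    # only; each bucket can then be fed as a batch without any padding.
    buckets = defaultdict(lambda: ([], []))
    for seq, label in zip(sequences, labels):
        xs, ys = buckets[len(seq)]
        xs.append(seq)
        ys.append(label)
    return dict(buckets)

buckets = bucket_by_length([[1, 2], [3], [4, 5], [6]], [0, 1, 1, 0])
for length in sorted(buckets):
    xs, ys = buckets[length]
    # A model built for this sequence length would be trained on (xs, ys) here.
```

The trade-off is that rare lengths produce very small batches, which is one reason masking may be the more practical option.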
@elanmart Hi, I wanted to know if you were able to complete the encoder-decoder network with attention? I'm stuck on how to add a Sequential layer to encode multiple sentences of a document into sentence vectors. I'm referring to the first two layers of the image below. Would you have any leads?
Hey,
I've been thinking about recurrent networks and how to implement them. I've looked at Keras's implementation and some other libraries. What I found (or understood) was that most offer a standard step function - be it LSTM or GRU or something else. One can usually vary the number of inputs and outputs and that's about it.
So I started thinking about how to allow an arbitrary amount and type of processing per time step, which led me to think about a Recurrent model and the corresponding Recurrent container. The basic idea is simple:
Recurrent model object, just like others
theano.scan over all time steps, except the step function being defined by added layers.
LSTM, GRU and other gating mechanisms would have to be reduced to activation layers (which they are). I think this would also help construction of things like Highway Networks.
I don't have a ton of experience in RNNs. So does this make any sense? Would it be useful? My understanding is that it'd really help support the wide variety of RNN-based neural networks cropping up almost every day. RNNs in 2015 are Conv Nets in 2013.