keras-team / keras

Deep Learning for humans
http://keras.io/

The Recurrent Model #443

Closed: pranv closed this issue 7 years ago

pranv commented 9 years ago

Hey,

I've been thinking about recurrent networks and how to implement them. I've looked at Keras's implementation and some other libraries. What I found (or understood) was that most offer a standard step function, be it LSTM, GRU, or something else. One can usually vary the number of inputs and outputs, and that's about it.

So I started thinking about how to allow an arbitrary amount and type of processing per time step, which led me to the idea of a Recurrent model and a corresponding Recurrent container. The basic idea is simple:

  1. Initialize a Recurrent model object, just like others
  2. Add all required layers
  3. When run, it uses theano.scan over all time steps, with the step function defined by the added layers.

LSTM, GRU, and other gating mechanisms would have to be reduced to activation layers (which is what they are). I think this would also help with the construction of things like Highway Networks.

I don't have a ton of experience with RNNs. So does this make any sense? Would it be useful? My understanding is that it would really help support the wide variety of RNN-based neural networks cropping up almost every day. RNNs in 2015 are Conv Nets in 2013.

jmhessel commented 9 years ago

I think this is a great idea! I posted about this in the Google group, too. I think a set of layers that supports recurrent outputs would be awesome (or even a Recurrent model). For instance, I believe the current LSTM implementation backprops error derived only from the last output on classification tasks. While Keras does support outputting sequences at each recurrent timestep, to my knowledge it doesn't support much other than simply feeding them into another RNN.

In practice, a lot of algorithms use the output at every timestep directly in the loss function. As a result, the way to train language models in Keras seems a bit strange. The examples I've seen are trained to predict only the last character/word in a sequence, rather than making a prediction at every timestep. What a lot of researchers do is train RNNs to predict the next output at each timestep, with the final loss being a sum over all timesteps.

The LSTM in http://arxiv.org/pdf/1411.4555.pdf is a good example of this. At each timestep, a softmax is applied to the output, and the next word is predicted. The model's total loss function is designed to be a sum over all timesteps. I don't think something like this is currently possible, though I am fairly new to using Keras, so I could be wrong.

fchollet commented 9 years ago

The model's total loss function is designed to be a sum over all timesteps. I don't think something like this is currently possible, though I am fairly new to using Keras, so I could be wrong.

It is definitely possible: you just have to return the full sequences and set your targets to be sequences as well.
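For instance, a minimal sketch with the Keras API of that era (exact layer signatures varied across early versions; input_dim, nb_classes, X, and y are placeholders), where X has shape (nb_samples, timesteps, input_dim) and y has shape (nb_samples, timesteps, nb_classes), i.e. one target per timestep:

from keras.models import Sequential
from keras.layers.core import TimeDistributedDense
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(LSTM(input_dim, 256, return_sequences=True))  # emit an output at every timestep
model.add(TimeDistributedDense(256, nb_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(X, y)  # the loss accumulates over all timesteps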

The recurrent model idea sounds interesting, but I'm not sure I get it. Here are a few questions:

pranv commented 9 years ago

what are the use cases?

A Recurrent Convolutional Network, or a really deep LSTM (with more complex gating functions). Maybe, in the future, if we have a layer like AttentionWindow (I'm looking into it as well), it could be added to create the common encoder-decoder pair with attention (which has given state-of-the-art performance on a lot of tasks).

what would it make possible that would be impossible without (keep in mind that we will introduce the generic TimeDistributed layer soon)?

The aim is to have normal layers play a role in a recurrent environment, without those layers having to know anything about that environment.

what is the role of our existing recurrent layers in this system?

The existing layers could be added to this model as well. But having gating mechanisms as activations could make them feel a bit out of place.

Edit: accidentally hit comment without completing. Fixed it.

fchollet commented 9 years ago

The aim is to have normal layers play a role in a recurrent environment, without those layers having to know anything about that environment.

Wouldn't that be handled by the TimeDistributed layer?

jmhessel commented 9 years ago

Of course, my mistake -- apologies.

pranv commented 9 years ago

As far as I understand, you're talking about stacking recurrent layers. I'm suggesting that we make each recurrent step a combination of computations from multiple layers.

On Mon, Jul 27, 2015 at 10:39 AM, jmhessel notifications@github.com wrote:

Of course, my mistake -- apologies.


pranv commented 9 years ago

See models like these for more reference:

iskandr commented 9 years ago

@pranv Does that first link use a deep network for gating? It's hard to tell from skimming exactly what's getting built in https://github.com/kyunghyuncho/dl4mt-material/blob/master/session2/nmt.py#L464.

Anyway, I think the idea of having arbitrary user-defined networks compute the gating values is interesting. Has that come up in any papers?

pranv commented 9 years ago

@iskandr I think all new RNNs can be implemented this way. A quick search on arxiv led to these papers by DeepMind:

See these other papers as well:

Basically, any new RNN architecture that doesn't simply stack RNNs could be implemented with this approach.

pranv commented 9 years ago

I think I can come up with an implementation after a few hours of coding, based on what I've learned from the implementation of the Graph models.

Would you like me to try?

elanmart commented 9 years ago

I think it's a great idea. I thought about something similar in the context of sequence generators. Many new RNNs (like attention-enhanced multi-layer decoders) cannot be easily implemented in Keras, which is quite unfortunate.

@pranv are you looking into this? Maybe you could use some help?

RNNs in 2015 are Conv Nets in 2013.

Totally agree.

pranv commented 9 years ago

@elanmart I was hoping to get a 'go' from someone. Now that you have, I'll start work on it.

are you looking into this? Maybe you could use some help?

Definitely

elanmart commented 9 years ago

Do you have any concrete ideas about how the model should work and what should be implemented?

Also, I think that if Keras is to support continuous inputs (e.g. for language models), the main Model class has to be modified so the training data can be arranged such that row i of batch k-1 is the predecessor of row i of batch k in the original input space.
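As a rough illustration of that ordering constraint (pure NumPy; the helper name is made up), one long token stream can be carved into batches so that each row continues across consecutive batches:

import numpy as np

def stateful_batches(stream, batch_size, steps):
    # one contiguous strip of the stream per batch row
    n = len(stream) // batch_size
    rows = np.asarray(stream[:n * batch_size]).reshape(batch_size, n)
    for t in range(0, n - steps, steps):
        # row i of this batch directly continues row i of the previous one
        yield rows[:, t:t + steps], rows[:, t + 1:t + steps + 1]  # inputs, next-step targets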

pranv commented 9 years ago

I think we should have a model that accepts a graph. Rather than just one feed-forward pass over the data, it's run over all time steps. It's very similar to the other RNN models in Keras, except that the step function is a container of sorts, defined by the user.

I think LSTMs and GRU would have to be reduced to activations.

@elanmart I didn't get the shuffling part though

pranv commented 9 years ago

@fchollet I need some advice to go ahead with this. I'll explain what I'm trying to achieve with this kind of model.

Say you want to build a DRAW-like model. The DRAW model has 4 unique parts: a reader, an encoder, a decoder, and a writer.

It also has a sampler and other components, but let's skip those for now.

The reader and writer are attention windows, which can be implemented as layers fairly easily. The encoder and decoder are themselves recurrent, and the whole model composed of these 4 components is also recurrent. There are MLPs before each RNN as well.

Other encoder-decoder architectures are similar, to my knowledge.

My idea for enabling the implementation of these kinds of models was to have a Recurrent model and a corresponding container. This particular implementation would be defined by:

  1. A recurrent container with a Dense layer and an LSTM as an activation, which will act as the encoder and decoder
  2. These containers are added to the Recurrent model, along with attention windows.

Now, I have 2 simple questions:

  1. Do you see this as a valuable thing?
  2. How would you implement this? I keep coming back to a design wherein the model or container is just a wrapper for theano.scan. What would the model/container accept? Graph containers? Layers?

elanmart commented 9 years ago

@pranv I've been thinking about this as well.

I think we should have a Recurrent model + container. It could be trained on its own or used as a layer, which is crucial, for example, in caption generation.

The Recurrent model should be as flexible as possible, therefore it should allow:

Also, about statefulness: I think every model (container) should have a list of shared variables, internal_states. Of course, the updates would have to be passed to the model by each layer with stateful == True and then added to the updates returned by the optimizer.
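A minimal Theano sketch of that idea (sizes are illustrative; the layer owns the shared state and hands its update to the model at compile time):

import numpy as np
import theano
import theano.tensor as T

BATCH, IN_DIM, HID_DIM = 32, 16, 64

x = T.matrix('x')  # (batch, input_dim)
W = theano.shared(np.random.randn(IN_DIM, HID_DIM).astype('float32'))
U = theano.shared(np.random.randn(HID_DIM, HID_DIM).astype('float32'))

# the internal state lives in a shared variable owned by the layer
h = theano.shared(np.zeros((BATCH, HID_DIM), dtype='float32'), name='internal_state')

new_h = T.tanh(T.dot(x, W) + T.dot(h, U))

# this state update would be collected by the model and compiled in
# alongside the updates returned by the optimizer
step = theano.function([x], new_h, updates=[(h, new_h)])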

I think a Recurrent model is essential for Keras, since the currently implemented recurrent layers are not really useful for new architectures.

pranv commented 9 years ago

@elanmart Agreed!

I just need to wrap my head around what the user actually sees and how it works as a coherent system, much like Sequential and Graph.

Also, I don't think there are other libraries that do this well. That makes it all the more important.

elanmart commented 9 years ago

@pranv I agree, other libraries are also deficient in this area.

Let's see. I think that:

Then we need to somehow plug this into scan.

Edit: we would also have to handle prediction, where the output is fed back as the input step by step.

pranv commented 9 years ago

@elanmart

Recurrent model will accept single layers, just as Graph does

Won't this make the two models' code very redundant?

Every time a layer is added, the user specifies how it is computed given the input and how it is computed given all previous internal_states.

Do we need internal states, or are they taken out as outputs and re-fed as inputs? Also, a way to split inputs will be necessary if we go with the latter choice.

Is the output of this layer stateful and returned to the parent Model?

This is a good idea. Stateful is better than not.

fchollet commented 9 years ago

Do you see this as a valuable thing?

I'm still not sure what this container/model could do that the upcoming TimeDistributed layer wouldn't. TimeDistributed would be a layer-level implementation of what you're describing (a wrapper for scan). It allows any Keras layer to be used in a recurrent architecture (but layers that make use of past states will still need to be implemented as recurrent layers, like today).
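For reference, the generic wrapper eventually shipped in Keras as keras.layers.TimeDistributed; a minimal sketch of its usage (shapes are illustrative):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(10, 16)))  # (timesteps, features)
model.add(TimeDistributed(Dense(8)))  # the same Dense applied independently at each timestep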

Adding only a layer to the codebase is much simpler and carries a much lower risk of issues than adding an entire new container. What is the justification for the model?

I think the fundamental issue here is the notion of computational containment. In Keras, all computation happens in layers, which makes it easy to write, understand, and debug. This proposal seems to distribute computation outside of layers as well, into the container (which is supposed to be a dumb data router). Specifically, the following:

Every time a layer is added, the user specifies how it is computed given the input and how it is computed given all previous internal_states.

How would you implement this. I keep getting back to a design where in the model or container is just a wrapper for theano.scan.

Why not a wrapper for theano.scan? What is this supposed to do that scan doesn't?
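For context, here is a minimal vanilla-RNN sketch of what scan alone provides (dimensions are illustrative); the proposal essentially amounts to letting users assemble the step function from layers:

import numpy as np
import theano
import theano.tensor as T

HID = 64
X = T.tensor3('X')  # (timesteps, batch, input_dim)
W = theano.shared(np.random.randn(16, HID).astype('float32'))
U = theano.shared(np.random.randn(HID, HID).astype('float32'))

def step(x_t, h_tm1):
    # one timestep; any composition of layer ops could go here
    return T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))

h0 = T.zeros((X.shape[1], HID))
hs, updates = theano.scan(step, sequences=X, outputs_info=h0)  # hs: (timesteps, batch, HID)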

What would the model/container accept? Graph containers? Layers?

Containers and layers have the same API. New containers can be built using either layers or Sequential/Graph containers (with Graph containers limited to 1 input and 1 output in this use case).

pranv commented 9 years ago

I'm still not sure what this container/model could do that the upcoming TimeDistributed layer wouldn't. TimeDistributed would be a layer-level implementation of what you're describing (a wrapper for scan).

This is not aimed at making each layer TimeDistributed. It is much more like making the Graph container time-distributed. Take the above example of the DRAW architecture: how would you implement such a thing using TimeDistributed layers?

Point: the exact computation at each step of the RNN is a combination of multiple layers' worth of computation.

If the Graph container is being made time-distributed, similar things can be achieved with it.

fchollet commented 9 years ago

If the Graph container is being made time-distributed, similar things can be achieved with it.

Using the TimeDistributed layer, any Graph (1-input, 1-output) or Sequential container can be made time-distributed. It would also be possible to extend the API to generalize this to multi-input, multi-output Graphs. Does this answer your concerns?

pranv commented 9 years ago

Does this answer your concerns?

Yes. This idea will be unnecessary and redundant then.

fchollet commented 9 years ago

Would it? To be honest, I'm still not sure I get it, so I can't be sure that the TimeDistributed layer (again, it would just be a wrapper for scan) is really equivalent to what you propose. So, are you sure about it?

In any case, it's important that you understand why the TimeDistributed layer could also apply to any container (as long as the 1-input 1-output rule is respected): because layers and containers implement the same API. They are completely interchangeable. In the future we might get rid of the 1-input 1-output rule by extending the layer API to multiple inputs and outputs (like for Graph in the general case).
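For illustration, this interchangeability is exactly what makes container wrapping possible; a sketch in terms of the wrapper that later shipped as TimeDistributed:

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

inner = Sequential()  # a 1-input, 1-output container
inner.add(Dense(32, activation='relu', input_shape=(16,)))
inner.add(Dense(8))

outer = Sequential()
outer.add(TimeDistributed(inner, input_shape=(10, 16)))  # inner applied at each of the 10 timesteps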

pranv commented 9 years ago

@fchollet, I'll post some code for an architecture based on a hypothetical Recurrent model. Maybe it'll help you understand what we're trying to do with this.

pranv commented 9 years ago

I'll keep the issue open so that someone else can offer a take on this. @wxs, @soumith, please do have a look.

pranv commented 9 years ago

@fchollet this Torch module does something similar: https://github.com/Element-Research/rnn/blob/master/README.md#recurrent

pranv commented 9 years ago

A DRAW implementation in Theano is available at https://github.com/jbornschein/draw.

@recurrent(sequences=['u'], contexts=['x'],
           states=['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'],
           outputs=['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'])
def apply(self, u, c, h_enc, c_enc, z, kl, h_dec, c_dec, x):
    x_hat = x - T.nnet.sigmoid(c)
    r = self.reader.apply(x, x_hat, h_dec)
    i_enc = self.encoder_mlp.apply(T.concatenate([r, h_dec], axis=1))
    h_enc, c_enc = self.encoder_rnn.apply(states=h_enc, cells=c_enc, inputs=i_enc, iterate=False)
    z, kl = self.sampler.sample(h_enc, u)

    i_dec = self.decoder_mlp.apply(z)
    h_dec, c_dec = self.decoder_rnn.apply(states=h_dec, cells=c_dec, inputs=i_dec, iterate=False)
    c = c + self.writer.apply(h_dec)
    return c, h_enc, c_enc, z, kl, h_dec, c_dec

The whole thing is recurrent. The hypothetical Keras equivalent of the encoder segment:

model = Recurrent()
model.add_input(name='x')

model.add_states(['c', 'h_enc', 'c_enc', 'z', 'kl', 'h_dec', 'c_dec'])  # states are passed to each recursive call

model.add(Activation('sigmoid'), input='c', name='sig_c')
model.add(ErrorLayer(), inputs=['x', 'sig_c'], name='x_hat')

model.add(AttentionReader(), inputs=['x', 'x_hat', 'h_dec'], name='r')
model.add(Dense(num_inputs, num_outputs), inputs=['r', 'h_dec'], merge_mode='concat')
model.add(LSTM(), name='h_enc')  # some modification to the LSTM layer could be needed to return the hidden state

model.add(Sampler(), inputs=['h_enc', 'u'], name='z')
This is really crude, but I hope it conveys the idea.

elanmart commented 9 years ago

@fchollet considering a basic example of soft attention*, I can't really see how TimeDistributedDense would be useful.

When computing the hidden state h for our RNN, we'd like to rely on a context vector c, which itself depends on h_tm1. I don't think this can be done with the current layers plus the proposed TimeDistributedDense?

I'd love to hear your thoughts on this; I might not fully understand how TimeDistributedDense would work.

*paper, sec. 3.1

fchollet commented 9 years ago

This is really crude, but I hope it conveys the idea.

Actually this attempt at sketching out an API is 10x more informative than the previous discussion. I think I get it now. Thanks!

I do believe this can be implemented via Graph models and TimeDistributedDense (as could any other recurrent model). The general mechanism is that all processing that is synchronous (every timestep processed in the same way, independently of the others) gets wrapped into TimeDistributedDense, and all processing that is non-synchronous (dependent on past timesteps, i.e. recurrence) has to happen inside a recurrent layer. In this example, we would have to develop custom recurrent layers to make DRAW work.
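A schematic of that split, using the generic wrapper that later shipped in Keras as TimeDistributed (sizes are illustrative):

from keras.models import Sequential
from keras.layers import Dense, LSTM, TimeDistributed

model = Sequential()
# synchronous: identical, state-free processing at every timestep
model.add(TimeDistributed(Dense(64, activation='relu'), input_shape=(None, 32)))
# non-synchronous: depends on past timesteps, so it lives in a recurrent layer
model.add(LSTM(64, return_sequences=True))
# synchronous again: a per-timestep readout
model.add(TimeDistributed(Dense(10, activation='softmax')))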

However, I do believe that what you propose has value, essentially UX value. It introduces the notion of state, which is potentially useful. It would also allow advanced recurrent models to be created without developing as many custom recurrent layers (though the complexity would simply be passed on to the model architecture, so it's not clear yet whether that's a gain).

When computing the hidden state h for our RNN, we'd like to rely on a context vector c, which itself depends on h_tm1. I don't think this can be done with the current layers plus the proposed TimeDistributedDense?

You would have to develop a Graph-like recurrent layer. It's no big deal.
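The core of such a layer would be a step function along these lines; a hypothetical per-example sketch of soft attention inside theano.scan, where attended has shape (n_annotations, annotation_dim), h_tm1 has shape (hid,), and the weight names W_a, U_a, v_a, W_h, U_h, C_h are all made up for illustration:

import theano.tensor as T

def step(x_t, h_tm1, attended, W_a, U_a, v_a, W_h, U_h, C_h):
    # the context c depends on the previous hidden state h_tm1 (cf. sec. 3.1 of the paper)
    e = T.dot(T.tanh(T.dot(attended, W_a) + T.dot(h_tm1, U_a)), v_a)  # scores: (n_annotations,)
    alpha = T.nnet.softmax(e.dimshuffle('x', 0))[0]                   # attention weights
    c = T.dot(alpha, attended)                                        # context: (annotation_dim,)
    return T.tanh(T.dot(x_t, W_h) + T.dot(h_tm1, U_h) + T.dot(c, C_h))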

fchollet commented 9 years ago

TL;DR: I think this is interesting, and I'll think about it. Feel free to develop mock APIs or actual code if you feel like it (though there is no guarantee that we will use it...).

hugman commented 9 years ago

@pranv did you try to implement the DRAW attention, or the soft attention @elanmart mentioned, using the Keras API?

I checked Cho's soft-attention code (https://github.com/kyunghyuncho/dl4mt-material). I am trying to convert it to the Keras API, but it is not trivial.

Any help and advice?

elanmart commented 9 years ago

Hi. I was thinking about the Recurrent model last weekend. I think it would be nice to have it in Keras, since it could be backend-agnostic and should ideally be more convenient than implementing custom layers.

Regardless of whether @fchollet agrees or not, as an exercise (and for fun) I decided to implement it. Before implementing the Recurrent model itself, I'd like to first get working multi-input/multi-output layers, since I think they would be extremely nice to have.

The actual code is at a very early stage, but the initial API I came up with goes something like this:


# --------------------- Create the encoder using standard Keras' functionality
encoder = Sequential()

encoder.add(Embedding(VOCAB_SIZE, 512))

encoder.add(GRU(input_dim=512, output_dim=512,
                bidirectional=True, return_sequences=True, mode='concat'))

encoder.add(GRU(input_dim=512, output_dim=512,
                bidirectional=True, return_sequences=True, mode='concat'))

# --------------------- Initialize the decoder
decoder_core = Recurrent()
decoder_core.add_input('encoded_sentence', sequence=False)
decoder_core.add_input('x', sequence=True)

decoder_core.add_state(name='h1', size=512, init='orthogonal',
                       learnable=True, stateful=False)
decoder_core.add_state(name='h2', size=512, init='orthogonal',
                       learnable=True, stateful=False)

# --------------------- Compute attention vectors
attention = Graph()
attention.add_input('conditioner')
attention.add_input('attended')

attention.add_node(RepeatConcat(), 
                   inputs=['conditioner', 'attended'], 
                   merge_mode='mapping',
                   input_map={'2d':'conditioner', '3d':'attended'},
                   name='concatter')

attention.add_node(TimeDistributedDense(1024, 256, activation='tanh'), 
                   input='concatter', 
                   name='att_hidden_layer')

attention.add_node(TimeDistributedDense(256, 1, activation='tanh'),
                   input='att_hidden_layer',
                   name='att_word_scorer') 

attention.add_node(Permute(drop_axes=2, dimshuff=(0, 1,)), 
                  input='att_word_scorer',
                  name='att_dimshuffle')

attention.add_node(Activation('softmax'),
                  input='att_dimshuffle',
                  name='att_softmax')

attention.add_node(WeightedSum(),
                   inputs=['att_softmax', 'attended'],
                   merge_mode='mapping',
                   input_map={'weights':'att_softmax', 'objects':'attended'},
                   name='context_creator')

attention.add_output(name='contexts', input='context_creator')

# --------------------- Compute the first hidden layer's state
h1 = Graph()
h1.add_input('x')
h1.add_input('h_tm1')
h1.add_input('contexts')

h1.add_node(Dense(input_dim=(EMB_SZ + HID_SZ + ATT_SZ),
                  output_dim=256,
                  activation='GRU'), 
            inputs=['x','h_tm1','contexts'],
            merge_mode='concat',
            name='h2h')

h1.add_output(input='h2h',
              name='hid_update')

# --------------------- Compute the second hidden layer's state
h2 = Graph()
h2.add_input('x')
h2.add_input('h_tm1')

h2.add_node(Dense(input_dim=(2 * HID_SZ),
                  output_dim=256,
                  activation='GRU'), 
            inputs=['x','h_tm1'],
            merge_mode='concat',
            name='h2h')

h2.add_output(input='h2h',
              name='hid_update')

# --------------------- Compose the core decoder
decoder_core.add_node(attention,
                     input_map={'conditioner':'h2_tm1',
                                'attended':'encoded_sentence'},
                     name='attention')

decoder_core.add_node(h1,
                     input_map={'x':'encoded_sentence',
                                'h_tm1':'h1_tm1',
                                'contexts':'attention'},
                     name='h1')

decoder_core.add_node(h2,
                     input_map={'x':'h1',
                                'h_tm1':'h2_tm1'},
                     name='h2')

# --------------------- Build the model used for training
decoder_learner = Sequential()
decoder_learner.add(decoder_core(return_sequences=True))
decoder_learner.add(TimeDistributedDense(input_dim=512,
                                       output_dim=TARGET_VOCAB_SZ,
                                       activation='softmax'))

decoder_learner.compile(optimizer=RMSProp(), loss='categorical_crossentropy', class_mode='categorical')

# --------------------- Build the model used for sampling
# Connect recurrent architectures, or build this later by copying weights from decoder_core?
decoder_sampler = Recurrent()
tttwwy commented 8 years ago

@elanmart is there any update on your Keras-based NMT model code? Looking forward to it.

Sri-Harsha commented 8 years ago

@elanmart can you suggest how to implement sequence classification in Keras, where each sequence has a different length, without padding?

jadore801120 commented 8 years ago

@Sri-Harsha I think padding is unavoidable in Keras, if I'm not mistaken. But if you are using an Embedding layer, there is a mask_zero argument you can set to True to mask out the padding values. It should help you deal with sequences of varying lengths.
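A minimal sketch (VOCAB_SIZE, MAXLEN, NUM_CLASSES, and sequences are placeholders; note that with mask_zero=True, index 0 can no longer be used as a real token):

from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

X = pad_sequences(sequences, maxlen=MAXLEN)  # zero-pads shorter sequences by default

model = Sequential()
model.add(Embedding(VOCAB_SIZE, 64, mask_zero=True))  # padded (zero) timesteps are masked
model.add(LSTM(64))  # the mask keeps padded steps from affecting the final state
model.add(Dense(NUM_CLASSES, activation='softmax'))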

jmhessel commented 8 years ago

@jadore801120 I think you could have a separate model for each possible sequence length, with the models sharing embedding/LSTM weights. Each batch could be formed from sequences of a single length, and train_on_batch could then be called on the model for that length. Masking might be more reasonable, though.
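A rough sketch of that bucketing idea (assuming, for simplicity, a single model built to accept variable-length input, e.g. input_shape=(None,), rather than per-length copies):

from collections import defaultdict
import numpy as np

buckets = defaultdict(list)
for seq, label in zip(sequences, labels):
    buckets[len(seq)].append((seq, label))

for length, pairs in buckets.items():
    X = np.array([s for s, _ in pairs])  # uniform length, so no padding needed
    y = np.array([l for _, l in pairs])
    model.train_on_batch(X, y)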

reachbp commented 7 years ago

@elanmart Hi, I wanted to know whether you were able to complete the encoder-decoder network with attention? I'm stuck on how to add a Sequential layer to encode the multiple sentences of a document into sentence vectors. I'm referring to the first two layers of the image below. Would you have any leads?

[screenshot: diagram of the referenced model architecture]