keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

"real time" recurrent nets #98

Closed lemuriandezapada closed 8 years ago

lemuriandezapada commented 9 years ago

Hey guys,

I was wondering how the initial internal states in a recurrent layer are dealt with? So far it appears they are reset at every run. Is there any way to preserve them?

I'd like to be able to feed a .predict_proba() function data one time step at a time for a time series task, as the points come in, without also feeding the entire history all over again. Is this somehow possible? Thanks
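Roughly the usage I have in mind (a hypothetical sketch - none of this API exists, and incoming_points / reset_states are made-up names):

    # Hypothetical usage sketch: the layer keeps its hidden state between
    # calls instead of resetting it, so each call advances the series one step.
    for x_t in incoming_points():        # points arrive one at a time
        p_t = model.predict_proba(x_t)   # state carries over from the last call
    model.reset_states()                 # explicit reset when the series ends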

ssamot commented 9 years ago

Adding on top of the comment above - it should be possible to train incrementally as well, if you can keep state. Otherwise one would possibly need to remember really large sequences and replay them if training online.

fchollet commented 9 years ago

I agree that we need some form of stateful RNNs, that would be particularly useful for sequence generation. It's unclear yet if that should be set up as a mode of existing RNNs, or as different layers altogether.

Anybody interested in looking at it?

ssamot commented 9 years ago

Well - my personal preference would be to use the same layer - something along the lines of stateless = true by default. You would only need to somehow preserve h_tm1, c_tm1 between calls to train - right?

ssamot commented 9 years ago

OK - I'll try doing an implementation/test with simple character generation and use the old state to init `theano.scan`. Let's see...

ssamot commented 9 years ago

This seems to require a much deeper rewrite than I initially anticipated - whatever theano.scan uses as its initial input seems to be compiled once and stay fixed. One might need to save state in shared variables somehow? Any ideas?

lemuriandezapada commented 9 years ago

Is there no way to read or set the internal states by hand? I wouldn't really touch the training procedure, but the running/prediction procedure needs to have some sort of persistent mode.

ssamot commented 9 years ago

The reading you can do, but not the setting - AFAIK - the scan operation is symbolic and the internal state is re-initialised after every pass. I cannot see how the state can be set manually in an easy manner.

lemuriandezapada commented 9 years ago

That's a bummer. That would mean reading out the weights and reimplementing the whole net in a home-grown, slower, numpy manner.

vzhong commented 9 years ago

Can't you set the initial internal states through a shared variable? For example here's the Theano example of the recurrence for a vanilla RNN:

    def recurrence(x_t, h_tm1):
        # one timestep: new hidden state from the input x_t and previous state h_tm1
        h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                             + T.dot(h_tm1, self.wh) + self.bh)
        s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
        return [h_t, s_t]

    # self.h0 enters as the initial hidden state via outputs_info
    [h, s], _ = theano.scan(fn=recurrence,
                            sequences=x,
                            outputs_info=[self.h0, None],
                            n_steps=x.shape[0])

    p_y_given_x_sentence = s[:, 0, :]
    y_pred = T.argmax(p_y_given_x_sentence, axis=1)

In this case can't you change the initial hidden state by setting self.h0?

ssamot commented 9 years ago

Yes of course - but h is symbolic, right? You can do something like `self.h0 = shared_zeros(shape=(1, self.output_dim))` to create the shared variable - but how do you set h0 = h?

You can't do:

    f = theano.function(inputs=[self.h0], outputs=outputs)
    self.h0.set_value(f()[-1])

because a shared variable cannot be used as an explicit input to theano.function. (updated for clarity)
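The pattern that might work is to leave the shared variable out of the inputs and write the new state back through `updates` - a minimal standalone sketch (not Keras code, toy dimensions):

    import numpy as np
    import theano
    import theano.tensor as T

    # A shared variable holds the state; scan uses it as the initial state,
    # and an `updates` pair writes the final state back after every call,
    # so the state persists between calls.
    x = T.matrix('x')                                  # (timesteps, dim)
    W = theano.shared(np.eye(3, dtype='float32'))
    h0 = theano.shared(np.zeros(3, dtype='float32'))   # persistent state

    def step(x_t, h_tm1):
        return T.tanh(x_t + T.dot(h_tm1, W))

    h, _ = theano.scan(step, sequences=x, outputs_info=h0)
    f = theano.function([x], h, updates=[(h0, h[-1])])  # h0 <- last state

    # h0.set_value(...) can also set the state manually between calls.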

fchollet commented 9 years ago

Anything new on this front? I will try to repro Karpathy's RNN experiments in Keras, and add everything that's needed in the process.

ssamot commented 9 years ago

Just an addition - I am currently struggling to come up with a good API for this. If you are to do state properly, apart from keeping state, you will need a mask of some sort that will tell you when to keep the current activation unchanged and possibly ignore padded elements.

Thus you would need to change get_output(self, train) to something like get_output(self, train, mask_train, mask_run), where each mask would be a 3d tensor with a 0/1 value per element: the first marking which elements to train on, the second marking whether to keep the hidden activations or not. This would change the overall internal API - does it make sense?

fchollet commented 9 years ago

Thus you would need to change get_output(self, train) to something like get_output(self, train, mask_train, mask_run), where each mask would be a 3d tensor with a 0/1 value per element: the first marking which elements to train on, the second marking whether to keep the hidden activations or not. This would change the overall internal API - does it make sense?

If you need a mask, why not make it an attribute of the layer? Then there would be no need to change the overarching API. But to be honest I am not sure I see what you are describing - could you provide more details?

ssamot commented 9 years ago

You cannot make it an attribute of the layer unless there is another way to get the batch you are sending. Imagine a scenario where you have to read 10 characters, then output a single character, keep the state, and output another character after receiving three more characters.

If you don't have a mask you would have to have a padded 3d tensor - not that efficient. Does this make more sense now?
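Concretely, a mask for the scenario above could look something like this (a hypothetical layout - shapes and names are made up):

    import numpy as np

    # 13 timesteps: 10 input characters, then 3 more.
    # 1 = produce/learn an output at this step, 0 = just consume input.
    timesteps = 13
    mask_train = np.zeros((1, timesteps, 1), dtype='int8')  # (batch, time, 1)
    mask_train[0, 9, 0] = 1    # output after the first 10 characters
    mask_train[0, 12, 0] = 1   # output after three more characters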

jonilaserson commented 9 years ago

A. I think it would help if you write what use-cases of stateful RNNs you would like to be able to model.

B. I'm not sure I see the problem in the example you've given. Why can't you output a 'null' character every time you don't have a character to output, and treat it as a character-to-character RNN?

C. Again, I'm not sure if that was the issue you were facing, but if it was about making predictions on a batch, then here is a thought: maybe it is ok to allow only one input sequence, instead of a batch, when doing prediction (feedforward) on stateful RNNs? The training can still be in batch because you can provide the mask in advance.


ssamot commented 9 years ago

B. I'm not sure I see the problem in the example you've given. Why can't you output a 'null' character every time you don't have a character to output, and treat it as a character-to-character RNN?

What's the null character for audio? (well yes it's probably some kind of silence, but you can see the point). In fact your network might go through a phase of trying to predict this "silence". Padding of any kind is horrible - it leaves artifacts all over the place, makes the training harder and produces unexpected behaviour.

My point is that you need to be able to control the state of the RNN - when it should be updated, when not, when there are things to learn - in a generic fashion. Otherwise you will be changing the API from use case to use case.

Lasagne tries to do something along these lines, but does not go far enough in my view - see get_output_for:

https://github.com/craffel/nntools/blob/recurrent/lasagne/layers/recurrent.py

fchollet commented 9 years ago

Can you give a code example of what you are trying to do? Just a generic outline would be enough, not working code.

ssamot commented 9 years ago

Let's assume you are given this simple string:

"Bob moved to the bedroom. Where is bob? Bedroom. Bob moved to the Garden. Where is Bob? Garden"

Suppose now you want to learn to answer the questions posed above. As it currently stands in keras you would need to create training examples like this:

Training examples (X, y):

    ["Bob moved to the bedroom. Where is bob? "], ["B"]
    ["Bob moved to the bedroom. Where is bob? B"], ["e"]
    ["Bob moved to the bedroom. Where is bob? Be"], ["d"]
    ["Bob moved to the bedroom. Where is bob? Bed"], ["r"]

    ...

    ["Bob moved to the bedroom. Where is bob? Bedroom. Bob moved to the Garden. Where is Bob? Garde"], ["n"]

In the current keras implementation you would need to pad almost everything to the maximum length with some "null" characters. Notice that you don't care about predicting the next character - only the ones that come after a question mark. You will also be outputting a sequence of unknown length incrementally, by outputting characters one by one.

Maybe there is an obvious way of solving this and I cannot see it - dunno. How would you handle this?

jonilaserson commented 9 years ago

Can't you use keras to train the network on the samples above by providing this training set?

x = "Bob moved to the bedroom. Where is Bob?bedroom" y = "**bedroom."

x = "Bob moved to the garden. Where is Bob?garden" y = "*****garden."


ssamot commented 9 years ago

For a start, your second training sample would have to be something like this:

    x = "Bob moved to the bedroom. Where is Bob?bedroom. Bob moved to the garden. Where is Bob?garden"
    y = "**garden."

Suppose now Bob had done 50K irrelevant things before going to the garden and after going to the bedroom. Where does the padding stop? How many different copies would you need for something that can essentially be treated as at most two training examples? (Actually the same thing applies when trying to predict the next character - of course you can do it without controlling state, but it would be very inefficient.)

jonilaserson commented 9 years ago

I see. So the answer might need to be a seq-to-seq LSTM (where LSTM1 encodes the story followed by a question, and LSTM2 receives the last state of LSTM1 and decodes the answer).

LSTM1 could be a many-inputs-to-1-output LSTM (which Keras has). LSTM2 could be a many-to-many LSTM, but the problem is how to plug the input into LSTM2.

Also you would probably want a multi-layer LSTM, so it will be a departure from the Sequential model, since you will be adding both a layer on top and a layer "to the right".


lemuriandezapada commented 9 years ago

You can just use the RepeatVector layer in Keras to provide the LSTM1 state to LSTM2.

As for getting the neuron state, how about treating it like a sort of separate output/input that you can just set and retrieve? If it's too hard to have proper persistent states, maybe a different kind of layer could be concocted, with an extra input and output that stand for the internal state... just an idea.
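In bare Theano terms, the idea would look roughly like this (a standalone sketch, not Keras code, toy dimensions):

    import numpy as np
    import theano
    import theano.tensor as T

    # "State as an explicit input/output": the compiled function takes the
    # previous hidden state as an ordinary input and returns the new one,
    # so the caller owns persistence between calls.
    x = T.matrix('x')          # (timesteps, dim)
    h_in = T.vector('h_in')    # state supplied by the caller
    W = theano.shared(np.eye(3, dtype='float32'))

    def step(x_t, h_tm1):
        return T.tanh(x_t + T.dot(h_tm1, W))

    h_seq, _ = theano.scan(step, sequences=x, outputs_info=h_in)
    f = theano.function([x, h_in], [h_seq, h_seq[-1]])

    state = np.zeros(3, dtype='float32')
    # outputs, state = f(next_chunk, state)   # feed chunks one at a time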

wxs commented 9 years ago

@vzhong @ssamot For interest: I am persisting state in the SimpleRNN on one of my projects by setting up the state as a shared variable, adding a scan output that tracks the iteration count, and then integrating the initial state like this:

    def __init__(self, ...):
        ...
        self.h0 = shared_zeros((self.output_dim,))  # learnable initial state
    ...
    def _step(self, x_t, h_tm1, n, h0, u):
        # n counts iterations: on the first step (n == 0) mix h0 into the
        # (zero) previous state; afterwards leave h_tm1 untouched.
        h_t = self.activation(
            x_t + T.dot(h_tm1 + T.switch(T.eq(n, 0), h0, T.zeros_like(h0)), u))
        return h_t, n + 1
    ...
    outputs, updates = theano.scan(
        self._step,
        sequences=x,
        outputs_info=[T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1), 0],
        non_sequences=[self.h0, self.U],  # static inputs to _step
        truncate_gradient=self.truncate_gradient)
    if self.return_sequences:
        return outputs[0].dimshuffle((1, 0, 2))
    return outputs[0][-1]

I could not pass the initial state self.h0 directly into outputs_info as you were attempting, because that seems to expect to be passed a constant. This was the solution I came up with, although it's not exactly pretty.

fchollet commented 9 years ago

Let's see. You denote the (shared variable) state self.h0.

When computing the state for the next timestep, you use as past state:

h_tm1 + T.switch(T.eq(n, 0), h0, T.zeros_like(h0))

So if n == 0 (first timestep), return h_tm1 + h0, else return h_tm1. Since h_tm1 starts at 0, it means that at the first iteration you initialize the state to self.h0[:, 0, :].

Shouldn't that be self.h0[:, -1, :] instead? I.e. the last memory from the previous iteration? Or maybe I'm not correctly following your approach?

wxs commented 9 years ago

@fchollet sorry, I should have been a bit clearer.

I needed the initial hidden state to be a parameter of the model that gets optimized over; this is a solution for that, hence initializing the state to h0 (a shared variable). In my case I was not sequentially feeding in timesteps.

However I posted this as an example of how I solved the problem of how to control the initial state of the RNN. This technique could be used for the sequential input issue with a minibatch size of 1: you could use a similar approach to set h0 for the next batch (but of course if you're doing that, don't put h0 in the parameters list so it doesn't get updated by SGD).

I posted here partly because @vzhong was having trouble passing in a shared variable to the scan function, this is an example of how it could be done.

donglixp commented 9 years ago

In Lasagne, they implemented a new recurrent layer with a mask: https://github.com/craffel/nntools/blob/recurrent/lasagne/layers/recurrent.py

fchollet commented 9 years ago

@donglixp is it stateful?

fchollet commented 9 years ago

So, does anybody have a solution regarding making recurrent layers stateful, at this time?

lemuriandezapada commented 9 years ago

I would also like to know. It's quite an important issue. Do other frameworks have a solution to this?

lemuriandezapada commented 9 years ago

I take it it can't really be done then. Safest bet is to just export the weights and reimplement the computation in python?
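Something like this, I suppose (a rough sketch assuming a plain tanh RNN, with weights W, U, b exported from the trained model):

    import numpy as np

    # Numpy fallback: the caller keeps the hidden state alive between steps.
    class NumpyRNN(object):
        def __init__(self, W, U, b):
            self.W, self.U, self.b = W, U, b
            self.h = np.zeros(U.shape[0])   # persistent hidden state

        def step(self, x_t):
            # one "real time" step: update and return the hidden state
            self.h = np.tanh(x_t.dot(self.W) + self.h.dot(self.U) + self.b)
            return self.h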

fchollet commented 9 years ago

I take it it can't really be done then.

Why? Just because nobody has posted a solution so far doesn't mean it can't be done. I haven't really looked into it myself so far.

fchollet commented 9 years ago

Let's consider the batch behavior of a stateful LSTM layer: should it consider the samples in a batch to be successive timesteps of a single sequence, or should it consider that the samples in a batch are independent, but that the next batch will provide the samples that come chronologically next (i.e. batch_2[i] is the successor to batch_1[i] for all i)?

If the former, then it should be implemented as an entirely new layer. If the latter, then it can be implemented as an option in the existing layer.

wxs commented 9 years ago

Perhaps I'm missing something here: after I do a weight update, the correct hidden states for that batch's input change, so my previously calculated hidden states are no longer correct as input to the next batch. So unless you skip the weight update after each mini-batch, you can't bring the state forward to the next batch.

fchollet commented 9 years ago

@wxs the fact that the weights of the network are continuously changing ("continuous" is the keyword here: each batch brings an epsilon change) certainly does not negate the notion of state. In the same way, in a feedforward network, after you apply some update to layer 1, the following layer 2 can still provide accurate predictions from the output of layer 1...

lemuriandezapada commented 9 years ago

Yes, the weight change doesn't have much to do with the state. Momentum, for example, lets the gradient computed under the previous (already changed) weights contribute to the current update. So the fact that the weights can change shouldn't have much of an impact.

@fchollet I think the dilemma in this thread was 'real time' network execution, not network training. Training could for now remain the same instead of also having a real time kind of implementation.

If you want to do real time training then the second option would make more sense, imo.

elanmart commented 9 years ago

I think Blocks has an implementation of sequence generators, maybe it would be helpful for you guys to look into it? Just an idea, I don't know Theano well enough yet to contribute anything constructive.

kylemcdonald commented 9 years ago

I think this is the implementation @elanmart is referring to: https://github.com/mila-udem/blocks/blob/master/blocks/bricks/recurrent.py#L226-L233 - but comparing it to keras, I'm also unable to contribute anything constructive :)

fchollet commented 9 years ago

or should it consider that the samples in a batch are independent, but that the next batch will provide the samples that come chronologically next (i.e. batch_2[i] is the successor to batch_1[i] for all i)?

Let's go with this behavior, and let's implement it as an option in existing recurrent layers (stateful keyword argument in constructor, False by default).

I believe this can be easily achieved by simply storing the last output and last memory at the end of a batch (e.g. as class attributes of the layer?), then passing these as outputs_info in the scan loop of the next batch. Remarks, concerns?
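A rough sketch of what that could look like (illustrative names and a plain tanh RNN, not an actual implementation):

    import numpy as np
    import theano
    import theano.tensor as T

    # The last hidden state lives in a shared variable; when stateful=True
    # it seeds scan on the next batch, and an update writes the new state back.
    class StatefulRNNSketch(object):
        def __init__(self, input_dim, output_dim, batch_size, stateful=False):
            self.stateful = stateful
            self.W = theano.shared(np.zeros((input_dim, output_dim), dtype='float32'))
            self.U = theano.shared(np.zeros((output_dim, output_dim), dtype='float32'))
            self.h_last = theano.shared(np.zeros((batch_size, output_dim), dtype='float32'))
            self.updates = []

        def _step(self, x_t, h_tm1):
            return T.tanh(T.dot(x_t, self.W) + T.dot(h_tm1, self.U))

        def get_output(self, x):  # x: (timesteps, batch_size, input_dim)
            init = self.h_last if self.stateful else T.zeros_like(self.h_last)
            h, _ = theano.scan(self._step, sequences=x, outputs_info=init)
            if self.stateful:
                self.updates = [(self.h_last, h[-1])]  # batch_2[i] follows batch_1[i]
            return h

        def reset_states(self):
            self.h_last.set_value(np.zeros_like(self.h_last.get_value()))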

ssamot commented 9 years ago

@fchollet Just a clarification - I assume the batches are user controlled - the network is going to save state after x timesteps and the next batch will have to come from the user? If this is the case you would need to cycle through batches, and the network might forget what it has learned in the past if you train "too much" on a batch. You will also need some kind of layer resetting mechanism (but that's not that hard to do).

In case where you have just a single long input, this might prove slower than your other option, but I think it's a great step forward.

elanmart commented 9 years ago

@kylemcdonald I was thinking about https://github.com/mila-udem/blocks/blob/master/blocks/bricks/sequence_generators.py

harpone commented 9 years ago

IMO 'stateful' should definitely default to True... why would you want to reset the state anyway? I'm assuming the batches are also in chronological order(?)

jramapuram commented 9 years ago

So is this basically a discussion of implementing RTRL? If so, just storing the output and last memory is not sufficient. You need to compute the entire derivative dh/dTheta at time t and use it at time t+1 (h = hidden layer and Theta = [W, b, recurrence]). Note that this is the full derivative, with dims |h| x |Theta| (i.e. memory intensive).

This is contrary to what is currently being done, i.e. all the recurrences are unfolded and then Theano is used to automatically differentiate the entire thing (hence the delay in graph construction). To be truly 'real time' you either need to implement RTRL or some form of reduced RTRL (e.g. some sparse gradient approximation).
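For reference, the standard RTRL sensitivity recurrence (notation assumed here, not from this thread): with $h_t = f(a_t)$ and $a_t = W x_t + U h_{t-1} + b$, the sensitivity matrix $P_t = \partial h_t / \partial \Theta$ obeys

$$P_t = \operatorname{diag}\big(f'(a_t)\big)\left(\frac{\partial a_t}{\partial \Theta} + U\,P_{t-1}\right),$$

so $P_t$ is carried forward online instead of unrolling the graph; its $|h| \times |\Theta|$ size is exactly the memory cost mentioned above.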

harpone commented 9 years ago

Correct me if I'm wrong, but I think shuffle in the Model class should definitely not be True by default for RNNs?

Also, if shuffle='batch', then I guess the statefulness should be False, since the next batch will most likely not be the chronologically next batch (not that it would make a huge difference)?

In any case, I suppose the statefulness problems can be mitigated if the batch size is large enough, 128 sounds a bit small as a default maybe...

BTW, why does the training set need to be shuffled at all for RNNs? IMO it would make sense to go through all of the data in chronological order, and maybe just pick random batch lengths to get an SGD-like stochasticity effect...
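Something along these lines (a tiny sketch of the idea; names are made up):

    import numpy as np

    # Walk the series in chronological order, cutting batches of random
    # length to get some SGD-like stochasticity without shuffling.
    def chronological_batches(series, min_len=20, max_len=100):
        i = 0
        while i < len(series):
            n = np.random.randint(min_len, max_len + 1)
            yield series[i:i + n]
            i += n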

harpone commented 9 years ago

@jramapuram no I don't think this has anything to do with RTRL (although I think that would be a cool training method).

wxs commented 9 years ago

@harpone As an example of shuffling with RNNs: I train RNNs on short snippets of text data like tweets, which means I don't need to split time across multiple training rows. So shuffling is quite meaningful there.

harpone commented 9 years ago

@wxs OK good point... I was thinking of general time series data.

lemuriandezapada commented 9 years ago

Well this is becoming quite advanced. When I first posted the issue it was not really about RTRL but more about doing stateful "predict" after the network was trained in a standard manner.

ssamot commented 9 years ago

@jramapuram This is correct, which basically means you need to save (|W| + |b|) * |h| variables before moving to the next batch.

@lemuriandezapada A stateful predict would be much easier, although it wouldn't cover the most interesting use case of training online (e.g., for sequence generation).

elanmart commented 9 years ago

@fchollet

jramapuram commented 9 years ago

So if someone can help me cross the finish line we can push in RTRL. I have some basic test stuff here but am getting Theano disconnected-input errors. Here is the main logic:

    def get_gradients(self, loss, params, activ):
        # dh/dTheta: jacobian of each layer's activation w.r.t. its parameters
        intermediary_grads = [theano.gradient.jacobian(a.flatten(), p)
                              for a, p in zip(activ, params)]
        # dE/dh_final * prod(dh/dTheta)
        grads = theano.grad(loss, activ) * T.prod(intermediary_grads)

        if hasattr(self, 'clipnorm') and self.clipnorm > 0:
            norm = T.sqrt(sum([T.sum(g ** 2) for g in grads]))
            grads = [clip_norm(g, self.clipnorm, norm) for g in grads]

        return grads

and I have modified Sequential to pass in updates like this:

        updates = self.optimizer.get_updates(self.params,
                                             self.constraints,
                                             [l.get_output(train=False) for l in self.layers],
                                             train_loss)

The error is:

theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: Elemwise{mul,no_inplace}.0

The offending line is:

grads = theano.grad(loss, activ) * T.prod(intermediary_grads)  # dE/dh_Final * prod(dh/dTheta)