Closed. elanmart closed this issue 9 years ago.
How would you suggest masks are implemented?
For now, you could simply group your samples into batches where all samples have the same length, or even simpler (but slower): use a batch size of 1 (and no zero-padding).
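For what it's worth, here is a minimal sketch of the grouping-by-length workaround (plain numpy; build_batches and the pad-within-batch choice are just illustrative, not Keras utilities):

import numpy as np

def build_batches(sequences, batch_size):
    # Sort by length so each batch contains sequences of (nearly) equal length,
    # then zero-pad only within each batch.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        maxlen = max(len(sequences[i]) for i in idx)
        batch = np.zeros((len(idx), maxlen), dtype='int32')
        for row, i in enumerate(idx):
            batch[row, :len(sequences[i])] = sequences[i]
        batches.append(batch)
    return batches

The downside, as noted below, is that batches can no longer be drawn fully at random from the whole dataset.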
Well, I've tried both of these methods before. A batch size of 1 is indeed too slow, and grouping samples by length is something I don't find too elegant. I'm also not sure whether it hurts performance, since the data can no longer be sampled fully at random.
I'm not really an expert when it comes to implementing stuff in Theano, but I think people from Lasagne have something like this:
https://github.com/craffel/nntools/blob/master/lasagne/layers/recurrent.py
I've worked a bit with masks for RNN, it can be implemented in many different ways. I think it can be quite useful.
If you're interested in the last output only, one easy way is to pass the mask to the step function so that it doesn't compute anything when the mask is 0 (state and output stay the same):

step([...], h_tm1, mask):
    [...]
    tmp_h_t = ...  # computation here
    h_t = (1 - mask) * h_tm1 + mask * tmp_h_t
The input to the whole model is [sequences, masks]. Could also be computed in theano
If interested in the whole output sequence, you also need to compute a masked loss which can be tricky
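To make the step-function idea above concrete, here is a minimal sketch using theano.scan; the shapes, the time-major layout, and the 0/1 mask convention are assumptions, not Keras code:

import numpy as np
import theano
import theano.tensor as T

n_in, n_hidden = 8, 16
W = theano.shared(np.random.randn(n_in, n_hidden).astype('float32'))
U = theano.shared(np.random.randn(n_hidden, n_hidden).astype('float32'))

X = T.tensor3('X')   # (timesteps, batch, n_in), time-major for scan
M = T.matrix('M')    # (timesteps, batch): 1. for real timesteps, 0. for padding

def step(x_t, m_t, h_tm1):
    # Candidate update, then keep the previous state wherever the mask is 0.
    tmp_h_t = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))
    m = m_t.dimshuffle(0, 'x')  # broadcast the mask over the hidden units
    return (1. - m) * h_tm1 + m * tmp_h_t

h0 = T.zeros((X.shape[1], n_hidden))
H, _ = theano.scan(step, sequences=[X, M], outputs_info=[h0])
last_h = H[-1]  # final state, unaffected by the padded timesteps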
The current layers can output either the last output or the entire sequence. We need the masking implementation to be compatible with both.
I wonder how much of a bad practice it would be not to keep a separate mask variable, and instead just stop the iteration when an all-0 input is found at a certain timestep. It would make things much easier. What do you guys think?
I thought about it, but it would only work with one example per batch, or with all examples in a batch having the same length, right?
If we're sure to always pad with zeros, and that no input is all-0 before the end of the sequence, that would be OK. You still need to carry on the computation for the longer inputs in the batch while keeping the results for the 'stopped' ones unchanged. That can be done in the step function.
If returning the whole output sequence is on, you get a batch of output sequences where some of the sequences are padded, which is not easy to deal with.
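For the whole-sequence case, the masked loss mentioned above could be sketched as follows (per_step_cost and the 0/1 mask convention are assumptions): average the per-timestep losses over the unmasked steps only.

import theano.tensor as T

per_step_cost = T.matrix('per_step_cost')  # (timesteps, batch) loss at each step
M = T.matrix('M')                          # matching 0/1 mask
masked_cost = T.sum(per_step_cost * M) / T.maximum(T.sum(M), 1.)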
I took a stab at this in #239. I'm still massaging it a bit, and it's just in the SimpleRNN for the moment, but I'd be interested to get your feedback.
My issue now is how best to get the mask input passed into the SimpleRNN (mine comes after an Embedding layer, so I need to use Merge to merge the mask back in; I'm working on that now). @fchollet, this would be an issue with what you describe about not keeping a separate mask variable, since after an embedding there would be no all-0 input.
I suppose another option would be to put a constraint on the Embedding so that it is not allowed to learn a representation for the "pad" value.
this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.
Correct, but how would masking work with Embedding layers in the case of a separate mask parameter?
It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs. After the embedding stage, just go over the input indices and when a zero is encountered, set the corresponding feature vector to 0.
This would be compatible with our text preprocessing utils, which assume that 0 is a non-character.
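A minimal sketch of that rectification, done on the Embedding output rather than on the weights (W_embed, the shapes, and the 0-as-pad convention are assumptions, not the actual Embedding code):

import numpy as np
import theano
import theano.tensor as T

vocab_size, dim = 1000, 64
W_embed = theano.shared(np.random.randn(vocab_size, dim).astype('float32'))

x = T.imatrix('x')                        # (batch, timesteps) indices, 0 = non-character
embedded = W_embed[x]                     # (batch, timesteps, dim)
# Zero out the feature vector wherever the index is 0, leaving W_embed itself untouched.
keep = T.neq(x, 0).dimshuffle(0, 1, 'x')  # (batch, timesteps, 1)
embedded = embedded * keep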
how would masking work with Embedding layers in the case of a separate mask parameter?
I was thinking of either modifying Embedding to optionally pass-through a mask (following the convention that masks are always concatenated along the time dimension), or else using a Merge to concatenate the embedding with the mask.
It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs
Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a vector ever "happens" to hit all-0, its properties suddenly change.
Perhaps safer to use, e.g. NaN or -Inf but I don't know how those interact with the GPU.
Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?
Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?
Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.
Regarding the Embedding layer, the fix could be done by adding one line:
self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
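A sketch of what that check could look like inside a recurrent step function (the names, the tanh update, and the scan wiring are placeholders, not actual Keras code):

import theano.tensor as T

def step(x_t, h_tm1, W, U):
    # 1 where at least one feature of x_t is non-zero, 0 for an all-0 (padded) timestep
    active = T.any(T.neq(x_t, 0.), axis=1, keepdims=True)
    h_new = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))
    # Return a 0 output for padded timesteps (returning h_tm1 instead, as discussed
    # just below, is a one-line change).
    return T.switch(active, h_new, T.zeros_like(h_new))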
Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a vector ever "happens" to hit all-0, its properties suddenly change.
I think that's statistically impossible because every value in the feature vector would need to reach exactly zero, starting from a random initialization. Even if all-zero happened to be an optimum in the context of some task, the learned value could end up epsilon-close to all-zero but likely never all-zero.
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case
I think you'd prefer to return h_tm1 here, since in your examples and utilities you post-pad shorter sequences with 0 (or change the examples to pre-pad, I suppose; I think pre-padding makes a bit more sense anyway).
I guess this is a bit easier to understand than concatenating the mask to the input, but it's potentially more prone to "accidental" bugs, where the user passes in some zero-data without understanding this effect and gets strange behaviour. If this becomes the standard behaviour in all layer types, what if I'm doing a CNN on an image and I happen to have a patch of black pixels?
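(On the pre- vs post-padding point: the pad_sequences utility takes a padding argument, so switching the examples over should be a one-word change, to the best of my knowledge of its signature:)

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6]]
pre  = pad_sequences(seqs, maxlen=5, padding='pre')   # zeros before the data (default)
post = pad_sequences(seqs, maxlen=5, padding='post')  # zeros after the data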
what if I'm doing a CNN on an image and I happen to have a patch of black pixels?
Typically you'll first run your input through a conv2D layer, then run a sequence of output vectors through a recurrent layer. Again, it will be statistically impossible for the processed vectors to be all zero.
I agree that the behavior seems "dirty"; however, as long as the behavior is clearly documented we should be fine. And accidental bugs would be so improbable as to be practically impossible.
The main argument for this setup is that it introduces no architecture issues (the nature and shape of the data being passed around is unchanged) and it is very easy to implement / simple to understand.
I think pre-padding makes a bit more sense anyway
Agreed on that.
If you pre-pad, you could even mask only the 0s at the beginning: once a non-zero entry appears in the sequence, every following entry is considered and computed, even all-0 ones. I think that's the cleanest!
You can compute this mask in the layer computation and pass it to the step; when I get a bit of time I'll try and write that.
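One way to compute that mask inside the layer, assuming pre-padding with all-0 timesteps and a time-major layout (just a sketch of the idea, not Keras code):

import theano.tensor as T

X = T.tensor3('X')                     # (timesteps, batch, features), pre-padded with zeros
nonzero = T.any(T.neq(X, 0.), axis=2)  # (timesteps, batch): 1 where the timestep has data
# A timestep is unmasked as soon as any earlier (or the current) timestep was non-zero.
mask = T.gt(T.cumsum(nonzero, axis=0), 0)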
I think that's statistically impossible because every value in the feature vector would need to reach exactly zero
Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)
We'll definitely switch to pre-padding (it's a trivial change).
Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)
That's right. I think a good solution would be to make the mask value configurable in the Embedding layer and the recurrent layers, much like what XGBoost does. The default could be, for instance, -999.0.
model.add(Embedding(indim, outdim, mask_value=-1.)) # replaces index 0 with all-(-1.) vectors
model.add(SimpleRNN(outdim, outdim, mask_value=-1.)) # skips all-(-1.) vectors
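On the Embedding side, that replacement could be sketched like this (W_embed and the shapes are assumptions; only the pad index 0 gets rewritten to the mask value):

import numpy as np
import theano
import theano.tensor as T

mask_value = -1.
W_embed = theano.shared(np.random.randn(1000, 64).astype('float32'))

x = T.imatrix('x')                         # (batch, timesteps) indices, 0 = pad
embedded = W_embed[x]                      # (batch, timesteps, dim)
is_pad = T.eq(x, 0).dimshuffle(0, 1, 'x')  # 1 where the index is the pad index
embedded = T.switch(is_pad, mask_value, embedded)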
OK @fchollet, sounds like you're pretty set on the mask_value approach, which seems fine; you're right that it will be simpler to implement everywhere. It feels slightly "wrong" to me, but that's just aesthetics.
I'm happy to implement this, but let me know if you're doing it so we don't dupe work.
Is it confusing that the Embedding input expects 0 as a pad, if everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how a pad is represented on the output.
Is it confusing that the Embedding input expects 0 as a pad, if everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how a pad is represented on the output.
The reason for the discrepancy is that the input of an Embedding is a tensor of indices, which are positive integers. The default convention for the non-character index is 0.
The rest of the network uses an arbitrary mask value (float).
OK I've put up a preliminary implementation at #244, would love some review before I dive in to getting more of the recurrent types supported.
Btw, it looks like Bricks takes the approach I did initially, of having a separate channel over which the mask is sent.
The PR implementing masks has now been merged, for those of you watching this issue.
Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors. Regarding the Embedding layer, the fix could be done by adding one line:

self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
Has this been implemented? I looked in the source code but couldn't find it.
Are you referring to the recurrent pass-through, @mbchang? If so, check the backend code; e.g. in the Theano backend, there is a switch over 0 for the next hidden state.
@mbchang in general after this discussion Keras ended up moving to a separate explicitly sent mask after all, rather than a special masking value.
Embeddings take a mask_zero boolean parameter, which can generate that mask automatically wherever there's a 0 in the input.
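For reference, the usage looks roughly like this (mask_zero is the real Embedding argument; the layer sizes are arbitrary):

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
# mask_zero=True makes the Embedding emit a mask so that downstream
# recurrent layers skip the 0-padded timesteps.
model.add(Embedding(input_dim=10000, output_dim=64, mask_zero=True))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))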
Hey,
I think it would be cool if we could specify when the recurrent network should stop updating its hidden state. For example, if my sequences have a max length of 100 and a particular example has a length of only 10, the network will update its hidden state 90 times before returning the final vector, which is not necessarily desirable.