Closed. elanmart closed this issue 9 years ago.
How would you suggest masks are implemented?
For now, you could simply group your samples into batches where all samples have the same length, or even simpler (but slower): use a batch size of 1 (and no zero-padding).
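For what it's worth, here is a minimal sketch of the grouping-by-length workaround (plain numpy; build_batches and the pad-within-batch choice are just illustrative, not Keras utilities):

import numpy as np

def build_batches(sequences, batch_size):
    # Sort by length so each batch contains sequences of (nearly) equal length,
    # then zero-pad only within each batch.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        maxlen = max(len(sequences[i]) for i in idx)
        batch = np.zeros((len(idx), maxlen), dtype='int32')
        for row, i in enumerate(idx):
            batch[row, :len(sequences[i])] = sequences[i]
        batches.append(batch)
    return batches

The downside, as noted below, is that batches can no longer be drawn fully at random from the whole dataset.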
Well, I've tried both of these methods before. A batch size of 1 is indeed too slow, and grouping samples by length is something I don't find too elegant. I'm also not sure whether it hurts performance, since the data can no longer be sampled fully at random.
I'm not really an expert when it comes to implementing stuff in Theano, but I think people from Lasagne have something like this:
https://github.com/craffel/nntools/blob/master/lasagne/layers/recurrent.py
I've worked a bit with masks for RNN, it can be implemented in many different ways. I think it can be quite useful.
If you're interested in the last output only, one easy way is to pass the mask to the step function so that it doesn't compute anything when the mask is 0 (state and output stay the same):

step([...], h_tm1, mask):
    [...]
    tmp_h_t = ...  # computation here
    h_t = (1 - mask) * h_tm1 + mask * tmp_h_t
The input to the whole model is [sequences, masks]. Could also be computed in theano
If interested in the whole output sequence, you also need to compute a masked loss which can be tricky
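To make the step-function idea above concrete, here is a minimal sketch using theano.scan; the shapes, the time-major layout, and the 0/1 mask convention are assumptions, not Keras code:

import numpy as np
import theano
import theano.tensor as T

n_in, n_hidden = 8, 16
W = theano.shared(np.random.randn(n_in, n_hidden).astype('float32'))
U = theano.shared(np.random.randn(n_hidden, n_hidden).astype('float32'))

X = T.tensor3('X')   # (timesteps, batch, n_in), time-major for scan
M = T.matrix('M')    # (timesteps, batch): 1. for real timesteps, 0. for padding

def step(x_t, m_t, h_tm1):
    # Candidate update, then keep the previous state wherever the mask is 0.
    tmp_h_t = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))
    m = m_t.dimshuffle(0, 'x')  # broadcast the mask over the hidden units
    return (1. - m) * h_tm1 + m * tmp_h_t

h0 = T.zeros((X.shape[1], n_hidden))
H, _ = theano.scan(step, sequences=[X, M], outputs_info=[h0])
last_h = H[-1]  # final state, unaffected by the padded timesteps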
The current layers can output either the last output or the entire sequence. We need the masking implementation to be compatible with both.
I wonder how much of a bad practice it would be not to keep a separate mask variable, and instead just stop the iteration when an all-0 input is found at a certain timestep. It would make things much easier. What do you guys think?
I thought about it, but it would only work with one example per batch, or with all examples in a batch having the same length, right?
If we're sure to always pad with zeros, and that no input is all-0 before the end of the sequence, that would be OK. You still need to carry on the computation for the longer inputs in the batch while keeping the results for the 'stopped' ones unchanged. That can be done in the step function.
If returning the whole output sequence is on, you get a batch of output sequences where some of the sequences are padded, which is not easy to deal with.
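For the whole-sequence case, the masked loss mentioned above could be sketched as follows (per_step_cost and the 0/1 mask convention are assumptions): average the per-timestep losses over the unmasked steps only.

import theano.tensor as T

per_step_cost = T.matrix('per_step_cost')  # (timesteps, batch) loss at each step
M = T.matrix('M')                          # matching 0/1 mask
masked_cost = T.sum(per_step_cost * M) / T.maximum(T.sum(M), 1.)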
I took a stab at this in #239. I'm still massaging it a bit, and it's just in the SimpleRNN for the moment, but I'd be interested to get your feedback.
My issue now is how best to get the mask input passed into the SimpleRNN (mine comes after an Embedding layer, so I need to use Merge to merge the mask back in; I'm working on that now). @fchollet, this would be an issue with what you describe about not keeping a separate mask variable, since after an embedding there would be no all-0 input.
I suppose another option would be to put a constraint on the Embedding so that it is not allowed to learn a representation for the "pad" value.
this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.
Correct, but how would masking work with Embedding layers in the case of a separate mask parameter?
It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs. After the embedding stage, just go over the input indices and when a zero is encountered, set the corresponding feature vector to 0.
This would be compatible with our text preprocessing utils, which assume that 0 is a non-character.
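A minimal sketch of that rectification, done on the Embedding output rather than on the weights (W_embed, the shapes, and the 0-as-pad convention are assumptions, not the actual Embedding code):

import numpy as np
import theano
import theano.tensor as T

vocab_size, dim = 1000, 64
W_embed = theano.shared(np.random.randn(vocab_size, dim).astype('float32'))

x = T.imatrix('x')                        # (batch, timesteps) indices, 0 = non-character
embedded = W_embed[x]                     # (batch, timesteps, dim)
# Zero out the feature vector wherever the index is 0, leaving W_embed itself untouched.
keep = T.neq(x, 0).dimshuffle(0, 1, 'x')  # (batch, timesteps, 1)
embedded = embedded * keep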
how would masking work with Embedding layers in the case of a separate mask parameter?
I was thinking of either modifying Embedding to optionally pass-through a mask (following the convention that masks are always concatenated along the time dimension), or else using a Merge to concatenate the embedding with the mask.
It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs
Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a vector ever "happens" to hit all-0, its properties suddenly change.
Perhaps safer to use, e.g. NaN or -Inf but I don't know how those interact with the GPU.
Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?
Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?
Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.
Regarding the Embedding layer, the fix could be done by adding one line:
self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
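A sketch of what that check could look like inside a recurrent step function (the names, the tanh update, and the scan wiring are placeholders, not actual Keras code):

import theano.tensor as T

def step(x_t, h_tm1, W, U):
    # 1 where at least one feature of x_t is non-zero, 0 for an all-0 (padded) timestep
    active = T.any(T.neq(x_t, 0.), axis=1, keepdims=True)
    h_new = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))
    # Return a 0 output for padded timesteps (returning h_tm1 instead, as discussed
    # just below, is a one-line change).
    return T.switch(active, h_new, T.zeros_like(h_new))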
Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a vector ever "happens" to hit all-0, its properties suddenly change.
I think that's statistically impossible because every value in the feature vector would need to reach exactly zero, starting from a random initialization. Even if all-zero happened to be an optimum in the context of some task, the learned value could end up epsilon-close to all-zero but likely never all-zero.
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case
I think you'd prefer to return h_tm1 here, since in your examples and utilities you post-pad shorter sequences with 0 (or change the examples to pre-pad, I suppose; I think pre-padding makes a bit more sense anyway).
I guess this is a bit easier to understand than concatenating the mask to the input, but it's potentially more prone to "accidental" bugs, where the user passes in some zero-data without understanding this effect and gets strange behaviour. If this becomes the standard behaviour in all layer types, what if I'm doing a CNN on an image and I happen to have a patch of black pixels?
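(On the pre- vs post-padding point: the pad_sequences utility takes a padding argument, so switching the examples over should be a one-word change, to the best of my knowledge of its signature:)

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6]]
pre  = pad_sequences(seqs, maxlen=5, padding='pre')   # zeros before the data (default)
post = pad_sequences(seqs, maxlen=5, padding='post')  # zeros after the data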
what if I'm doing a CNN on an image and I happen to have a patch of black pixels?
Typically you'll first run your input through a conv2D layer, then run a sequence of output vectors through a recurrent layer. Again, it will be statistically impossible for the processed vectors to be all zero.
I agree that the behavior seems "dirty"; however, as long as the behavior is clearly documented we should be fine. And accidental bugs would be so improbable as to be practically impossible.
The main argument for this setup is that it introduces no architecture issues (the nature and shape of the data being passed around is unchanged) and it is very easy to implement / simple to understand.
I think pre-padding makes a bit more sense anyway
Agreed on that.
If you pre-pad, you could even mask only the 0s at the beginning: once a non-zero entry appears in the sequence, every following entry is considered and computed, even all-0 ones. I think that's the cleanest!
You can compute this mask in the layer computation and pass it to the step; when I get a bit of time I'll try and write that.
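One way to compute that mask inside the layer, assuming pre-padding with all-0 timesteps and a time-major layout (just a sketch of the idea, not Keras code):

import theano.tensor as T

X = T.tensor3('X')                     # (timesteps, batch, features), pre-padded with zeros
nonzero = T.any(T.neq(X, 0.), axis=2)  # (timesteps, batch): 1 where the timestep has data
# A timestep is unmasked as soon as any earlier (or the current) timestep was non-zero.
mask = T.gt(T.cumsum(nonzero, axis=0), 0)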
I think that's statistically impossible because every value in the feature vector would need to reach exactly zero
Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)
We'll definitely switch to pre-padding (it's a trivial change).
Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)
That's right. I think a good solution would be to make the mask value configurable in the Embedding layer and the recurrent layers, much like what XGBoost does. The default could be, for instance, -999.0.
model.add(Embedding(indim, outdim, mask_value=-1.)) # replaces index 0 with all-(-1.) vectors
model.add(SimpleRNN(outdim, outdim, mask_value=-1.)) # skips all-(-1.) vectors
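On the Embedding side, that replacement could be sketched like this (W_embed and the shapes are assumptions; only the pad index 0 gets rewritten to the mask value):

import numpy as np
import theano
import theano.tensor as T

mask_value = -1.
W_embed = theano.shared(np.random.randn(1000, 64).astype('float32'))

x = T.imatrix('x')                         # (batch, timesteps) indices, 0 = pad
embedded = W_embed[x]                      # (batch, timesteps, dim)
is_pad = T.eq(x, 0).dimshuffle(0, 1, 'x')  # 1 where the index is the pad index
embedded = T.switch(is_pad, mask_value, embedded)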
OK @fchollet, sounds like you're pretty set on the mask_value approach, which seems fine; you're right that it will be simpler to implement everywhere. It feels slightly "wrong" to me, but that's just aesthetics.
I'm happy to implement this, but let me know if you're doing it so we don't dupe work.
Is it confusing that the Embedding input expects 0 as a pad, if everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how a pad is represented on the output.
Is it confusing that the Embedding input expects 0 as a pad, if everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how a pad is represented on the output.
The reason for the discrepancy is that the input of an Embedding is a tensor of indices, which are positive integers. The default convention for the non-character index is 0.
The rest of the network uses an arbitrary mask value (float).
OK I've put up a preliminary implementation at #244, would love some review before I dive in to getting more of the recurrent types supported.
Btw, it looks like Bricks takes the approach I did initially, of having a separate channel over which the mask is sent.
The PR implementing masks has now been merged, for those of you watching this issue.
Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors. Regarding the Embedding layer, the fix could be done by adding one line:

self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
Has this been implemented? I looked in the source code but couldn't find it.
Are you referring to the recurrent pass-through, @mbchang? If so, check the backend code; e.g. in the Theano backend, there is a switch over 0 for the next hidden state.
@mbchang in general after this discussion Keras ended up moving to a separate explicitly sent mask after all, rather than a special masking value.
Embeddings take a mask_zero boolean parameter, which can generate that mask automatically wherever there's a 0 in the input.
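For reference, the usage looks roughly like this (mask_zero is the real Embedding argument; the layer sizes are arbitrary):

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
# mask_zero=True makes the Embedding emit a mask so that downstream
# recurrent layers skip the 0-padded timesteps.
model.add(Embedding(input_dim=10000, output_dim=64, mask_zero=True))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))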
Hey,
I think it would be cool if we could specify when the recurrent network should stop updating its hidden state. For example, if my sequences have a max length of 100 and a particular example has a length of only 10, the network will update its hidden state 90 times before returning the final vector, which is not necessarily desirable.