keras-team / keras

Deep Learning for humans
http://keras.io/

Why is the first padding vector in the embedding layer updated during training? #5392

Closed lxafly closed 7 years ago

lxafly commented 7 years ago

Hello!

I have some confusion about the weight updates for the embedding layer, and I'm wondering if someone could shed some light. (It could be a mechanism that I don't understand, or something missing in my setup.)

My original code is lengthy, but here is a summary of the major setup.
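Roughly, it looks like this (a simplified sketch; names like my_embed are mine, the real model is larger, and the exact numbers differ):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    vocab_size, embed_dim, maxlen = 5000, 100, 40

    # Pretrained embedding matrix; row 0 is all zeros, reserved for padding.
    my_embed = np.random.rand(vocab_size, embed_dim).astype('float32')
    my_embed[0] = 0.0

    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, input_length=maxlen,
                        weights=[my_embed], mask_zero=True))
    model.add(LSTM(64))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')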

Say the training is done and I have loaded the saved model. I get an embedding matrix, call it my_adapted_embed, which should be different from my_embed due to backpropagation. What I expected is that my_adapted_embed[0] would stay intact as all zeros, because it corresponds to the 0-th padding index, where masking should have kept it from updating. But what I found is that my_adapted_embed[0] actually changed. Could someone shed some light here: what is it that I don't understand about masking?
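Concretely, I check the padding row like this (sketch, using the model above):

    my_adapted_embed = model.layers[0].get_weights()[0]
    print(my_adapted_embed[0])  # expected all zeros, but the values have changed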

Thanks a lot for your help!

unrealwill commented 7 years ago

Hello,

Nice catch.

I'm not an expert on masking at all, but here is what is probably going on. Most layer implementations do not use the mask variable inside the layer's call, probably because it would be quite expensive to do so (i.e. doing x = x * mask as the first operation of the layer would work, but is expensive) for very marginal gain. So we end up using masked values when we do things like temporal pooling, and the gradients end up modifying masked inputs.
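Here is a toy numpy illustration of the difference (not Keras code, just to show how masked timesteps leak into something like temporal pooling):

    import numpy as np

    x = np.random.rand(2, 4, 3)            # (batch, timesteps, features)
    mask = np.array([[1, 1, 0, 0],         # sample 1: real length 2
                     [1, 1, 1, 1]],        # sample 2: real length 4
                    dtype='float32')

    # What pooling effectively does: every timestep contributes,
    # including the masked ones, so gradients reach masked inputs.
    pooled_naive = x.mean(axis=1)

    # What "using the mask" would look like: zero out masked timesteps
    # and average only over the unmasked ones.
    pooled_masked = ((x * mask[:, :, None]).sum(axis=1)
                     / mask.sum(axis=1, keepdims=True))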

In addition, most layers don't recompute the mask (by overriding compute_mask) with the appropriate border_mode (which would keep track of mask-tainted values); they just pass the previous mask on to the next layer. It's not quite clear that using the same border_mode would even be the desired behaviour: take, for example, one very short sequence, which, if we recompute the mask with a border_mode, will probably get completely masked, making it impossible to learn anything from short sequences (see the sketch below).
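A toy sketch of that short-sequence problem, where a pooled window stays unmasked only if all of its input timesteps are unmasked:

    import numpy as np

    mask = np.array([1, 1, 0, 0, 0, 0], dtype=bool)  # length-2 sequence, padded to 6
    pool_size = 3

    # Recompute the mask over non-overlapping pooling windows: a window
    # is unmasked only if every input timestep in it is unmasked.
    pooled_mask = np.array([mask[i:i + pool_size].all()
                            for i in range(0, len(mask), pool_size)])
    print(pooled_mask)  # [False False] -- the whole sequence is now masked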

    # In _Pooling1D(Layer) we don't use the mask variable inside call:

    def call(self, x, mask=None):
        x = K.expand_dims(x, 2)   # add dummy last dimension
        output = self._pooling_function(inputs=x, pool_size=self.pool_size,
                                        strides=self.st,
                                        border_mode=self.border_mode,
                                        dim_ordering='tf')
        return K.squeeze(output, 2)  # remove dummy last dimension

And this is why your value gets modified, and you probably shouldn't worry :)

lxafly commented 7 years ago

Okay I see. Thanks a lot for your quick response!