keras-team / keras

Deep Learning for humans
http://keras.io/

Why is the first padding vector in the embedding layer updated during training? #5392

Closed lxafly closed 7 years ago

lxafly commented 7 years ago

Hello!

I have some confusion about the weight updates for the embedding layer, and I'm wondering if someone could shed some light. (It could be a mechanism that I don't understand, or something missing in my setup.)

My original code is lengthy, but here is a summary of the major setup.
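Roughly, it looks like this (a simplified sketch; names like my_embed are mine, the real model is larger, and the exact numbers differ):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    vocab_size, embed_dim, maxlen = 5000, 100, 40

    # Pretrained embedding matrix; row 0 is all zeros, reserved for padding.
    my_embed = np.random.rand(vocab_size, embed_dim).astype('float32')
    my_embed[0] = 0.0

    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, input_length=maxlen,
                        weights=[my_embed], mask_zero=True))
    model.add(LSTM(64))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')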

Say the training is done and I have loaded the saved model. I get an embedding matrix, call it my_adapted_embed, which should be different from my_embed due to backpropagation. What I expected is that my_adapted_embed[0] would stay intact as all zeros, because it corresponds to the 0-th padding index, where masking should have kept it from updating. But what I found is that my_adapted_embed[0] actually changed. Could someone shed some light here: what is it that I don't understand about masking?
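Concretely, I check the padding row like this (sketch, using the model above):

    my_adapted_embed = model.layers[0].get_weights()[0]
    print(my_adapted_embed[0])  # expected all zeros, but the values have changed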

Thanks a lot for your help!

unrealwill commented 7 years ago

Hello,

Nice catch.

I'm not an expert on masking at all, but here is what is probably going on. Most layer implementations do not use the mask variable inside the layer's call, probably because it would be quite expensive to do so (i.e. doing x = x * mask as the first operation of the layer would work, but is expensive) for very marginal gain. So we end up using masked values when we do things like temporal pooling, and the gradients end up modifying masked inputs.
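Here is a toy numpy illustration of the difference (not Keras code, just to show how masked timesteps leak into something like temporal pooling):

    import numpy as np

    x = np.random.rand(2, 4, 3)            # (batch, timesteps, features)
    mask = np.array([[1, 1, 0, 0],         # sample 1: real length 2
                     [1, 1, 1, 1]],        # sample 2: real length 4
                    dtype='float32')

    # What pooling effectively does: every timestep contributes,
    # including the masked ones, so gradients reach masked inputs.
    pooled_naive = x.mean(axis=1)

    # What "using the mask" would look like: zero out masked timesteps
    # and average only over the unmasked ones.
    pooled_masked = ((x * mask[:, :, None]).sum(axis=1)
                     / mask.sum(axis=1, keepdims=True))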

In addition, most layers don't recompute the mask (by overriding compute_mask) with the appropriate border_mode (which would keep track of mask-tainted values); they just pass the previous mask on to the next layer. It's not quite clear that using the same border_mode would even be the desired behaviour: take, for example, one very short sequence, which, if we recompute the mask with a border_mode, will probably get completely masked, making it impossible to learn anything from short sequences (see the sketch below).
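A toy sketch of that short-sequence problem, where a pooled window stays unmasked only if all of its input timesteps are unmasked:

    import numpy as np

    mask = np.array([1, 1, 0, 0, 0, 0], dtype=bool)  # length-2 sequence, padded to 6
    pool_size = 3

    # Recompute the mask over non-overlapping pooling windows: a window
    # is unmasked only if every input timestep in it is unmasked.
    pooled_mask = np.array([mask[i:i + pool_size].all()
                            for i in range(0, len(mask), pool_size)])
    print(pooled_mask)  # [False False] -- the whole sequence is now masked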

    # In _Pooling1D(Layer) we don't use the mask variable inside call:

    def call(self, x, mask=None):
        x = K.expand_dims(x, 2)   # add dummy last dimension
        output = self._pooling_function(inputs=x, pool_size=self.pool_size,
                                        strides=self.st,
                                        border_mode=self.border_mode,
                                        dim_ordering='tf')
        return K.squeeze(output, 2)  # remove dummy last dimension

And this is why your value gets modified, and you probably shouldn't worry :)

lxafly commented 7 years ago

Okay I see. Thanks a lot for your quick response!