keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Masking zeros not supported in some layers #2728

Closed kavehtp closed 8 years ago

kavehtp commented 8 years ago

Hi,

I am trying to implement a model over zero-padded sequences. The problem is that when I use mask_zero=True, some layers do not support it. For example, in the following code, the Dense layer throws an error saying it does not support masking:

import keras.backend as K
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Lambda, Dense, Activation

# Mean over time implementation
def MeanOverTime():
    layer = Lambda(lambda x: K.mean(x, axis=1), output_shape=lambda s: (s[0], s[2]))
    return layer

model = Sequential()
model.add(Embedding(vocab_size, emb_dim, mask_zero=True))
model.add(LSTM(lstm_dim, return_sequences=True))
model.add(MeanOverTime())
model.add(Dense(10))
model.add(Activation('softmax'))

Is there an easy way to fix this? Thanks.

Kaveh

braingineer commented 8 years ago

Do you even need a mask past the LSTM? You are squashing over time as it is.

By the way, you aren't actually using your mask when you take the mean, so the zero-padded items aren't being ignored.

My suggestion is to use a custom layer to implement MeanOverTime. I have this one which may work: https://github.com/braingineer/ikelos/blob/master/ikelos/layers/utility.py#L29

Basically, it lets you pass in a function to modify the mask rather than the layer. It's quick and dirty and a bit hacky, but it works in a pinch.

However, thinking about your problem, you can't just use this as-is. I would take the LambdaMask layer and modify it so that in call it outputs:

if mask is not None:
    if K.ndim(mask) == K.ndim(x) - 1:
        mask = K.expand_dims(mask)
    x *= mask
return x

This way, your mask is given a broadcastable dimension, so it can multiply across your feature dimension, and it correctly zeros out the time values you don't care about.

In the compute_mask portion, you will want to just return None. This gets rid of the mask in the pipeline, so your Dense layer can do its job without having to worry about it.
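
Putting the two pieces together, a minimal sketch of such a layer could look like this (the class name ApplyMaskAndDrop and the details are illustrative, written against the Keras 1-style custom-layer API used elsewhere in this thread):

import keras.backend as K
from keras.engine.topology import Layer

class ApplyMaskAndDrop(Layer):
    """Zeros out masked timesteps, then stops the mask from propagating."""
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(ApplyMaskAndDrop, self).__init__(**kwargs)

    def call(self, x, mask=None):
        if mask is not None:
            mask = K.cast(mask, K.floatx())
            if K.ndim(mask) == K.ndim(x) - 1:
                mask = K.expand_dims(mask)  # give the mask a broadcastable feature dimension
            x *= mask                       # zero out the timesteps you don't care about
        return x

    def compute_mask(self, x, mask=None):
        return None  # swallow the mask so downstream layers like Dense are unaffected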

kavehtp commented 8 years ago

Yeah, I don't need the mask after the MeanOverTime layer; I just did not know how to remove it. I implemented this really, really ugly layer instead (it works, though):

def MeanOverTime():
    mean_func = lambda x: K.cast((x.sum(axis=1) / (x.shape[1] - K.equal(x, 0).all(axis=2).sum(axis=1, keepdims=True))), K.floatx())
    layer = Lambda(mean_func, output_shape=lambda s: (s[0], s[2]))
    layer.supports_masking = True
    def compute_mask(input, mask):
        return None
    layer.compute_mask = compute_mask
    return layer

I also did not know how to access the mask inside this function; that's why I am using the K.equal(...).all(...) hack. I am going to fix it now. Thanks!

mupavan commented 8 years ago

Issue #1579 seems to have already tackled your problem in an elegant way.

You could use the following function for mean over time.

def lambda_mask_average(x, mask=None):
    return K.batch_dot(x, mask, axes=1) / K.sum(mask, axis=-1, keepdims=True)

main_input = Input(shape=(input_length,), dtype='int32')
m = Embedding(vocab_size+1, emb_size, input_length=input_length, mask_zero=True)(main_input)
m = LSTM(lstm_dim, return_sequences=True)(m)
m = MaskEatingLambda(lambda_mask_average, output_shape=(lstm_dim,))(m)
# no more mask layer
# insert whatever other layers you want here
model = Model(input=main_input, output=m)
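
For anyone who doesn't have #1579 open: MaskEatingLambda is defined over there, not here. Roughly, it is a Lambda that hands the incoming mask to the wrapped function and then drops the mask. A rough sketch of the idea (the real implementation in #1579 may differ):

from keras.layers import Lambda

class MaskEatingLambda(Lambda):
    """Lambda that passes the incoming mask to the wrapped function
    and does not propagate the mask any further."""
    def __init__(self, function, **kwargs):
        super(MaskEatingLambda, self).__init__(function, **kwargs)
        self.supports_masking = True

    def call(self, x, mask=None):
        # the wrapped function must accept a `mask` keyword, as lambda_mask_average does
        return self.function(x, mask=mask)

    def compute_mask(self, x, mask=None):
        return None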

kavehtp commented 8 years ago

Thanks @mpavankumarreddy. The problem is solved. I am closing the issue.

cdicle commented 8 years ago

Hey @kavehtp,

I am using the same architecture from @sergeyf for a toy problem. I added a Dense layer after the averaging layer, like this:

def lambda_mask_average(x,mask=None):
    return K.batch_dot(x,mask,axes=1) / K.sum(mask, axis=-1, keepdims=True)

def lambda_mask_sum(x,mask=None):
    return K.batch_dot(x,mask,axes=1)

main_input = Input(shape=(maxlen,), dtype='int32')
x = Embedding(max_features+1, embed_dim, input_length=maxlen, dropout=0.2, mask_zero=True)(main_input)
x = LSTM(lstm_dim, dropout_W=0.2, dropout_U=0.2, return_sequences=True)(x)
x = MaskEatingLambda(lambda_mask_average,output_shape=(lstm_dim,))(x)
pred = Dense(1,activation='sigmoid')(x)

However, the Dense layer gives an error because its input is upcast to float64 by the MaskEatingLambda layer, while Dense expects float32. On the other hand, that problem does not occur if I use the lambda_mask_sum function.

Have you come across a similar problem in your implementation? Can you suggest a fix?

Thanks

sergeyf commented 8 years ago

You'll probably have to use K.cast on the sum over mask.

cdicle commented 8 years ago

Hey @sergeyf, thanks for the quick reply.

I tried

return K.batch_dot(x,mask,axes=1) / K.cast_to_floatx(K.sum(mask, axis=-1, keepdims=True))

and got the error

ValueError: setting an array element with a sequence.

I could not fix that one, mainly due to my lack of Python knowledge, but I believe you guys can help me.

sergeyf commented 8 years ago

I use it like this: K.cast(x,'float32').
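
For example, applied to the averaging function above (just a sketch; the point is to keep the mask in float32 before the division):

import keras.backend as K

def lambda_mask_average(x, mask=None):
    mask = K.cast(mask, 'float32')  # cast so the sum/division stays in float32
    return K.batch_dot(x, mask, axes=1) / K.sum(mask, axis=-1, keepdims=True)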

cdicle commented 8 years ago

That solved the problem. Thanks man!

RaffEdwardBAH commented 8 years ago

Has anyone checked whether these work with the TensorFlow backend? Trying the code by @mpavankumarreddy, I get this error:

/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.pyc in batch_dot(x, y, axes)
    247         adj_x = None
    248         adj_y = None
--> 249     out = tf.batch_matmul(x, y, adj_x=adj_x, adj_y=adj_y)
    250     if ndim(out) == 1:
    251         out = expand_dims(out, 1)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.pyc in _batch_mat_mul(x, y, adj_x, adj_y, name)
    387   """
    388   result = _op_def_lib.apply_op("BatchMatMul", x=x, y=y, adj_x=adj_x,
--> 389                                 adj_y=adj_y, name=name)
    390   return result
    391 

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.pyc in apply_op(self, op_type_name, name, **keywords)
    702           op = g.create_op(op_type_name, inputs, output_types, name=scope,
    703                            input_types=input_types, attrs=attr_protos,
--> 704                            op_def=op_def)
    705           outputs = op.outputs
    706           return _Restructure(ops.convert_n_to_tensor(outputs),

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in create_op(self, op_type, inputs, dtypes, input_types, name, attrs, op_def, compute_shapes, compute_device)
   2260                     original_op=self._default_original_op, op_def=op_def)
   2261     if compute_shapes:
-> 2262       set_shapes_for_outputs(ret)
   2263     self._add_op(ret)
   2264     self._record_op_seen_by_control_dependencies(ret)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in set_shapes_for_outputs(op)
   1700       raise RuntimeError("No shape function registered for standard op: %s"
   1701                          % op.type)
-> 1702   shapes = shape_func(op)
   1703   if shapes is None:
   1704     raise RuntimeError(

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.pyc in _BatchMatMulShape(op)
   1383   if a_shape.dims is None and b_shape.dims is None:
   1384     return [tensor_shape.unknown_shape()]
-> 1385   batch_dims = a_shape[:-2].merge_with(b_shape[:-2])
   1386   output_rows = a_shape[-1] if adj_a else a_shape[-2]
   1387   output_cols = b_shape[-2] if adj_b else b_shape[-1]

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.pyc in merge_with(self, other)
    568       except ValueError:
    569         raise ValueError("Shapes %s and %s are not compatible" %
--> 570                          (self, other))
    571 
    572   def concatenate(self, other):

ValueError: Shapes (?,) and () are not compatible

McVilla commented 8 years ago

Hi @braingineer, I'm new to Keras and I want to process sentences with different numbers of words in a CNN. I used zero-padding, but the layers after the Embedding layer don't support masking. Is there any way to solve this?

inputs = Input(shape=(1, max_len), dtype='int32')
x = Embedding(vocab_size, dim, weights=GloVe, input_length=max_len)(inputs)
x = Reshape((1, max_len, 50))(x)
x = Convolution2D(nb_filter, n_gram, dim, init='glorot_uniform', activation='linear', border_mode='valid', subsample=(1,1))(x)
x = MaxPooling2D(pool_size=(2,1))(x1)
x = Flatten()(x)
out = Dense(10)(x)

Conv_sen= Model(inputs,out)

By the way, does keras support global pooling? Thanks.

sergeyf commented 8 years ago

For anyone who stumbles onto this post looking to deal with Embeddings, zeros, and masks, the following works in both Theano and TF.

My solution to this problem is as follows:

(1) Make a custom ZeroMaskedEntries layer that (a) zeros out all of the masked-out embedding rows and (b) swallows the mask so it doesn't get passed on.

(2) Use a lambda function called mask_aware_mean that knows to ignore all-zero rows when taking the mean.

This is a little bit silly (inefficient) because I first get rid of the mask and then reconstruct it, but it avoids the whole MaskEatingLambda business. You can also use ZeroMaskedEntries in other places, and easily modify it to pass on the mask if need be.

Here is ZeroMaskedEntries:

import keras.backend as K
from keras.engine.topology import Layer

class ZeroMaskedEntries(Layer):
    """
    This layer is called after an Embedding layer.
    It zeros out all of the masked-out embeddings.
    It also swallows the mask without passing it on.
    You can change this to default pass-on behavior as follows:

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return None
        else:
            return K.not_equal(x, 0)
    """

    def __init__(self, **kwargs):
        self.supports_masking = True
        super(ZeroMaskedEntries, self).__init__(**kwargs)

    def build(self, input_shape):
        self.output_dim = input_shape[1]
        self.repeat_dim = input_shape[2]

    def call(self, x, mask=None):
        mask = K.cast(mask, 'float32')
        mask = K.repeat(mask, self.repeat_dim)
        mask = K.permute_dimensions(mask, (0, 2, 1))
        return x * mask

    def compute_mask(self, input_shape, input_mask=None):
        return None

Below is a way to take the mean of what comes out of ZeroMaskedEntries. It does the silly business mentioned above of reconstructing the mask, but the computational hit is minor in my experience.

def mask_aware_mean(x):
    # recreate the masks - all zero rows have been masked
    mask = K.not_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)

    # number of rows that are not all zeros
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)

    # compute mask-aware mean of x
    x_mean = K.sum(x, axis=1, keepdims=False) / n

    return x_mean

def mask_aware_mean_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 3 
    return (shape[0], shape[2])

And here is a test to make sure it all works:

import numpy as np
from keras.layers import Input, Embedding, Lambda
from keras.models import Model

output_dim = 2
input_dim = 25
input_length = 4
main_input = Input(shape=(input_length,), dtype='int32')
embed = Embedding(output_dim=output_dim, input_dim=input_dim, input_length=input_length, mask_zero=True)(main_input)
embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean, mask_aware_mean_output_shape)(embed_zeroed)

model = Model(input=main_input,output=lambda_mean)
model.compile(optimizer='rmsprop',loss='mse')

# test
test_input = np.asarray([[0,0,2,0],[0,0,0,1],[0,0,2,1]], dtype='int32')
test_output =  model.predict(test_input)
print('Mean is working?', np.all(np.isclose(test_output[0:2,:].mean(0),test_output[2,:])))

Qululu commented 8 years ago

@sergeyf Thank you very much for your solution above!

I know my solution below is wrong, but can you explain to me why it is incorrect in achieving the average embedding vector?

def means(x):
    return K.mean(x, axis=1)

model = Sequential()
model.add(Embedding(num_features+2, 128))
model.add(Lambda(means, output_shape=(128,)))
model.add(Masking(mask_value=0))
model.add(Dense(64, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

I'm not getting any errors like the OP but I assume it's still computing the average with the zeros, correct? Thanks!

sergeyf commented 8 years ago

I think it's because Dense doesn't support masking.

iskandr commented 8 years ago

By chance I currently have the same problem: I want to average the outputs of a TimeDistributed Dense layer across all non-masked timesteps. Is @kavehtp's MeanOverTime the best way to do it?

edit: Just saw @sergeyf's mask_aware_mean.

sergeyf commented 8 years ago

@iskandr I haven't tried my approach after a TimeDistributed anything, so not sure how it would work. If it does, please put up an example here for posterity!

kavehtp commented 8 years ago

Here is another (slightly cleaner?) alternative implementation:

class MeanOverTime(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(MeanOverTime, self).__init__(**kwargs)

    def call(self, x, mask=None):
        return K.cast(x.sum(axis=1) / mask.sum(axis=1, keepdims=True), K.floatx())

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[2])

    def compute_mask(self, x, mask):
        return None

    def get_config(self):
        config = {}
        base_config = super(MeanOverTime, self).get_config()
        return dict(list(base_config.items()))
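
Dropped into the model from the top of this issue, it would be used like this (a sketch; note that call assumes a mask is actually present):

model = Sequential()
model.add(Embedding(vocab_size, emb_dim, mask_zero=True))
model.add(LSTM(lstm_dim, return_sequences=True))
model.add(MeanOverTime())   # mean over the non-masked timesteps; the mask stops here
model.add(Dense(10))
model.add(Activation('softmax'))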

Qululu commented 8 years ago

@sergeyf Just a follow-up question... I am observing that using your solution, my model always predicts the most frequent class for each and every test example. My goal is to average the embeddings of variable-length vectors (which I padded with zeros) and to predict one of n_classes classes.

Here is my implementation using your solution:

model = Sequential()
model.add(Embedding(num_features+2, 128, mask_zero=True))
model.add(ZeroMaskedEntries())
model.add(Lambda(mask_aware_mean))
model.add(Dense(n_classes, activation='softmax'))

When I disable the mask_zero flag and remove the ZeroMaskedEntries layer, it seems to suddenly work (i.e., doesn't always predict the same class for every example) as follows:

model = Sequential()
model.add(Embedding(num_features+2, 128))
model.add(Lambda(mask_aware_mean))
model.add(Dense(n_classes, activation='softmax'))

Why could this be happening, and what might I be doing wrong? Thanks!

sergeyf commented 8 years ago

@Qululu I am not sure. The use case I had was the same as yours. I have a bunch of variable-length text, and I wanted to train a smart average of it, without getting messed up by the zero-index vector that comes out of the Embedding layer. I found a model that works better for my use-case (sorry for the extra classes etc):

import keras.backend as K
from keras.layers import Lambda, Embedding
from keras.engine import Layer

class NamedLambda(Lambda):
    def __init__(self, name=None):
        Lambda.__init__(self, self.fn, name=name)

    @classmethod
    def invoke(cls, args, **kw):
        return cls(**kw)(args)

    def __repr__(self):
        return '%s(%s)' % (self.__class__.__name__, self.name)

class L2Normalize(NamedLambda):
    def fn(self, x):
        return K.l2_normalize(x, axis=-1)

class Sum(NamedLambda):
    def fn(self, x):
        return K.sum(x, axis=1)

embed_direction = Embedding(output_dim=output_dim,
                                input_dim=input_dim,  mask_zero=True)
mask_direction = ZeroMaskedEntries()
embedding = mask_direction(embed_direction(main_input))
sum = Sum.invoke(embedding, name='the_sum')
l2_normed_sum = L2Normalize.invoke(sum, name='l2_sum')

Try that. Or some other ideas for how to debug your original code:

(1) Alter mask_aware_mean to just take a dumb average, ignoring the mask. This will confirm that it's not ZeroMaskedEntries that's causing the problem.
(2) Alter ZeroMaskedEntries to just return x. This will confirm that it's not mask_aware_mean that is causing the problem.
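
Concretely, those two checks could look something like this (throwaway sketches with illustrative names, meant only to isolate the problem):

# (1) dumb average that ignores the mask entirely
def dumb_mean(x):
    return K.mean(x, axis=1)

# (2) pass-through variant of ZeroMaskedEntries that leaves the embeddings untouched
class PassThroughEntries(ZeroMaskedEntries):
    def call(self, x, mask=None):
        return x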

Qululu commented 8 years ago

@sergeyf Ok, I tried both debug suggestions. While (2) confirmed mask_aware_mean was indeed working, (1) led me to stumble upon an interesting phenomenon...

When I changed the mask_aware_mean implementation to ignore the mask, like this:

def mask_aware_mean(x):
    n = K.sum(K.cast(x, 'float32'), axis=1, keepdims=False)
    x_mean = K.sum(x, axis=1, keepdims=False) / n
    return x_mean

then I observed the problem still occurring. However, if I used the following dumb averaging implementation:

def means(x):
    return K.mean(x, axis=1)

then it worked.

So what is the difference between the two implementations? Aren't they both equivalent?

Also, if this implies that ZeroMaskedEntries is causing the problem, then how can I debug it too? Thanks!

sergeyf commented 8 years ago

@Qululu This looks wrong: n = K.sum(K.cast(x, 'float32'), axis=1, keepdims=False)

n should be the number of non-masked rows. Originally it was n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False), which makes sense since the mask is binary. x is not binary, so your operation is not going to give you the number of rows.

Qululu commented 8 years ago

@sergeyf Hi Sergey, oops you're right. Here, I corrected it to the best of my ability:

def mask_aware_mean(x):
    # All values will meet the criterion >= 0
    mask = K.greater_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
    x_mean = K.sum(x, axis=1, keepdims=False) / n
    return x_mean

The above implementation computes the dumb average, and indeed the phenomenon (the model predicting the most dominant class for every example) goes away. You suggested in (1) that this could mean ZeroMaskedEntries is causing the problem. Now I'm stuck; any help would be appreciated.

Not sure this is relevant, but I'm trying to compute the average embeddings for variable-length and non-sequential sets of one-hot encoded words. However, I still need to treat these sets as lists and pad them with zeros to be of uniform size for feeding into the Embedding layer, right?

import numpy as np
from keras.preprocessing import sequence

embedding_layer_input = sequence.pad_sequences(np.array(word_idxs), maxlen=MAX_WORD_IDX_LEN)

Is this the correct way to handle embeddings of variable-length non-sequential tokens?

sergeyf commented 8 years ago

@Qululu Yes, your padding looks right assuming that word_idxs is something like [[1,2,3],[4,2,1],...]

I'm not sure what the problem is. I think you need to set up a test bed to carefully go through and figure out where things stop behaving the way you want with your particular dataset. It sounds like you haven't found any particular part of my code to be broken in any obvious way, so I don't know how to help you debug further.

Also, try my K.l2_normalize solution.

Sorry I can't be of more help. Please report anything you find though - I'm sure others will find it useful.

cbaziotis commented 7 years ago

Hi, I have a little problem with masking. There are training examples in my dataset for which no masking is applied because their length is equal to (or slightly less than) the input_length in the Embedding layer. My problem is that, using the following layer:

class MeanOverTime(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(MeanOverTime, self).__init__(**kwargs)

    def call(self, x, mask=None):
        if mask is not None:
            return K.cast(x.sum(axis=1) / mask.sum(axis=1, keepdims=True), K.floatx())
        else:
            return K.mean(x, axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]

    def compute_mask(self, input, input_mask=None):
        return None

there are cases where this mask.sum(axis=1, keepdims=True) leads to division by zero. When this happens, the loss becomes NaN.

I don't know how to work with tensors, and I just need to add this little check. To bypass this, I have increased the input_length so that it covers all my training examples. I also tried adding a try/except, but that didn't work either.

cbaziotis commented 7 years ago

This is what I did. Hope it helps someone...

class MeanOverTime(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(MeanOverTime, self).__init__(**kwargs)

    def call(self, x, mask=None):
        if mask is not None:
            mask = K.cast(mask, 'float32')
            s = mask.sum(axis=1, keepdims=True)
            if K.equal(s, K.zeros_like(s)):  # note: a Python-level `if` on a symbolic tensor may not behave as intended on every backend
                return K.mean(x, axis=1)
            else:
                return K.cast(x.sum(axis=1) / mask.sum(axis=1, keepdims=True), K.floatx())
        else:
            return K.mean(x, axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]

    def compute_mask(self, input, input_mask=None):
        return None

JakubKolodziej commented 6 years ago

@cbaziotis thanks for your snippet. It seems to throw errors with my version of keras. Here is my updated version (I flatten the output vector):

class MeanOverTime(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(MeanOverTime, self).__init__(**kwargs)

    def call(self, x, mask=None):
        if mask is not None:
            mask = K.cast(mask, K.floatx())
            mask_sum = K.sum(mask, axis=1, keepdims=True)
            mask_sum = K.maximum(1.0, mask_sum)  # avoid division by zero for all-masked rows
            return K.sum(x, axis=1, keepdims=False) / mask_sum
        else:
            return K.mean(x, axis=1, keepdims=False)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

    def compute_mask(self, input, input_mask=None):
        return None
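
For completeness, a usage sketch with the functional API (the dimensions and names here are placeholders):

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

main_input = Input(shape=(maxlen,), dtype='int32')
x = Embedding(vocab_size + 1, emb_dim, mask_zero=True)(main_input)
x = LSTM(lstm_dim, return_sequences=True)(x)
x = MeanOverTime()(x)   # mean over non-masked timesteps; the mask is dropped here
pred = Dense(n_classes, activation='softmax')(x)
model = Model(inputs=main_input, outputs=pred)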