Closed: TheSunWillRise closed this issue 5 years ago

Hi, this is really nice work, and the task proposed in your paper is a valuable research problem. But I have a question about MaskNet: according to Section 4.1 and Fig. 3B of your paper, the proposals are generated from the one-hot vectors and the target-encoder outputs, and the scene-encoder outputs are not used to generate proposals. However, in your code, the function `mask_net` generates proposals from the one-hot vectors and the output of the function `matching_filter`, which takes both `targets_encoded` and `images_encoded` as inputs. Is anything wrong here?
Hi, thanks for your comment. The code is a bit misleading here, but the scene information is not used. However, I also had to read through it again to understand what is happening and to be sure this is the case.
What happens is: the matching step only rescales the target representation by a per-position scalar, and the result is then layer-normalized without learned scale or offset:

```python
decoder_input = slim.layer_norm(masked, scale=False, center=False, scope='matching_normalization')
```

Layer normalization subtracts the mean of each feature vector and divides by its standard deviation, so a positive per-position scalar factor cancels out. Therefore the effect is the same as never having scaled the target representation at all.
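To make the cancellation concrete, here is a minimal NumPy sketch (the `layer_norm` helper below mimics `slim.layer_norm(scale=False, center=False)`; all shapes are made up for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each feature vector to zero mean and unit variance,
    # with no learned scale or offset, mimicking
    # slim.layer_norm(scale=False, center=False).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(0)
target = rng.normal(size=(1, 4, 4, 32))           # target encoding (made-up shape)
match = rng.uniform(0.5, 2.0, size=(1, 4, 4, 1))  # positive per-position match score

masked = target * match  # scaling the target with the matching
# The per-position scalar cancels in the normalization, so the result is
# numerically the same as if the target had never been scaled.
print(np.allclose(layer_norm(masked), layer_norm(target), atol=1e-4))  # True
```

Note that the cancellation only holds for positive match values; a negative scalar would flip the sign of the normalized features.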
I have to admit, this is more than confusing, and scaling the target with the matching instead of simply tiling its representation is unnecessary. However, since the layer norm effectively reverts the process, it should be correct and work as stated in the paper.
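For reference, a minimal sketch of the simpler tiling alternative, assuming the target has been pooled to a single [B, 1, 1, C] vector; the helper name `tile_target` and the shapes are assumptions for illustration, not the repository's actual API:

```python
import tensorflow as tf

def tile_target(targets_encoded, images_encoded):
    """Broadcast the target representation over the scene's spatial grid
    instead of scaling it with the matching result.

    targets_encoded: [B, 1, 1, C] pooled target encoding (assumed shape)
    images_encoded:  [B, H, W, C] scene feature map (assumed shape)
    """
    h = tf.shape(images_encoded)[1]
    w = tf.shape(images_encoded)[2]
    # Repeat the single target vector at every spatial position.
    return tf.tile(targets_encoded, [1, h, w, 1])
```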
Thanks for pointing us to this needless complication. I guess this is not the only place where the code is unnecessarily complex. I will try to find some time to go over it and remove similar confusing elements.