@Superlee506 Yes, you're right, you can use tf.ceil() instead of tf.round().
@QtSignalProcessing Copy that, thanks so much! I've been confused about it for days. Does your loss converge eventually?
@Superlee506 Yes. BTW, @RodrigoGantier's code is a very good reference; I benefited a lot from his work.
@QtSignalProcessing I also referred to his work. Thanks for your suggestion. Finally, my loss started converging, haha. Does this loss look right at the start?
@filipetrocadoferreira My loss doesn't converge either. How did you fix your issue?
@QtSignalProcessing I used your loss function, but it doesn't converge. Can you give me some suggestions? It has confused me for days.
@Superlee506 I suppose you've tried reducing the learning rate. From my experience, a possible cause of NaN or exploding loss is an error in your label pre-processing steps, especially if you re-wrote the utils.resize_mask(), utils.minimize_mask() (and perhaps other) functions for keypoint masks. Otherwise I have no further ideas.
@QtSignalProcessing I inspected and visualized all of my inputs and I'm sure they are right. I think there is something wrong with my loss function.
@Superlee506 I suggest you check your code again to make sure that pred_masks and target_masks in your loss-function graph have the same shape. If you are using Rodrigo's code with the loss function I posted, I think it's better to use Detectron. BTW, I also checked the Detectron source code: the loss function for keypoint detection is SoftmaxWithLoss in caffe2, which means exactly the same thing as what I posted.
@QtSignalProcessing I used Rodrigo's code, but I changed it to the COCO dataset. I checked my code again and again, and the only possibly wrong place I found is the use of tf.ceil() after tf.image.crop_and_resize in DetectionTargetLayer. It results in multiple 1s in the mask, so my cross-entropy loss is nearly 32.0 and hard to converge. Do you know how to fix this?
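For context, a tiny standalone sketch (TF 1.x assumed; the 2x2 mask and full-image box are made up) of how the two ops behave on a one-hot keypoint target after tf.image.crop_and_resize: bilinear interpolation spreads the single 1 across neighbouring cells, so tf.ceil marks every non-zero cell as 1 (the "multiple 1s"), while tf.round can drop values that fall below 0.5:

```python
import tensorflow as tf  # TF 1.x assumed

mask = tf.reshape(tf.constant([[1., 0.],
                               [0., 0.]]), [1, 2, 2, 1])   # a 2x2 one-hot keypoint target
boxes = tf.constant([[0., 0., 1., 1.]])                    # full-image box, normalized coords
crop = tf.image.crop_and_resize(mask, boxes, box_ind=[0], crop_size=[3, 3])

with tf.Session() as sess:
    c, r, cl = sess.run([crop, tf.round(crop), tf.ceil(crop)])
    print(c[0, :, :, 0])   # interpolated values between 0 and 1
    print(r[0, :, :, 0])   # tf.round: only values near or above 0.5 survive
    print(cl[0, :, :, 0])  # tf.ceil: every non-zero value becomes 1 -> several 1s
```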
@Superlee506 Multiple 1s should be normal in the early stages.
How many iterations have you trained your net for? In my case, I used the following training strategy:
```python
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=15,
            layers='heads')

# Training - Stage 2
# Finetune layers from ResNet stage 4 and up
print("Training Resnet layer 4+")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=20,
            layers='4+')

# Training - Stage 3
# Finetune layers from ResNet stage 3 and up
print("Training Resnet layer 3+")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 100,
            epochs=100,
            layers='all')
```
The learning rate is 0.002 and STEPS_PER_EPOCH is 1000. My keypoint loss started below 10, rose to around 23, and finally converged to a value around 7.
My suggestions: 1. train your net for more epochs, and 2. ADD your keypoint head in parallel with the mask head (do NOT use the keypoint head to replace the mask head). Your final loss function should be L = L_cls + L_box + L_mask + L_kptmask.
@QtSignalProcessing I'm really thankful for your patience and suggestions. I trained my model for more than 50,000 iterations and the loss is still near 30. I removed the mask loss; maybe I should add it back.
I plan to use your proposed loss function, but I think it is the same as tf.nn.softmax_cross_entropy_with_logits, because softmax_cross_entropy_with_logits already guards against log(0), which is what the eps is for (a quick side-by-side check is sketched after the code below).
My loss function is as follows. I filter out the negative, invisible, and all-zero masks in three steps, and I checked the filtered results; they are right.
```python
def keypoint_mrcnn_mask_loss_graph(target_keypoint_masks, target_keypoint_weights,
                                   target_class_ids, pred_keypoint_masks,
                                   mask_shape=[56, 56], number_point=17):
    """Mask softmax cross-entropy loss for the keypoint head.

    target_keypoint_masks: [batch, num_rois, height, width, num_keypoints].
        A float32 tensor of values 0 or 1. Uses zero padding to fill the array.
    target_keypoint_weights: [num_person, num_keypoints]
        0: not visible and without annotations
        1: not visible but with annotations
        2: visible and with annotations
    target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
    pred_keypoint_masks: [batch, proposals, height, width, num_keypoints] float32 tensor
        with values from 0 to 1.
    """
    # Reshape for simplicity. Merge first two dimensions into one. Shape: [N]
    target_class_ids = K.reshape(target_class_ids, (-1,))

    # Only positive person ROIs contribute to the loss, and only
    # the person-specific mask of each ROI.
    positive_people_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_people_ids = tf.cast(
        tf.gather(target_class_ids, positive_people_ix), tf.int64)

    ### Step 1: get the positive target and predicted keypoint masks
    # Reshape target_keypoint_weights to [N, num_keypoints]
    target_keypoint_weights = K.reshape(target_keypoint_weights, (-1, number_point))
    # Reshape target_keypoint_masks to [N, 56*56, 17]
    target_keypoint_masks = K.reshape(target_keypoint_masks,
                                      (-1, mask_shape[0] * mask_shape[1], number_point))
    # Permute target keypoint masks to [N, num_keypoints, height*width]
    target_keypoint_masks = tf.transpose(target_keypoint_masks, [0, 2, 1])
    # Reshape pred_keypoint_masks to [N, 56*56, 17]
    pred_keypoint_masks = K.reshape(pred_keypoint_masks,
                                    (-1, mask_shape[0] * mask_shape[1], number_point))
    # Permute predicted masks to [N, num_keypoints, height*width]
    pred_keypoint_masks = tf.transpose(pred_keypoint_masks, [0, 2, 1])

    # Gather the keypoint masks (target and predicted) that contribute to the loss.
    # Shape: [N_positive, num_annotated_keypoints, height*width]
    positive_target_keypoint_masks = tf.gather(target_keypoint_masks, positive_people_ix)
    positive_pred_keypoint_masks = tf.gather(pred_keypoint_masks, positive_people_ix)
    # Positive target_keypoint_weights: [N_positive, num_keypoints]
    positive_keypoint_weights = tf.cast(
        tf.gather(target_keypoint_weights, positive_people_ix), tf.int64)

    ### Step 2: get the visible and annotated keypoint masks that contribute to the loss
    # Reshape positive_keypoint_weights to [N_positive*17]
    positive_keypoint_weights = K.reshape(positive_keypoint_weights, (-1,))
    annotated_keypoint_ix = tf.where(positive_keypoint_weights > 0)[:, 0]
    # Reshape target and predicted keypoint masks to [N_positive*17, 56*56]
    positive_target_keypoint_masks = K.reshape(positive_target_keypoint_masks,
                                               (-1, mask_shape[0] * mask_shape[1]))
    positive_pred_keypoint_masks = K.reshape(positive_pred_keypoint_masks,
                                             (-1, mask_shape[0] * mask_shape[1]))
    # Keep only the visible and annotated keypoint masks
    y_true = tf.gather(positive_target_keypoint_masks, annotated_keypoint_ix)
    y_pred = tf.gather(positive_pred_keypoint_masks, annotated_keypoint_ix)

    ### Step 3: drop all-zero masks. Because of the ROI crop, some target keypoint masks may be all zeros.
    y_true_sum = tf.reduce_sum(y_true, axis=-1)
    good_ids = tf.where(y_true_sum > 0)[:, 0]
    # Keep the non-zero masks. Shape: [N_non_zero, 56*56]
    y_true = tf.gather(y_true, good_ids)
    y_pred = tf.gather(y_pred, good_ids)

    # Shape: [N_non_zero, 56*56]
    labels = tf.to_float(y_true)
    eps = tf.constant(value=1e-4)
    softmax = tf.nn.softmax(y_pred) + eps
    cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
    loss = K.switch(tf.size(labels) > 0,
                    lambda: tf.reduce_mean(cross_entropy),
                    lambda: tf.constant(0.0))
    return loss
```
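For reference, a quick side-by-side check of the claim above that the manual softmax + eps matches tf.nn.softmax_cross_entropy_with_logits (a standalone sketch, assuming TF 1.x; labels and logits are synthetic). The two agree when the probability at the labelled bin is well above the eps, but at the start of training, when the 56*56 softmax is nearly uniform, the added eps noticeably lowers the hand-rolled loss:

```python
import numpy as np
import tensorflow as tf  # TF 1.x assumed; the data below is synthetic

num_bins = 56 * 56
labels = np.zeros((1, num_bins), np.float32)
labels[0, 123] = 1.0                                               # arbitrary "true" bin
logits = (np.random.randn(1, num_bins) * 0.01).astype(np.float32)  # near-uniform, as at init

eps = 1e-4
manual_ce = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits) + eps), axis=1)
builtin_ce = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([manual_ce, builtin_ce]))  # close, but not identical, when probabilities ~ eps
```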
@QtSignalProcessing This is my keypoint mask head graph. When I followed your suggestion, the loss at the first stage was just like this. My input keypoint masks are definitely right, because I checked every line of the code and visualized them.
```python
x = PyramidROIAlign([pool_size, pool_size], image_shape,
                    name="roi_align_keypoint_mask")([rois] + feature_maps)
for i in range(8):
    x = KL.TimeDistributed(KL.Conv2D(512, (3, 3), padding="same"),
                           name="mrcnn_keypoint_mask_conv{}".format(i + 1))(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_keypoint_mask_bn{}'.format(i + 1))(x)
    x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2DTranspose(512, (2, 2), strides=2, activation="relu"),
                       name="mrcnn_keypoint_mask_deconv")(x)
x = KL.TimeDistributed(
    KL.Lambda(lambda z: tf.image.resize_bilinear(z, [28, 28]), name="mrcnn_keypoint_mask_upsample_1"))(x)
x = KL.TimeDistributed(
    KL.Lambda(lambda z: tf.image.resize_bilinear(z, [56, 56]), name="mrcnn_keypoint_mask_upsample_2"))(x)
x = KL.TimeDistributed(KL.Conv2D(num_keypoints, (1, 1), strides=1, activation="sigmoid"),
                       name="mrcnn_keypoint_mask")(x)
```
@Superlee506 My code is not the same as yours from the mrcnn_keypoint_mask_deconv layer onward.
```python
x = KL.TimeDistributed(KL.Conv2DTranspose(num_key_pts, (2, 2), strides=2),
                       name="mrcnn_kpt_mask_deconv")(x)
x = KL.TimeDistributed(KL.Conv2DTranspose(num_key_pts, (4, 4), strides=2, padding="same",
                                          kernel_initializer=keras.initializers.Constant(
                                              bilinear_upsample_weights(factor=2,
                                                                        number_of_classes=num_key_pts))),
                       name="mrcnn_kpt_mask_deconv_upscale")(x)
return x


def upsample_filt(size):
    factor = (size + 1) // 2
    if size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)


def bilinear_upsample_weights(factor, number_of_classes):
    filter_size = factor * 2 - factor % 2
    weights = np.zeros((filter_size, filter_size, number_of_classes, number_of_classes),
                       dtype=np.float32)
    upsample_kernel = upsample_filt(filter_size)
    for i in range(number_of_classes):
        weights[:, :, i, i] = upsample_kernel
    return weights
```
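For reference, a small usage check of the helpers above (a sketch that assumes the two functions are pasted as-is, together with numpy):

```python
import numpy as np

# upsample_filt / bilinear_upsample_weights as defined above are assumed to be in scope.
print(upsample_filt(4))
# -> the 4x4 bilinear kernel, i.e. the outer product of [0.25, 0.75, 0.75, 0.25]

w = bilinear_upsample_weights(factor=2, number_of_classes=17)
print(w.shape)        # (4, 4, 17, 17): kernel_h, kernel_w, filters, input channels
print(w[:, :, 0, 0])  # each class channel gets the same bilinear kernel on the diagonal
```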
@Superlee506 My kpt loss function:

```python
def mrcnn_kpt_mask_loss_graph(target_masks, target_class_ids, pred_masks):
    num_kpt = 19
    target_class_ids = K.reshape(target_class_ids, (-1,))
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    target_masks = K.reshape(target_masks, (-1, 56 * 56, num_kpt))
    pred_masks = K.reshape(pred_masks, (-1, 56 * 56, num_kpt))

    # Gather the masks (predicted and true) that contribute to the loss
    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather(pred_masks, positive_ix)

    loss = []
    for ii in range(0, num_kpt):
        logits = y_pred[:, :, ii]
        eps = tf.constant(value=1e-4)
        labels = tf.to_float(y_true[:, :, ii])
        softmax = tf.nn.softmax(logits) + eps
        cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
        cross_entropy_mean = K.switch(tf.size(labels) > 0,
                                      tf.reduce_mean(cross_entropy),
                                      tf.constant(0.0))
        loss.append(cross_entropy_mean)
    loss = tf.stack(loss)
    loss = K.mean(loss)
    return loss
```
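A small design note on the per-keypoint Python loop above: it creates num_kpt separate graph ops whose per-keypoint means are then averaged with equal weight. A hedged alternative sketch (illustrative names; using the built-in op instead of the manual softmax + eps) folds the keypoint axis into the batch axis so a single op covers all keypoints:

```python
import tensorflow as tf
import keras.backend as K  # same backend alias as above

def mrcnn_kpt_mask_loss_vectorized(y_true, y_pred, num_kpt=19, hw=56 * 56):
    """y_true, y_pred: [N_positive, 56*56, num_kpt], as gathered in the function above."""
    # Fold the keypoint axis into the batch axis: [N_positive * num_kpt, 56*56]
    y_true_flat = tf.reshape(tf.transpose(y_true, [0, 2, 1]), (-1, hw))
    y_pred_flat = tf.reshape(tf.transpose(y_pred, [0, 2, 1]), (-1, hw))
    ce = tf.nn.softmax_cross_entropy_with_logits(labels=y_true_flat, logits=y_pred_flat)
    return K.switch(tf.size(y_true_flat) > 0, tf.reduce_mean(ce), tf.constant(0.0))
```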
@QtSignalProcessing The number of keypoints in the COCO dataset is 17. Why did you use 19?
@Superlee506 I am not using COCO.
@QtSignalProcessing Finally, I checked the original Detectron code for human pose estimation and changed my code. The loss converged, but the detection results aren't as good as in the original paper, and the model can't distinguish between the right and left shoulder, right and left knee, etc., no matter whether I used the flipping augmentation or not.
What did you change?
@filipetrocadoferreira A lot of places, and I found many mistakes in RodrigoGantier's code. Firstly, I changed the ground-truth keypoints to label (integer index) form. Secondly, the loss function: I added weights to the keypoint loss as Detectron does, and then the sparse_softmax_cross_entropy_with_logits loss converges quickly. More importantly, the flipping method isn't right for keypoints and needs some modifications. However, my results aren't as good as the original paper, and I'm confused about that. I plan to publish my code when the results look good.
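For anyone following along, a rough sketch of those two changes (illustrative only, not the exact code from the repository linked later in the thread; TF 1.x and the standard COCO 17-keypoint order are assumed). First, the Detectron-style loss: each keypoint's ground truth becomes an integer label, the flattened index of its cell in the 56x56 heatmap, and visibility weights zero out unannotated keypoints before averaging:

```python
import tensorflow as tf  # TF 1.x assumed; names and shapes are illustrative

def keypoint_sparse_loss_sketch(kpt_labels, kpt_weights, kpt_logits,
                                num_keypoints=17, mask_size=56):
    """kpt_labels:  [N, num_keypoints] int32, flattened heatmap index (row * 56 + col).
       kpt_weights: [N, num_keypoints] float32, 1 for visible/annotated keypoints, else 0.
       kpt_logits:  [N, 56, 56, num_keypoints] float32 raw scores (no activation on the head).
    """
    # One (56*56)-way classification per keypoint: fold keypoints into the batch axis.
    logits = tf.reshape(tf.transpose(kpt_logits, [0, 3, 1, 2]),
                        [-1, mask_size * mask_size])              # [N*K, 3136]
    labels = tf.reshape(kpt_labels, [-1])                         # [N*K]
    weights = tf.reshape(kpt_weights, [-1])                       # [N*K]
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    # Ignore invisible / unannotated keypoints and average over the rest.
    return tf.reduce_sum(ce * weights) / tf.maximum(tf.reduce_sum(weights), 1.0)
```

Second, horizontal flipping has to swap the left/right keypoint channels in addition to mirroring the x coordinates; a minimal sketch:

```python
import numpy as np  # illustrative sketch; the standard COCO 17-keypoint order is assumed

# COCO order: 0 nose, 1 left_eye, 2 right_eye, 3 left_ear, 4 right_ear, 5 left_shoulder,
# 6 right_shoulder, 7 left_elbow, 8 right_elbow, 9 left_wrist, 10 right_wrist,
# 11 left_hip, 12 right_hip, 13 left_knee, 14 right_knee, 15 left_ankle, 16 right_ankle
FLIP_MAP = [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15]

def flip_keypoints(keypoints, image_width):
    """keypoints: [num_person, 17, 3] array of (x, y, v). Swaps left/right channels and mirrors x."""
    flipped = keypoints[:, FLIP_MAP, :].copy()
    visible = flipped[..., 2] > 0
    flipped[..., 0] = np.where(visible, image_width - 1 - flipped[..., 0], flipped[..., 0])
    return flipped
```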
Nice! I also found that the keypoint ground truth can't be handled the same way as the mask (because the resizes and crops will probably corrupt it).
the code would be amazing
@filipetrocadoferreira @QtSignalProcessing @RodrigoGantier @racinmat I open-sourced my project with detailed code comments. The loss converges quickly, but the predicted results are not as good as the original paper. I only have one 980 graphics card, so I'm releasing my code, and any contribution or improvement is welcome and appreciated. https://github.com/Superlee506/Mask_RCNN
@Superlee506 It's really hard to achieve the results reported in the original paper, since the training parameters have to be carefully selected (I read something like this in one of the Detectron issues). BTW, distinguishing left/right keypoints relies on the geometric information of the human body; this could be done with post-processing.
@QtSignalProcessing How would you do the post-processing? In my case, my model usually outputs the left/right keypoints together.
I changed the name of the repository: https://github.com/Superlee506/Mask_RCNN_Humanpose
@Superlee506 The positions of the nose and eyes provide information you can use to distinguish left from right. Otherwise you have to make some assumptions.
Sorry for my last comment, I used the wrong words. The best way to distinguish left and right is to change the keypoint head so that it models the relationships between keypoints.
@RodrigoGantier, thanks for your advice; it is really helpful to me. But I have a few questions about your code below. First: I know the 14 should be the number of keypoints; does it include the background? Second: how do you get positive_ix, from the RCNN part of the results or somewhere else? Third: since you compute a loss for every keypoint, should the reported keypoint detection loss be their mean? Any advice will be appreciated, thank you very much.
```python
pred_masks = K.reshape(pred_masks, (-1, 784, 14))
target_masks = K.reshape(target_masks, (-1, 784, 14))

# Gather the masks (predicted and true) that contribute to loss
y_true = tf.gather(target_masks, positive_ix)
y_pred = tf.gather(pred_masks, positive_ix)

loss = []
for i in range(14):
    loss.append(tf.nn.softmax_cross_entropy_with_logits(logits=y_pred[:, :, i],
                                                        labels=y_true[:, :, i]))
loss = tf.stack(loss)
```
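(Regarding the second question above: in the loss functions shown earlier in this thread, positive_ix comes from the zero-padded target class IDs produced by the detection target layer; a minimal sketch with an illustrative helper name:)

```python
import tensorflow as tf
import keras.backend as K

def positive_roi_indices(target_class_ids):
    """target_class_ids: [batch, num_rois] integer class IDs, zero-padded."""
    flat_ids = K.reshape(target_class_ids, (-1,))   # merge batch and ROI dimensions
    return tf.where(flat_ids > 0)[:, 0]             # indices of ROIs assigned to a real class
```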
@rujiao @waleedka @RodrigoGantier, thanks for your advice. I have now changed the segmentation part for keypoint detection, but I found that rcnnL1Loss becomes very big, e.g. 234677418896143419441152.0000. Do you know what the reason is? Any advice will be appreciated. Thank you.
@rujiao @waleedka @taewookim @liu6381810 @MaeThird Can anyone please help me get a feature vector of the masked region? I'm really struggling with it. I first referred to this issue: https://github.com/matterport/Mask_RCNN/issues/1249 and then https://github.com/matterport/Mask_RCNN/issues/1190. I made all the changes mentioned there, but I get errors like "positional argument required "roi_pooled_features"". I've been stuck on this for a month, and any help would be really appreciated.
Has anyone been successful in using the Mask RCNN to detect only keypoints?
Yes, I have used Mask-RCNN to detect bboxes and keypoints. It works quite well. You can simply remove the mask part.
@rujiao Could you share your code on GitHub?
Hi @waleedka: Thanks for the great work! Is it possible to train for keypoint detection? Sorry for the wrong title of the issue; I can't correct it.