open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

Low COCO Evaluation results although acc_pose is high #770

Closed · rubeea closed this issue 3 years ago

rubeea commented 3 years ago

Hi, I am trying to add a new data augmentation technique to the train pipeline. I have inserted the technique after TopDownAffine in the train pipeline. When I train the model, I get good acc_pose values and the loss also decreases. However, when the evaluation is run (after 50 epochs), the resulting metrics are very poor:

Average Precision (AP) @[ IoU=0.50:0.95 | type=   all | maxDets= 20 ] = 0.001
Average Precision (AP) @[ IoU=0.50      | type=   all | maxDets= 20 ] = 0.009
Average Precision (AP) @[ IoU=0.75      | type=   all | maxDets= 20 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | type=medium | maxDets= 20 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | type= large | maxDets= 20 ] = 0.002
Average Recall    (AR) @[ IoU=0.50:0.95 | type=   all | maxDets= 20 ] = 0.015
Average Recall    (AR) @[ IoU=0.50      | type=   all | maxDets= 20 ] = 0.087
Average Recall    (AR) @[ IoU=0.75      | type=   all | maxDets= 20 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | type=medium | maxDets= 20 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | type= large | maxDets= 20 ] = 0.017

What could be the reason for this? Should I incorporate the augmentation method in the val pipeline as well? The method simply merges the image with its mask image and returns the result.
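For reference, the transform looks roughly like the sketch below when written as a custom mmpose pipeline step. The class name MergeWithMask, the results['mask'] key, the blend weights, and the PIPELINES import path are simplified placeholders for illustration rather than my exact code (the registry location also varies across mmpose 0.x versions):

```python
# Sketch of a custom pipeline transform that blends the cropped image with a
# mask image. Names, keys, and the import path below are assumptions.
import cv2

from mmpose.datasets.builder import PIPELINES


@PIPELINES.register_module()
class MergeWithMask:
    """Blend the image produced by TopDownAffine with its mask image."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha

    def __call__(self, results):
        img = results['img']
        mask = results['mask']  # assumed to be loaded earlier in the pipeline
        # Weighted blend; requires img and mask to share shape and dtype.
        results['img'] = cv2.addWeighted(img, self.alpha, mask, 1 - self.alpha, 0)
        return results
```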

jin-s13 commented 3 years ago

What dataset did you use?

rubeea commented 3 years ago

> What dataset did you use?

@jin-s13 It is a custom dataset of power line objects with 3 keypoints, as described here: https://github.com/open-mmlab/mmpose/issues/707#issue-913455927. The dataset has 200 images in total (160 training and 40 validation). I am using MobileNetV3 to train my detector. The image size is 256x256.

jin-s13 commented 3 years ago

OK.

  1. You can visualize the results and check whether the low scores are caused by misuse of the COCO evaluation tool.
  2. It seems that you are using the COCO evaluation tool, which requires sigmas as input; see https://github.com/open-mmlab/mmpose/blob/3e722c6645480b5974f9cfa8e76d5f38bff65876/mmpose/datasets/datasets/top_down/topdown_coco_dataset.py#L95

You should also check the sigmas used during evaluation.
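As a rough illustration of where the sigmas live, assuming a custom 3-keypoint dataset class derived from the linked TopDownCocoDataset (the class name, import paths, and values below are placeholders, not your actual dataset):

```python
import numpy as np

from mmpose.datasets.builder import DATASETS
from mmpose.datasets.datasets.top_down import TopDownCocoDataset


@DATASETS.register_module()
class TopDownPowerLineDataset(TopDownCocoDataset):  # hypothetical custom dataset
    """3-keypoint dataset evaluated with the COCO OKS/mAP tool."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The COCO evaluation tool computes OKS from one sigma per keypoint,
        # so a 3-keypoint dataset needs a length-3 array (placeholder values).
        self.sigmas = np.array([0.26, 0.25, 0.25]) / 10.0
```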

rubeea commented 3 years ago

> OK.
>
> 1. You can visualize the results and check whether the low scores are caused by misuse of the COCO evaluation tool.
> 2. It seems that you are using the COCO evaluation tool, which requires sigmas as input; see https://github.com/open-mmlab/mmpose/blob/3e722c6645480b5974f9cfa8e76d5f38bff65876/mmpose/datasets/datasets/top_down/topdown_coco_dataset.py#L95
>
> You should also check the sigmas used during evaluation.

@jin-s13 How can I calculate the sigma values for my own dataset? Or should I use another evaluation metric such as PCK, AUC or EPE? Which one is preferable?

jin-s13 commented 3 years ago

The sigmas measure the labeling error (the standard deviation across annotators).

mAP is useful when there are multiple objects in an image. I recommend using PCK and AUC for evaluation if the image only contains one object.
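In config terms, the switch would look roughly like the snippet below, assuming your custom dataset class actually implements the chosen metrics in its evaluate() method (the COCO-style top-down dataset classes in mmpose 0.x implement 'mAP', while PCK/AUC/EPE are implemented by other dataset classes):

```python
# With multiple objects per image, OKS-based mAP (the default for COCO-style
# datasets) is the usual choice:
evaluation = dict(interval=10, metric='mAP', key_indicator='AP')

# For single-object images, a dataset whose evaluate() supports them could use:
# evaluation = dict(interval=10, metric=['PCK', 'AUC', 'EPE'])
```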

rubeea commented 3 years ago

> The sigmas measure the labeling error (the standard deviation across annotators).
>
> mAP is useful when there are multiple objects in an image. I recommend using PCK and AUC for evaluation if the image only contains one object.

I have multiple objects in the image, so I believe mAP is more suitable. How can we measure sigma if the keypoints are automatically labelled, or if only one set of keypoints is available (i.e. they are not annotated by multiple people)? Can we use any default values?

jin-s13 commented 3 years ago

It affects the mAP a lot. If you do not want to penalize keypoints with small errors, you can set a higher sigma.

By the way, please also note that for top-down methods, bboxes must be provided for evaluation. Did you prepare your detection bboxes, or did you use the gt bboxes?

rubeea commented 3 years ago

> It affects the mAP a lot. If you do not want to penalize keypoints with small errors, you can set a higher sigma.
>
> By the way, please also note that for top-down methods, bboxes must be provided for evaluation. Did you prepare your detection bboxes, or did you use the gt bboxes?

What values should I set, <1 or >1? Currently I am using self.sigmas = np.array([.26, .25, .25]) / 10.0 as my sigma values. I am not evaluating on the test set, only on the validation set, with workflow = [('train', 1)] and the validate=True option, so I am using the gt bboxes for the validation dataset.

jin-s13 commented 3 years ago

You may try a larger sigma, as 0.25 is very strict, e.g. [1.0, 1.0, 1.0] / 10.

See https://github.com/open-mmlab/mmpose/blob/3e722c6645480b5974f9cfa8e76d5f38bff65876/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/hrnet_w32_coco_256x192.py#L98

Did you set it to True?
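To see why 0.025 (i.e. 0.25 / 10) is strict, here is a small, self-contained illustration of the per-keypoint OKS term used by the COCO tool; the 10 px error and 64x64 px object area are made-up numbers for illustration:

```python
import numpy as np


def keypoint_oks(dist_px, area_px2, sigma):
    """Single-keypoint OKS term, following pycocotools' computeOks."""
    var = (2 * sigma) ** 2
    return np.exp(-(dist_px ** 2) / (2 * area_px2 * var))


area = 64 * 64  # assumed object area in pixels^2
for sigma in (0.025, 0.10):
    # A 10 px error scores ~0.008 with sigma=0.025 but ~0.74 with sigma=0.10,
    # which is why very small sigmas can drive AP towards zero.
    print(f'sigma={sigma}: OKS={keypoint_oks(10.0, area, sigma):.3f}')
```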

rubeea commented 3 years ago

> You may try a larger sigma, as 0.25 is very strict, e.g. [1.0, 1.0, 1.0] / 10.
>
> See https://github.com/open-mmlab/mmpose/blob/3e722c6645480b5974f9cfa8e76d5f38bff65876/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/hrnet_w32_coco_256x192.py#L98
>
> Did you set it to True?

Ok, noted. No, it is set to False. Should I set use_gt_bbox to True?

jin-s13 commented 3 years ago

Yes, set use_gt_bbox=True.
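For reference, the relevant part of data_cfg in the config would look roughly like this (other keys omitted; the image/heatmap sizes here just mirror the 256x256 setup mentioned above):

```python
data_cfg = dict(
    image_size=[256, 256],
    heatmap_size=[64, 64],
    num_output_channels=3,
    num_joints=3,
    # Evaluate on ground-truth boxes instead of detector outputs, so no
    # detection-result json file is needed.
    use_gt_bbox=True,
    det_bbox_thr=0.0,
    bbox_file='',
)
```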

rubeea commented 3 years ago

> Yes, set use_gt_bbox=True.

@jin-s13 thanks for all the suggestions. I did as you suggested, but the COCO eval metrics are still very poor. The loss value and acc_pose are good, so why aren't the COCO metrics improving?

2021-07-08 14:08:52,599 - mmpose - INFO - Epoch [51][50/84] lr: 5.000e-04, eta: 0:12:35, time: 0.094, data_time: 0.043, memory: 143, mse_loss: 0.0007, acc_pose: 0.7867, loss: 0.0007

Also, this happens when I add the new data augmentation technique at the start or end of the train pipeline. If I do not add the technique, the COCO eval metrics are good (approximately 75%). Moreover, if I add the proposed augmentation technique to the val pipeline (test-time augmentation), I get good results.

jin-s13 commented 3 years ago

Sorry for the late reply. I am not sure which data augmentation you used, but this can happen if the network input is heavily changed by the augmentation. In that case, the learning of the network will be biased.

rubeea commented 3 years ago

> Sorry for the late reply. I am not sure which data augmentation you used, but this can happen if the network input is heavily changed by the augmentation. In that case, the learning of the network will be biased.

Hi @jin-s13, thanks for your reply. One point to note is that when I use the augmentation technique in the val pipeline, it does give good results, just as TopDownAffine is used in both the train and val pipelines. Does that count as test-time augmentation?

jin-s13 commented 3 years ago

Data augmentation aims to make the distribution of the training data similar to that of the test data. Popular data augmentation tricks are random shift, flip, and rotation. The idea is to transform the training set so that it covers all possible cases in the test set.

"The method simply merges the image with its mask image": if I understand correctly, your method changes the input format. Its purpose is not to mimic the distribution of the test set, so it is not a general data augmentation technique.

To achieve good performance, we have to make the distributions of the training set and the test set as similar as possible. That is why you obtain good performance when you use it in the val pipeline.
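Concretely, in mmpose 0.x config terms that means the custom step appears in both pipelines, roughly as below. MergeWithMask is the hypothetical transform name sketched earlier; the other transforms are the standard top-down ones, and the exact arguments will differ for a 3-keypoint dataset:

```python
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine'),
    dict(type='MergeWithMask'),  # custom step, inserted after TopDownAffine
    dict(type='ToTensor'),
    dict(type='NormalizeTensor',
         mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=2),
    dict(type='Collect',
         keys=['img', 'target', 'target_weight'],
         meta_keys=['image_file', 'joints_3d', 'joints_3d_visible', 'center',
                    'scale', 'rotation', 'bbox_score', 'flip_pairs']),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='MergeWithMask'),  # same step, so val inputs match training
    dict(type='ToTensor'),
    dict(type='NormalizeTensor',
         mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    dict(type='Collect',
         keys=['img'],
         meta_keys=['image_file', 'center', 'scale', 'rotation', 'bbox_score',
                    'flip_pairs']),
]
```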

rubeea commented 3 years ago

> Data augmentation aims to make the distribution of the training data similar to that of the test data. Popular data augmentation tricks are random shift, flip, and rotation. The idea is to transform the training set so that it covers all possible cases in the test set.
>
> "The method simply merges the image with its mask image": if I understand correctly, your method changes the input format. Its purpose is not to mimic the distribution of the test set, so it is not a general data augmentation technique.
>
> To achieve good performance, we have to make the distributions of the training set and the test set as similar as possible. That is why you obtain good performance when you use it in the val pipeline.

@jin-s13 yes, I get your point, but if the test set itself is very small or limited, don't you think it is a good idea to augment both the train and test sets, since it improves the overall evaluation metrics?

innerlee commented 3 years ago

Yes, it is test-time augmentation.

rubeea commented 3 years ago

> Yes, it is test-time augmentation.

Ok, noted. Thanks for your reply.