google-research-datasets / Objectron

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box, which describes the object's position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.

Potential issue in the evaluation code #48

Open Uio96 opened 3 years ago

Uio96 commented 3 years ago

Thanks a lot for this great dataset.

My colleague @swtyree and I have taken a close look at your evaluation code and found some potential issues in it.

  1. The first issue is about the visibility label.

Although the dataset provides visibility values, the parser does not use the computed index to filter them. As a result, the lengths may not match: label[VISIBILITY] still includes entries for objects below the visibility threshold, while label[LABEL_INSTANCE] does not.

Before:

    label[VISIBILITY] = visibilities
    index = visibilities > self._vis_thresh

I think it should be:

    index = visibilities > self._vis_thresh
    label[VISIBILITY] = visibilities[index]

Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/parser.py#L50-L53
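For concreteness, here is a minimal sketch of the behavior I would expect after the fix (the key strings and the threshold value are placeholders, not the actual constants in parser.py):

```python
import numpy as np

# Sketch only: filter the visibility array with the same boolean index used
# for the other per-instance labels, so the entry lengths stay consistent.
VISIBILITY = 'visibility'       # placeholder key
LABEL_INSTANCE = 'instance'     # placeholder key

def build_label(visibilities, instances, vis_thresh=0.1):
    """Keeps only instances above the visibility threshold, consistently."""
    visibilities = np.asarray(visibilities)
    index = visibilities > vis_thresh                      # visible instances only
    return {
        VISIBILITY: visibilities[index],                   # filtered, not the raw array
        LABEL_INSTANCE: [x for x, keep in zip(instances, index) if keep],
    }

# One instance falls below the threshold and is dropped from both entries.
label = build_label(visibilities=[1.0, 0.0, 1.0], instances=['a', 'b', 'c'])
assert len(label[VISIBILITY]) == len(label[LABEL_INSTANCE])  # both are 2
```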

  2. The second issue is with the calculation of average precision.

I found that the testing order of the images would affect the final result.

The standard procedure in classification/segmentation works includes an important step that sorts the results by the predicted confidence; see https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L243-L246. However, I did not find this step in your code. I am not sure whether you assumed the evaluated methods would do it somewhere else, or whether you simply fixed the order of the test images.

Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/metrics.py#L86-L98

It is similar to the process used in pascal_voc_evaluation: https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L290-L299
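To illustrate what I mean, here is a minimal sketch of a VOC-style AP computation with the sorting step included (names are illustrative, not taken from the Objectron code):

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_gt):
    """Pools detections, sorts them by confidence, then accumulates TP/FP."""
    confidences = np.asarray(confidences, dtype=float)
    is_true_positive = np.asarray(is_true_positive, dtype=bool)

    order = np.argsort(-confidences)          # the sorting step in question
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(~is_true_positive[order])

    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)

    # Step-wise integration of the precision-recall curve.
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum(precision * (recall - recall_prev)))

# Shuffling the inputs gives the same AP because of the argsort; without it,
# the cumulative sums (and therefore AP) would depend on the input order.
ap = average_precision(confidences=[0.9, 0.6, 0.8],
                       is_true_positive=[True, False, True],
                       num_gt=2)
```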

I am looking forward to your reply. Thank you so much.

ahmadyan commented 3 years ago

Hi Yunzhi, Thanks for the detailed feedback. I appreciate it.

  1. Regarding visibility: we set the visibility to 1.0 for almost all instances, so you can pretty much ignore it (there are a few instances of 0.0 where the object is outside the frame; if you want to see how it is actually calculated, see #37 ). So the threshold is not used in 3D object detection.

Furthermore, it doesn't make sense to me for the length of label['VISIBILITY'] to match the other labels (such as label[LABEL_INSTANCE_3D], etc.). It is the list of visibility values for all labels, so we don't want to drop the entries that are set to 0. Later, when we actually check for visibility here, we are careful to skip those invisible object instances. Let me know what you think and what the effect of applying your change to the output would be.

  2. As you noted, the eval code is based on the reference MATLAB code of the Pascal VOC evaluation. However, our model is a two-stage network. First we detect the object's crop (based on confidence, using MobileNet or another detector), and then we pass the crop to a second network that estimates the pose (the Objectron models). In the second stage, we assume the object exists in the crop with probability 1.0 and the network estimates the keypoints. We do not predict any confidence here (the network assumes the object is within the bounding box), so we do not need to sort the predictions.
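In pseudocode, the evaluation flow we assume looks roughly like this (the function and method names are placeholders, not the actual Objectron code):

```python
# Rough sketch of the two-stage setup described above; `detector`,
# `keypoint_model`, and `evaluator` are placeholder callables.
def evaluate_two_stage(images, detector, keypoint_model, evaluator):
    for image in images:
        # Stage 1: an off-the-shelf detector proposes crops; its confidence
        # is only used to pick which crops to keep.
        for crop in detector(image):
            # Stage 2: the pose network predicts keypoints for the crop and
            # outputs no confidence of its own, so every prediction is
            # effectively treated as having confidence 1.0 and no sorting
            # by confidence is needed.
            keypoints_3d = keypoint_model(crop)
            evaluator.add_prediction(keypoints_3d)
```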

Now, if the above eval doesn't work for you, or your model predicts confidence and you need that to be included in the evaluation, let me know and we can accommodate it, or you can create a pull request.

guthasaibharathchandra commented 1 year ago

Hi, I have recently noticed that in the evaluation code, recall is computed by directly dividing the true positives (over all predictions) by the total number of ground-truth instances in all images. That is fine in itself, but consider how the true positives are counted: if my model predicts multiple bounding boxes that all match the same ground-truth instance, the code counts every one of them as a true positive, while actually only one of them should be a true positive and the other predictions matching the same ground-truth instance should be counted as false positives. Otherwise my model could predict literally 100 bboxes that match one ground-truth instance, and if the number of ground-truth instances is smaller, the recall value would be greater than 1, which makes no sense. I hope my question was clear. I'm looking forward to your answer! Thanks!
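For concreteness, the matching I had in mind looks roughly like this (a sketch only, not the actual Objectron code):

```python
import numpy as np

# Illustrative sketch: each ground-truth box may be claimed by at most one
# prediction, so duplicate detections of the same object count as false
# positives and recall can never exceed 1.
def count_tp_fp(iou_matrix, iou_thresh=0.5):
    """iou_matrix: [num_predictions, num_gt] array of (3D) IoU values."""
    num_pred, num_gt = iou_matrix.shape
    if num_gt == 0:
        return 0, num_pred                    # every prediction is a false positive
    gt_matched = np.zeros(num_gt, dtype=bool)
    tp = fp = 0
    for p in range(num_pred):                 # ideally iterated in descending confidence
        g = int(np.argmax(iou_matrix[p]))
        if iou_matrix[p, g] >= iou_thresh and not gt_matched[g]:
            gt_matched[g] = True              # this ground truth is now claimed
            tp += 1
        else:
            fp += 1                           # duplicate or low-overlap prediction
    return tp, fp

# Example: two predictions overlapping the same single ground truth.
tp, fp = count_tp_fp(np.array([[0.9], [0.8]]))
assert (tp, fp) == (1, 1)                     # recall = 1/1 rather than 2/1
```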