Uio96 opened this issue 3 years ago
Hi Yunzhi, Thanks for the detailed feedback. I appreciate it.
Furthermore, it doesn't make sense to me for the length of label['VISIBILITY'] to match the other labels (such as label[LABEL_INSTANCE_3D], etc.). It is the list of visibilities for all labels, so we don't want to drop the entries that are set to 0. Later, when we actually check for visibility here, we are careful to skip those invisible object instances. Let me know what you think, and what the effects of applying your change on the output would be.
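To illustrate what I mean by skipping them, here is a minimal sketch of the check at evaluation time (the names and the 0.1 threshold are placeholders, not the exact ones used in the evaluation code):

```python
VISIBILITY_THRESHOLD = 0.1  # placeholder threshold, for illustration only

def skip_invisible(instances, visibilities):
    """Keep only the instances whose visibility passes the threshold.

    `instances` and `visibilities` are assumed to be parallel lists of the
    same length; this is a sketch, not the exact evaluation logic.
    """
    return [
        instance
        for instance, visibility in zip(instances, visibilities)
        if visibility >= VISIBILITY_THRESHOLD
    ]
```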
Now, if the above eval doesn't work for you, or your model predicts confidence and you need that to be included in the evaluation, let me know and we can accommodate it, or you can create a pull request.
Hi, I have recently noticed that in the evaluation code, recall is computed by dividing the number of true positives (over all predictions) by the total number of instances in all images. That is fine in itself, but consider how the true positives are counted: if my model predicts multiple bounding boxes that match the same ground-truth instance, all of those predictions are counted as true positives by the code, whereas the number of true positives should really be 1, and the remaining predictions matching that same ground-truth instance should be counted as false positives. Otherwise my model could predict literally 100 bboxes that match one gt instance, and if the number of gt instances is smaller, the recall value becomes greater than 1, which makes no sense. I hope my question is clear. I'm looking forward to your answer! Thanks!
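For illustration, this is roughly the one-to-one matching I have in mind (just a sketch, not the repository's code; I am assuming an IoU matrix between predictions and ground-truth boxes is available):

```python
import numpy as np

def count_tp_fp(ious, iou_threshold=0.5):
    """Count true/false positives with one-to-one matching.

    ious: [num_predictions, num_gt] IoU matrix; predictions are assumed to be
    already ordered (e.g. by confidence). Each ground-truth instance can be
    matched by at most one prediction, so duplicates become false positives.
    """
    num_pred, num_gt = ious.shape
    gt_matched = np.zeros(num_gt, dtype=bool)
    tp = fp = 0
    for p in range(num_pred):
        best_gt = int(np.argmax(ious[p])) if num_gt > 0 else -1
        if best_gt >= 0 and ious[p, best_gt] >= iou_threshold and not gt_matched[best_gt]:
            gt_matched[best_gt] = True
            tp += 1
        else:
            # No sufficient overlap, or the ground truth was already claimed
            # by an earlier prediction -> false positive.
            fp += 1
    return tp, fp  # recall = tp / num_gt is then at most 1
```

With this matching, recall can never exceed 1 even if the model predicts many boxes around the same instance.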
Thanks a lot for this great dataset.
My colleague @swtyree and I have taken a close look at your evaluation code and found some potential issues in it.
Although the index is obtained, you did not use it to extract the corresponding instances. As a result, the lengths may not match: label[VISIBILITY] includes the entries for objects that are below the visibility threshold, while label[LABEL_INSTANCE] does not. I think the same index should also be applied to label[VISIBILITY]; a rough sketch of what I mean follows the code link below.
Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/parser.py#L50-L53
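Roughly, the change I have in mind looks like this (a sketch only; the variable names differ from the actual parser.py):

```python
import numpy as np

VISIBILITY = 'visibility'     # placeholder key, the real constant differs
VISIBILITY_THRESHOLD = 0.1    # assumed threshold, for illustration

def filter_visibility(label, visibilities):
    """Apply the same visibility index to label[VISIBILITY] that is applied
    to the other per-instance fields, so their lengths stay consistent."""
    visibilities = np.asarray(visibilities)
    visible_idx = np.where(visibilities > VISIBILITY_THRESHOLD)[0]
    label[VISIBILITY] = visibilities[visible_idx]
    return label
```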
I found that the testing order of the images would affect the final result.
The usual procedure in classification/segmentation works includes an important step that sorts the results by the predicted confidence. See https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L243-L246. However, I did not find it in your code. I am not sure whether you assumed the tested methods would do that somewhere else, or whether you simply fixed the order of the test images.
Here is the corresponding part from the evaluation code: https://github.com/google-research-datasets/Objectron/blob/master/objectron/dataset/metrics.py#L86-L98
It is similar to the process used in pascal_voc_evaluation: https://github.com/ShawnNew/Detectron2-CenterNet/blob/master/detectron2/evaluation/pascal_voc_evaluation.py#L290-L299
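For reference, here is a minimal sketch of the sorting step I am referring to (in the spirit of the Pascal VOC evaluation linked above; the names are illustrative and not part of Objectron's API):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Sort detections by confidence, then accumulate TP/FP to get the PR curve.

    scores: per-prediction confidences; is_true_positive: 0/1 matching outcome
    for each prediction; num_gt: total number of ground-truth instances.
    """
    order = np.argsort(-np.asarray(scores))            # descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Area under the precision-recall curve (trapezoidal approximation of AP).
    return np.trapz(precision, recall)
```

Without the sort, the cumulative TP/FP counts (and hence the curve) depend on whatever order the images happen to be processed in.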
I am looking forward to your reply. Thank you so much.