roboflow / notebooks

Examples and tutorials on using SOTA computer vision models and techniques. Learn everything from old-school ResNet, through YOLO and object-detection transformers like DETR, to the latest models like Grounding DINO and SAM.
https://roboflow.com/models

Potential bug in mAP computation of Florence-2 fine-tuning notebook #294

Open patel-zeel opened 3 months ago

patel-zeel commented 3 months ago

Notebook name

Fine-tuning Florence-2 on Object Detection Dataset

Bug

Predictions from Florence-2 fine-tuned model look like the following:

[Detections(xyxy=array([[ 52.8    , 237.76   , 169.28   , 470.08   ],
        [373.44   , 113.6    , 512.32   , 358.08   ],
        [161.59999, 330.56   , 301.75998, 585.27997],
        [311.36   , 360.     , 447.03998, 616.64   ],
        [173.12   ,  14.4    , 303.03998, 253.12   ]], dtype=float32), mask=None, confidence=array([1., 1., 1., 1., 1.]), class_id=array([34, 50, 46,  2, 33]), tracker_id=None, data={'class_name': array(['9 of hearts', 'queen of hearts', 'king of hearts', '10 of hearts',
        '9 of diamonds'], dtype='<U15')}),
 Detections(xyxy=array([[3.3056000e+02, 4.2559998e+01, 5.1679999e+02, 2.0703999e+02],
        [2.0128000e+02, 8.2239998e+01, 3.8112000e+02, 3.2351999e+02],
        [3.1999999e-01, 1.2959999e+02, 2.6719998e+02, 4.1312000e+02],
        [1.9808000e+02, 1.7375999e+02, 4.8863998e+02, 4.9887997e+02]],
       dtype=float32), mask=None, confidence=array([1., 1., 1., 1.]), class_id=array([16, 24, 32, 28]), tracker_id=None, data={'class_name': array(['5 of clubs', '7 of clubs', '9 of clubs', '8 of clubs'],
       dtype='<U10')}),
 Detections(xyxy=array([[369.6    , 234.56   , 517.44   , 490.56   ],
        [ 87.36   , 163.51999, 255.04   , 402.24   ]], dtype=float32), mask=None, confidence=array([1., 1.]), class_id=array([35, 44]), tracker_id=None, data={'class_name': array(['9 of spades', 'king of clubs'], dtype='<U17')}),
 Detections(xyxy=array([[ 56.     , 228.79999, 331.84   , 636.48   ]], dtype=float32), mask=None, confidence=array([1.]), class_id=array([31]), tracker_id=None, data={'class_name': array(['8 of spades'], dtype='<U13')})]

It seems that the confidence score is always 1. Wouldn't this cause an issue in creating the precision-recall curve followed by computing mAP?
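To illustrate the concern, here is a minimal pure-NumPy sketch (the `pr_points` helper is hypothetical, not from the notebook): sweeping confidence thresholds over varied scores yields several precision-recall operating points, while uniform scores of 1.0 collapse the curve to a single point, which makes the area under it largely meaningless.

```python
import numpy as np

def pr_points(confidences, is_true_positive, num_gt):
    """Sweep confidence thresholds and collect distinct (precision, recall) points."""
    points = set()
    for threshold in np.unique(confidences):
        keep = confidences >= threshold
        tp = int(np.sum(is_true_positive[keep]))
        fp = int(np.sum(keep)) - tp
        precision = tp / (tp + fp)
        recall = tp / num_gt
        points.add((round(precision, 3), round(recall, 3)))
    return points

# Varied scores: several distinct operating points on the PR curve.
scores = np.array([0.9, 0.8, 0.6, 0.4])
tps = np.array([True, True, False, True])
print(pr_points(scores, tps, num_gt=4))   # four operating points

# Uniform scores (confidence = 1.0 everywhere): every detection survives
# every threshold, so the curve collapses to a single operating point.
print(pr_points(np.ones(4), tps, num_gt=4))
```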

Environment

NA

Minimal Reproducible Example

NA

Additional

NA

Are you willing to submit a PR?

SkalskiP commented 3 months ago

Yup, it is not perfect. We are trying to apply traditional computer vision metrics to models that exist outside the traditional computer vision space. Florence-2 is a VLM, and when a VLM performs object detection, all of the boxes come back with the same probability: confidence 100%.

patel-zeel commented 3 months ago

Thank you for your response, @SkalskiP. I was wondering what a fair comparison would be in such cases. For example, should we also convert traditional models' confidence scores to 1 before computing mAP?
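As a sketch of that idea, flattening a traditional detector's scores before evaluation could look like the following. The `Detections` dataclass below is a minimal hypothetical stand-in for supervision's `sv.Detections` (which stores `confidence` as a NumPy array), not the library itself.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Detections:
    # Hypothetical stand-in for supervision's sv.Detections.
    xyxy: np.ndarray
    confidence: np.ndarray
    class_id: np.ndarray

detections = Detections(
    xyxy=np.array([[52.8, 237.76, 169.28, 470.08]], dtype=np.float32),
    confidence=np.array([0.87], dtype=np.float32),
    class_id=np.array([34]),
)

# Flatten the traditional model's scores to match a VLM's uniform
# confidence before computing mAP, so both models are ranked identically.
detections.confidence = np.ones_like(detections.confidence)
print(detections.confidence)  # [1.]
```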

SkalskiP commented 3 months ago

I don't know how to do it right now. However, given the growth of VLMs over the past 1-2 years, I think this will become an important issue as we measure VLM performance.

SkalskiP commented 3 months ago

@patel-zeel, your question motivated me to reach out to Lucas Beyer, who leads the team behind PaliGemma. It looks like there is no better way to do it than mAP with confidence = 100%. He suggested using both AP and AR for a more diverse comparison.
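The AP + AR pairing can be sketched as follows. This is a rough, single-class, greedy-matching approximation of COCO-style average recall (the real metric also caps detections per image and averages over classes); `average_recall` and its matching logic are illustrative assumptions, not the notebook's implementation. Unlike a PR curve, recall does not depend on ranking by confidence, so it stays informative even when every score is 1.0.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two xyxy boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_recall(preds, gts, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Recall averaged over IoU thresholds, greedy one-to-one matching."""
    recalls = []
    for t in thresholds:
        matched, used = 0, set()
        for gt in gts:
            for i, p in enumerate(preds):
                if i not in used and iou(p, gt) >= t:
                    used.add(i)
                    matched += 1
                    break
        recalls.append(matched / len(gts))
    return float(np.mean(recalls))

preds = [[0.0, 0.0, 10.0, 10.0]]
gts = [[0.0, 0.0, 10.0, 10.0], [20.0, 20.0, 30.0, 30.0]]
print(average_recall(preds, gts))  # 0.5: one of two ground truths recovered
```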

patel-zeel commented 3 months ago

@SkalskiP Thank you for the update and follow-up on this! Great to hear the feedback from the PaliGemma lead.

> He suggested using both AP and AR for a more diverse comparison.

If I understand correctly, it means,

That sounds reasonable and motivates me to look even deeper into this.

SkalskiP commented 3 months ago

That's what I'll do for now. Yup.