I used your implementation to train a model on a custom dataset consisting of only 10 images for 500 epochs, during which I expected the model to memorize the provided images. I then passed the same image I used for training, together with the resulting weights, to the official Grounding DINO inference script to test its performance.
The model exhibited promising results by correctly drawing bounding boxes and accurately predicting the class. However, I observed a notable discrepancy in the confidence scores (as shown in the attached image). Despite the model's correct predictions, the confidence scores were unexpectedly low.
I am wondering if you could kindly provide any guidance or suggestions on why there might be such a difference between the model's predictions and the confidence scores. Any insights would be greatly appreciated. Thank you so much for your time and support :))
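For context on why this can happen, here is a minimal sketch of the kind of score computation used in open-set grounding detectors, where each box's confidence is taken as the maximum sigmoid-activated similarity logit over the text tokens of the prompt. The logit values below are made up for illustration; the point is that a box can match the correct phrase (its argmax lands on the right token) while its absolute score still sits well below a strict threshold like 0.5 after limited fine-tuning:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw similarity logits between 3 predicted boxes and
# 4 text tokens (illustrative values, not real model outputs).
logits = np.array([
    [ 1.2, -2.0, -3.1, -2.5],   # box 0: strongly matches token 0
    [-0.4, -1.8, -2.9, -2.2],   # box 1: weakly matches token 0
    [-3.0, -3.2, -2.8, -3.1],   # box 2: background, no match
])

# Per-box confidence: max sigmoid-activated logit over text tokens.
scores = sigmoid(logits).max(axis=1)

# Both box 0 and box 1 point at the correct token, but only box 0
# clears a strict 0.5 threshold; a looser one (e.g. 0.35) keeps both.
box_threshold = 0.35
keep = scores > box_threshold
print(scores.round(3), keep)
```

So a low confidence score does not necessarily mean the localization or classification is wrong; it can simply mean the text-to-region logits are not yet saturated, which is common when fine-tuning on very few samples.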
10 images for 500 epochs may not be a good setting; training on so few samples for so long invites overfitting. By the way, something like 500–1000 images for 5–10 epochs sounds more reasonable.
I think a score of 0.5 for `grasper` is acceptable.
What about trying a few-shot approach instead of fine-tuning directly? Maybe try OWLv2?
If you just want to do an OD task (closed-set object detection, with only very few samples), perhaps you can fine-tune some other networks (Co-DETR, EVA, DINO, InternImage, etc.) or try a few-shot learning method.