I used your implementation to train a model on a custom dataset consisting of only 10 images for 500 epochs, during which I expected the model to memorize the provided images. I then passed the same image I used for training, together with the resulting weights, to the official Grounding DINO inference script to test its performance.
The model exhibited promising results by correctly drawing bounding boxes and accurately predicting the class. However, I observed a notable discrepancy in the confidence scores (as shown in the attached image). Despite the model's correct predictions, the confidence scores were unexpectedly low.
I am wondering if you could kindly provide any guidance or suggestions on why there might be such a difference between the model's predictions and the confidence scores. Any insights would be greatly appreciated. Thank you so much for your time and support :))
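For context on why this can happen, here is a minimal sketch of the kind of score computation used in open-set grounding detectors, where each box's confidence is taken as the maximum sigmoid-activated similarity logit over the text tokens of the prompt. The logit values below are made up for illustration; the point is that a box can match the correct phrase (its argmax lands on the right token) while its absolute score still sits well below a strict threshold like 0.5 after limited fine-tuning:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw similarity logits between 3 predicted boxes and
# 4 text tokens (illustrative values, not real model outputs).
logits = np.array([
    [ 1.2, -2.0, -3.1, -2.5],   # box 0: strongly matches token 0
    [-0.4, -1.8, -2.9, -2.2],   # box 1: weakly matches token 0
    [-3.0, -3.2, -2.8, -3.1],   # box 2: background, no match
])

# Per-box confidence: max sigmoid-activated logit over text tokens.
scores = sigmoid(logits).max(axis=1)

# Both box 0 and box 1 point at the correct token, but only box 0
# clears a strict 0.5 threshold; a looser one (e.g. 0.35) keeps both.
box_threshold = 0.35
keep = scores > box_threshold
print(scores.round(3), keep)
```

So a low confidence score does not necessarily mean the localization or classification is wrong; it can simply mean the text-to-region logits are not yet saturated, which is common when fine-tuning on very few samples.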
10 images for 500 epochs may not be a good setting; training on so few samples for so long invites overfitting. By the way, something like 500–1000 images for 5–10 epochs sounds more reasonable.
I think a score of 0.5 for `grasper` is acceptable.
What about trying a few-shot approach instead of fine-tuning directly? Maybe try OWLv2?
If you just want to do an OD task (closed-set object detection, with only very few samples), perhaps you can fine-tune some other networks (Co-DETR, EVA, DINO, InternImage, etc.) or try a few-shot learning method.