longzw1997 / Open-GroundingDino

This is a third-party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.

Tokenizer decoder makes up class names during inference #40

Closed · Azure-107 closed 9 months ago

Azure-107 commented 9 months ago

Hello,

I have finetuned a model on my custom dataset using your implementation of grounding DINO. I am currently testing its performance by calling the inference function on unseen data. However, I noticed that the prediction function sometimes makes up non-existent class names that are not in the caption text input.

For example, when I used the caption "cadiere forceps . needle driver .", the results returned included classes like "cad forceps" or "##ps" as shown in the figure. I'm curious if you have any insights into why this might be happening. Thank you so much!

[Screenshot: inference output with predicted labels such as "cad forceps" and "##ps"]

longzw1997 commented 9 months ago

Which inference function are you using? This behavior appears to be caused by BERT tokenization, which splits your text into smaller subword segments.
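
For illustration, here is a minimal sketch (assuming the `bert-base-uncased` tokenizer from Hugging Face `transformers`, which Grounding DINO's text backbone uses) of how the caption gets split; the exact subwords depend on the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("cadiere forceps . needle driver .")
print(tokens)
# Rare words are split into subwords, e.g. "cadiere" -> ["cad", "##iere"]
# and "forceps" -> ["force", "##ps"], which is consistent with spurious
# labels like "cad forceps" or "##ps" appearing in the output.
```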

Azure-107 commented 9 months ago

I am following the suggestions in #17 and using the `predict` function from the official Grounding DINO implementation.

longzw1997 commented 9 months ago

This happens because the BERT tokenizer's vocabulary does not contain your label words, so BERT splits them into smaller subwords. In the official code, if you don't provide the `token_spans` parameter, it directly matches the label against all text tokens and outputs the result (line 112). You can either use the `token_spans` parameter or follow our code's post-processing approach, which treats each label as a whole and generates `pos_maps` (line 685).
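
For reference, a minimal sketch of that span-based post-processing, assuming each phrase appears verbatim in the caption; the helper `build_pos_map` and the dummy logits below are illustrative, not the repo's actual API:

```python
import torch
from transformers import AutoTokenizer

def build_pos_map(tokenizer, caption, phrases):
    """Return a (num_phrases, num_tokens) binary map marking which
    caption tokens belong to each phrase."""
    enc = tokenizer(caption, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc["offset_mapping"][0].tolist()  # (start, end) char span per token
    pos_map = torch.zeros(len(phrases), len(offsets))
    for i, phrase in enumerate(phrases):
        start = caption.find(phrase)  # assumes the phrase occurs once, verbatim
        end = start + len(phrase)
        for j, (s, e) in enumerate(offsets):
            if s >= start and e <= end and s != e:  # skip special tokens (0, 0)
                pos_map[i, j] = 1.0
    return pos_map

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = "cadiere forceps . needle driver ."
phrases = ["cadiere forceps", "needle driver"]
pos_map = build_pos_map(tokenizer, caption, phrases)

# Given per-query token logits from the model (num_queries, num_tokens),
# score each whole phrase instead of individual subword tokens:
logits = torch.rand(900, pos_map.shape[1])  # dummy logits for illustration
phrase_scores = logits @ pos_map.t() / pos_map.sum(dim=1)  # mean over each span
best_phrase = phrase_scores.argmax(dim=1)  # whole-label prediction per query
```

Scoring each label over its full token span prevents a single subword like `##ps` from being emitted as a class name on its own.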