Closed — Azure-107 closed this issue 9 months ago
Which inference function are you using? This issue appears to be caused by BERT tokenization, which splits your text into smaller sub-tokens.
I am following the suggestions in #17 and using the `predict` function from the official Grounding DINO implementation.
The reason is that the BERT tokenizer's vocabulary does not contain your label words, so BERT splits them into smaller subwords. In the official code, if you don't provide the `token_spans` parameter, it directly matches the label against all text tokens and outputs results at line 112. You can either use the `token_spans` parameter or follow our code's post-processing approach, which treats each label as a whole and generates pos_maps at line 685.
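To make the mechanism concrete, here is a toy sketch (not the real BERT tokenizer, and the vocabulary below is invented for the demo) of greedy WordPiece splitting, which is why an out-of-vocabulary label like "cadiere forceps" can decompose into fragments such as `##ps`, together with a minimal pos_map-style grouping that maps each `.`-separated label back to the indices of its subword tokens:

```python
# Toy WordPiece demo: why out-of-vocabulary labels split into subwords,
# and how a pos_map can regroup those subwords per label.
# TOY_VOCAB is invented for illustration; the real BERT vocab is much larger.

TOY_VOCAB = {"cad", "##iere", "force", "##ps", "needle", "driver", "."}

def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation-piece prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def tokenize(caption, vocab):
    tokens = []
    for word in caption.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

def label_pos_map(caption, vocab):
    """Map each '.'-separated label to the indices of its subword tokens,
    so a label can be scored as a whole instead of per token."""
    pos_map, idx = [], 0
    for label in caption.split("."):
        label = label.strip()
        if not label:
            continue
        span = []
        for word in label.split():
            for _ in wordpiece(word, vocab):
                span.append(idx)
                idx += 1
        idx += 1  # skip the '.' separator token
        pos_map.append((label, span))
    return pos_map

caption = "cadiere forceps . needle driver ."
print(tokenize(caption, TOY_VOCAB))
# ['cad', '##iere', 'force', '##ps', '.', 'needle', 'driver', '.']
print(label_pos_map(caption, TOY_VOCAB))
# [('cadiere forceps', [0, 1, 2, 3]), ('needle driver', [5, 6])]
```

Without the grouping step, the model matches phrase queries against individual tokens, so fragments like `cad` or `##ps` can surface as "class names"; aggregating per label span avoids that.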
Hello,
I have fine-tuned a model on my custom dataset using your implementation of Grounding DINO. I am currently testing its performance by calling the inference function on unseen data. However, I noticed that the prediction function sometimes makes up non-existent class names that are not in the caption text input.
For example, when I used the caption "cadiere forceps . needle driver .", the results returned included classes like "cad forceps" or "##ps" as shown in the figure. I'm curious if you have any insights into why this might be happening. Thank you so much!