Closed YueLiao closed 2 years ago
Thanks for your interest in our work.
In our experiments, we focus on studying how to better fine-tune the CLIP models and leverage the language priors. Therefore, we chose some widely used frameworks like Semantic FPN and Mask R-CNN, and we didn't test the performance on anchor-free detectors or DETR. I think it may not be simple to directly use the score maps s or the text encoder to perform the object detection task, since there is a significant gap between the pre-training tasks, which focus on semantic information (e.g., categories), and the object localization tasks of a center/corner head or a DETR decoder.
Thx for your reply~
Thx for your interesting and nice work!
I would like to know whether you have conducted experiments on anchor-free detectors, e.g., a) adopting the score maps s as the centre/corner point heatmaps to obtain bounding-box predictions, or b) employing pre-model prompting for a transformer detector directly (using text embeddings to initialize the classifier), like DETR.
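To make option b) concrete, here is a minimal sketch of what "use text embeddings to initialize the classifier" could look like. It is not code from the repository: the function names, the random stand-in embeddings, and the temperature value are all assumptions for illustration, and a real setup would take the embeddings from the CLIP text encoder and the query features from the detector's decoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def init_classifier_from_text(text_embeddings):
    """Hypothetical helper: one classifier weight row per class,
    set to the L2-normalized text embedding of that class."""
    return [l2_normalize(t) for t in text_embeddings]


def classify(query_feature, class_weights, temperature=0.07):
    """Cosine similarity between a query feature and each class weight,
    divided by a temperature (0.07 is an illustrative choice)."""
    q = l2_normalize(query_feature)
    return [sum(a * b for a, b in zip(q, w)) / temperature for w in class_weights]


# Random stand-ins for text-encoder outputs (assumption: 3 classes, 8 dims).
random.seed(0)
num_classes, dim = 3, 8
text_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)
]
weights = init_classifier_from_text(text_embeddings)

# A query feature that points in the same direction as class 1's embedding.
query = [2.0 * x for x in text_embeddings[1]]
logits = classify(query, weights)
predicted = max(range(num_classes), key=lambda k: logits[k])
```

This only covers the classification branch. Box regression would still have to be learned by the detector itself, which is where the gap between semantic pre-training and localization mentioned in the reply shows up.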