Closed Edwardmark closed 1 year ago
Hi @Edwardmark,
Thank you for showing interest in our work.
Our approach addresses the challenges of open-vocabulary object detection, where the goal is to enable the model to recognize a larger set of objects beyond the known/base classes without requiring extensive training on these new categories. This is particularly important because, in standard object detection, rare categories are difficult to train for: object categories follow a long-tailed (Zipfian) distribution. Scaling up detection vocabularies therefore requires more data, which is expensive and time-consuming to annotate.
To overcome this challenge, we propose a weakly-supervised approach that uses image-text/label pairs from large classification datasets or internet sources to expand the detection vocabulary. This method is intuitive and cost-effective, as it leverages existing resources rather than relying on extensive manual annotation.
We haven't evaluated our model using the original ground truths. If we were to do so, this would be equivalent to a fully-supervised traditional object detection approach.
Thank you. Please feel free to ask if you have any additional questions.
@hanoonaR Thank you for your response. However, I would like to understand why, for detection data (i.e., LVIS), you have not used the ground-truth (GT) boxes for knowledge distillation to align the Region of Interest (RoI) feature embeddings with the CLIP features. This would not affect the open-vocabulary claim of your paper, since the scale-up comes mainly from classification data and internet image/text pairs. The GTs are already being used for box regression (in a supervised way), so why not utilize them for knowledge distillation instead of using pseudo labels? While using pseudo labels is suitable for classification data, it may not be as effective for LVIS data, as each image may have more than 10 GTs. In your implementation, you have used MViT to generate the top 5 proposals to align the features, potentially losing many data points in LVIS. Please let me know if I missed anything. Thank you.
@hanoonaR One more question: in your implementation you use ResNet-50 as the backbone, while Detic uses a Swin Transformer. Did you try a bigger model such as Swin? I also noticed that on Objects365 your ResNet-50 model surpasses Detic's Swin model.
Hi @Edwardmark,
1) Regarding the question of using pseudo labels for knowledge distillation, the reason is that knowledge distillation is also used in the second stage, when the model is trained with pseudo-labels to expand the vocabulary. In this case, if the ground truths were used for distillation, we would be limited to selecting only the base classes. Using pseudo-labels instead allows selecting the top-k boxes from all classes, providing a more diverse set of boxes to distil (and ensuring no leakage from ground truths for novel categories). Additionally, when ground-truth boxes are used, we cannot select the top "K" objects from the scene for distillation, and a random selection would have to be made instead. The choice of "K" for LVIS is a hyperparameter of the distillation loss and may not directly relate to "the actual number of samples in the scene".
2) Apologies, but we have not evaluated our model on the swin-transformer backbone. The performance on Objects365 is the cross-dataset (zero-shot) evaluation of our best LVIS model, which uses the Detic-based CenterNet model.
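The top-K selection described in point 1 can be sketched roughly as follows. This is only an illustrative sketch, not the code from our repository: the function names `select_top_k` and `l1_distillation_loss`, the toy scores and embeddings, and the use of a plain L1 distance as the distillation loss are all assumptions made for the example.

```python
from typing import List, Sequence

Embedding = Sequence[float]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring class-agnostic proposals."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def l1_distillation_loss(region_embs: Sequence[Embedding],
                         clip_embs: Sequence[Embedding]) -> float:
    """Mean absolute difference between paired region and CLIP embeddings.

    L1 is an assumption for this sketch, not a statement about the paper.
    """
    total, count = 0.0, 0
    for region, clip in zip(region_embs, clip_embs):
        for a, b in zip(region, clip):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


# Hypothetical toy data: four proposals with objectness scores and
# 2-d embeddings (real embeddings would be much larger).
proposal_scores = [0.90, 0.10, 0.95, 0.40]
region_embs = [[1.0, 2.0], [0.0, 0.0], [0.5, 0.5], [3.0, 1.0]]
clip_embs = [[0.0, 4.0], [1.0, 1.0], [0.5, 1.5], [3.0, 1.0]]

K = 2  # hyperparameter of the distillation loss, independent of the GT count
keep = select_top_k(proposal_scores, K)
loss = l1_distillation_loss([region_embs[i] for i in keep],
                            [clip_embs[i] for i in keep])
```

Because the proposals are ranked by a class-agnostic score, the selected boxes are not restricted to base classes, and K can be fixed regardless of how many annotated objects an image contains.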
Thank you. Please feel free to ask if you have any additional questions.
@muzairkhattak @mmaaz60 @hanoonaR @salman-h-khan Hello, your work is amazing. Did you try using GT boxes instead of pseudo labels when training on detection data? Why do you use pseudo labels from MViT instead of the detection GT for knowledge distillation? Looking forward to your reply, thanks.