mlzxy / devit

Few shot vs Open-vocabulary #13

Open theodu opened 9 months ago

theodu commented 9 months ago

I can't work out the difference between your few-shot and open-vocabulary models. Your approach is based on image-only models, whereas open-vocabulary approaches rely on a text-based embedding of the category name, with the model pre-trained on text-image pairs. So what is the difference between your open-vocabulary and few-shot pipelines and trained models?

I am asking because the open-vocabulary/LVIS model (the one in the demo) gives me much better results than the few-shot one on the same test images with the same image context.

Thanks for your work!

mlzxy commented 9 months ago

Hi @theodu, thanks for your feedback. The short answer is there is no difference in pipelines between open-vocabulary and few-shot.

In the paper, I look at open-vocabulary and few-shot detection as serving the same objective: open-set object detection beyond a fixed category set. The two differ only in how a category is represented, with text for open-vocabulary and images for few-shot.
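
To make the point about category representation concrete, here is a minimal sketch, not the code from this repository. It assumes region features and category prototypes live in one shared embedding space and that a region is labelled by cosine similarity to the nearest prototype. Every name in it (`prototypes_from_images`, `prototypes_from_text`, `classify_region`) and the toy 3-d vectors are hypothetical, made up for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prototypes_from_images(support_features):
    """Few-shot style: one prototype per category, averaged over the
    features of a few example images (hypothetical helper)."""
    return {name: mean_vector(feats) for name, feats in support_features.items()}


def prototypes_from_text(text_embeddings):
    """Open-vocabulary style: one prototype per category, taken from an
    embedding of the category name (hypothetical helper)."""
    return dict(text_embeddings)


def classify_region(region_feature, prototypes):
    """Shared step: pick the category whose prototype is most similar."""
    return max(prototypes, key=lambda name: cosine(region_feature, prototypes[name]))


# Toy data, invented for illustration only.
few_shot = prototypes_from_images({
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
})
open_vocab = prototypes_from_text({
    "cat": [0.8, 0.2, 0.0],
    "dog": [0.1, 0.7, 0.2],
})
region = [0.9, 0.2, 0.0]

few_shot_label = classify_region(region, few_shot)
open_vocab_label = classify_region(region, open_vocab)
```

In this sketch only the function that builds the prototypes changes between the two settings; `classify_region` is identical for both.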

Under this general objective, I evaluate on both open-vocabulary and few-shot benchmarks rather than only few-shot ones. Honestly, the dataset formats of the two are almost identical. The open-vocabulary model performs much better because open-vocabulary detection has received far more research attention recently. I hope this answers your question.