shenyunhang / APE

[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
https://arxiv.org/abs/2312.02153
Apache License 2.0

Inquiry on the "gated cross-modality interaction" #27

Open Masaaki-75 opened 5 months ago

Masaaki-75 commented 5 months ago

Hi! Thanks for open-sourcing APE; it is fantastic! 👍

I am new to the field of open-vocabulary vision foundation models, and while going through your paper I had some questions about the "gated cross-modality interaction". I hope to get your insights on a few points.

I understand that the interaction between image features and text features in GLIP is computationally expensive, but I couldn't follow the part about the "all-zero token", quoted below:

Instead, an all-zero token Pzero serves as a special text embedding and inputs to the fusion module for all given vocabularies. In this situation, the fusion process is “static”, as no language information is injected into vision features. The Pzero could provide explicit instructions to recognize primitive concepts and slightly tune vision feature Vvoc and retain original language feature Pvoc.
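To make my confusion concrete, here is how I currently picture the two cases. This is just my own toy PyTorch sketch with made-up names (ToyFusion, v_dynamic, v_static), not code from GLIP or APE:

```python
import torch
import torch.nn as nn

class ToyFusion(nn.Module):
    """Toy cross-attention block standing in for a vision-language fusion module."""
    def __init__(self, dim=256):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vision_feat, text_feat):
        # vision tokens attend to text tokens, i.e. language info is injected into vision
        fused, _ = self.v2t(query=vision_feat, key=text_feat, value=text_feat)
        return vision_feat + fused

dim = 256
vision_feat = torch.randn(1, 100, dim)   # V_voc: image features
text_feat = torch.randn(1, 80, dim)      # P_voc: embeddings of the given vocabulary
zero_token = torch.zeros(1, 1, dim)      # P_zero: a single all-zero "text" token

fusion = ToyFusion(dim)
# GLIP-style "dynamic" fusion: the output depends on which vocabulary is given
v_dynamic = fusion(vision_feat, text_feat)
# "static" fusion as I read the paper: the same P_zero is fed in for any vocabulary,
# so vision features are only slightly tuned and no language information is injected
v_static = fusion(vision_feat, zero_token)
```

Is that roughly the right picture?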

shenyunhang commented 5 months ago

Sorry for the late response.

  1. As the all-zero token is different from other text tokens and does not provide any text information, the model can be made aware that it is performing OVD and OVS tasks.
  2. We only use this token for vocabulary prompts. It could also be used with sentence prompts, but that would have no effect (see the sketch after this list).
  3. deformable_detr_segm.py is the no-fusion model and deformable_detr_segm_vl.py is the fusion model. The all-zero token is self.name_prompt_fusion_feature; the corresponding code is here: https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm_vl.py#L158
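
Below is a minimal sketch of the gating described in points 1 and 2 above. It is simplified, and names such as gated_fusion_text and the tensor shapes are only illustrative; the actual implementation is in the file linked in point 3:

```python
import torch

hidden_dim = 256
# analogue of self.name_prompt_fusion_feature: one all-zero text token shared by all vocabularies
name_prompt_fusion_feature = torch.zeros(1, 1, hidden_dim)

def gated_fusion_text(prompt_type: str, text_embeddings: torch.Tensor) -> torch.Tensor:
    """Select what is fed to the vision-language fusion module.

    - "vocabulary": the all-zero token, so the fusion is "static" and no
      vocabulary-specific language information is injected into vision features.
    - "sentence": the real sentence embeddings, so grounding keeps the deep
      cross-modality interaction.
    """
    if prompt_type == "vocabulary":
        return name_prompt_fusion_feature.expand(text_embeddings.size(0), -1, -1)
    return text_embeddings

# usage
voc_embeds = torch.randn(2, 365, hidden_dim)   # e.g. category-name embeddings
sent_embeds = torch.randn(2, 20, hidden_dim)   # e.g. token embeddings of one sentence
print(gated_fusion_text("vocabulary", voc_embeds).shape)  # torch.Size([2, 1, 256])
print(gated_fusion_text("sentence", sent_embeds).shape)   # torch.Size([2, 20, 256])
```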