shenyunhang / APE

[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
https://arxiv.org/abs/2312.02153
Apache License 2.0

Inquiry on the "gated cross-modality interaction" #27

Open Masaaki-75 opened 5 months ago

Masaaki-75 commented 5 months ago

Hi! Thanks for open-sourcing APE; it is fantastic! 👍

I am new to the field of open-vocabulary vision foundation models, and while going through your paper I had some questions about the "gated cross-modality interaction". I hope to get your insights on a few points.

I understand that the interaction between image features and text features in GLIP is computationally expensive, but I couldn't follow the part about the "all-zero token", quoted below:

Instead, an all-zero token Pzero serves as a special text embedding and inputs to the fusion module for all given vocabularies. In this situation, the fusion process is “static”, as no language information is injected into vision features. The Pzero could provide explicit instructions to recognize primitive concepts and slightly tune vision feature Vvoc and retain original language feature Pvoc.
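To make my confusion concrete, here is how I currently picture the two cases. This is just my own toy PyTorch sketch with made-up names (ToyFusion, v_dynamic, v_static), not code from GLIP or APE:

```python
import torch
import torch.nn as nn

class ToyFusion(nn.Module):
    """Toy cross-attention block standing in for a vision-language fusion module."""
    def __init__(self, dim=256):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vision_feat, text_feat):
        # vision tokens attend to text tokens, i.e. language info is injected into vision
        fused, _ = self.v2t(query=vision_feat, key=text_feat, value=text_feat)
        return vision_feat + fused

dim = 256
vision_feat = torch.randn(1, 100, dim)   # V_voc: image features
text_feat = torch.randn(1, 80, dim)      # P_voc: embeddings of the given vocabulary
zero_token = torch.zeros(1, 1, dim)      # P_zero: a single all-zero "text" token

fusion = ToyFusion(dim)
# GLIP-style "dynamic" fusion: the output depends on which vocabulary is given
v_dynamic = fusion(vision_feat, text_feat)
# "static" fusion as I read the paper: the same P_zero is fed in for any vocabulary,
# so vision features are only slightly tuned and no language information is injected
v_static = fusion(vision_feat, zero_token)
```

Is that roughly the right picture?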

shenyunhang commented 5 months ago

Sorry for the late response.

  1. As the all-zero token is different from other text tokens and does not provide any text information, the model can be made aware that it is performing OVD and OVS tasks.
  2. We only use this token for vocabulary prompts. It could also be used with sentence prompts, but that would have no effect (see the sketch after this list).
  3. deformable_detr_segm.py is the no-fusion model and deformable_detr_segm_vl.py is the fusion model. The all-zero token is self.name_prompt_fusion_feature; the corresponding code is here: https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm_vl.py#L158
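
Below is a minimal sketch of the gating described in points 1 and 2 above. It is simplified, and names such as gated_fusion_text and the tensor shapes are only illustrative; the actual implementation is in the file linked in point 3:

```python
import torch

hidden_dim = 256
# analogue of self.name_prompt_fusion_feature: one all-zero text token shared by all vocabularies
name_prompt_fusion_feature = torch.zeros(1, 1, hidden_dim)

def gated_fusion_text(prompt_type: str, text_embeddings: torch.Tensor) -> torch.Tensor:
    """Select what is fed to the vision-language fusion module.

    - "vocabulary": the all-zero token, so the fusion is "static" and no
      vocabulary-specific language information is injected into vision features.
    - "sentence": the real sentence embeddings, so grounding keeps the deep
      cross-modality interaction.
    """
    if prompt_type == "vocabulary":
        return name_prompt_fusion_feature.expand(text_embeddings.size(0), -1, -1)
    return text_embeddings

# usage
voc_embeds = torch.randn(2, 365, hidden_dim)   # e.g. category-name embeddings
sent_embeds = torch.randn(2, 20, hidden_dim)   # e.g. token embeddings of one sentence
print(gated_fusion_text("vocabulary", voc_embeds).shape)  # torch.Size([2, 1, 256])
print(gated_fusion_text("sentence", sent_embeds).shape)   # torch.Size([2, 20, 256])
```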