Open Masaaki-75 opened 5 months ago
Sorry for the late response.
deformable_detr_segm.py is the no-fusion model; the fusion model is deformable_detr_segm_vl.py.
The all-zero token is self.name_prompt_fusion_feature. The corresponding code is here: https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm_vl.py#L158
Hi! Thanks for open-sourcing APE, it is fantastic! 👍
I am new to the field of open-vocabulary vision foundation models, and while going through your paper I had some questions about the "gated cross-modality interaction". I would appreciate your insights on a few points.
I understand that the interaction between image features and text features in GLIP is computationally expensive. But I couldn't follow the part about the "all-zero token", quoted: