microsoft / GLIP

Grounded Language-Image Pre-training

About the deep fusion module? #25

Closed PanXiebit closed 2 years ago

PanXiebit commented 2 years ago

Dear authors,

Thanks for presenting such great work.

In your paper, you conduct an ablation study on the deep fusion module (GLIP-T(A) vs. GLIP-T(B)) and demonstrate that deep fusion brings clear improvements (zero-shot: 42.9 -> 44.9, fine-tuned: 52.9 -> 53.8).

I have several questions about this module:

  1. In my reimplementation, I find that this 6-layer fusion module nearly doubles the amount of computation. Each fusion layer is composed of three sub-modules: VLFuse (bi-directional attention), DyConv (vision only), and BERTLayer (language only); a rough sketch of how I picture one such layer follows these questions. Have you run a more detailed ablation, for example keeping only VLFuse (bi-attention) and removing DyConv and BERTLayer?

  2. CoCa and ALBEF apply contrastive learning before fusion. Have you tried this align-before-fuse paradigm?
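For reference, here is a rough, simplified PyTorch sketch of how I picture one fusion layer. The module names, dimensions, and attention details below are my own stand-ins for illustration, not the repo's actual VLFuse/DyConv/BERTLayer code:

```python
import torch.nn as nn


class FusionLayerSketch(nn.Module):
    """One deep-fusion layer as I understand it (illustrative stand-in, not the repo code)."""

    def __init__(self, v_dim=256, l_dim=768, n_heads=8):
        super().__init__()
        # "VLFuse" stand-in: bi-directional cross-attention between the two streams.
        self.v2l_attn = nn.MultiheadAttention(v_dim, n_heads, kdim=l_dim, vdim=l_dim,
                                              batch_first=True)
        self.l2v_attn = nn.MultiheadAttention(l_dim, n_heads, kdim=v_dim, vdim=v_dim,
                                              batch_first=True)
        # "DyConv" stand-in: vision-only refinement (the real DyHead block is far richer).
        self.vision_block = nn.Sequential(nn.Conv1d(v_dim, v_dim, kernel_size=1), nn.ReLU())
        # "BERTLayer" stand-in: language-only refinement.
        self.lang_block = nn.TransformerEncoderLayer(l_dim, n_heads, batch_first=True)

    def forward(self, v_feat, l_feat):
        # v_feat: (B, Nv, v_dim) flattened visual features; l_feat: (B, Nl, l_dim) token features.
        # VLFuse: each modality attends to the other (the only cross-modal coupling).
        v_fused, _ = self.v2l_attn(v_feat, l_feat, l_feat)
        l_fused, _ = self.l2v_attn(l_feat, v_feat, v_feat)
        v_feat, l_feat = v_feat + v_fused, l_feat + l_fused
        # DyConv (vision only) and BERTLayer (language only) then refine each stream separately.
        v_feat = self.vision_block(v_feat.transpose(1, 2)).transpose(1, 2)
        l_feat = self.lang_block(l_feat)
        return v_feat, l_feat
```

In this view, only the bi-attention couples the two streams, while DyConv and BERTLayer refine each stream on its own, which is why I wonder whether all three parts are needed in every fusion layer.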

Haotian-Zhang commented 2 years ago

@PanXiebit Thank you for the questions and support! Please allow me to answer them below.

  1. The intuition behind the deep fusion block is to let the visual and text features attend to each other before the final region-word grounding contrastive loss (sketched after this list), so that the model learns richer, language-aware representations. We follow the original implementation of the DyHead paper: their Tiny model uses 6 DyConv modules, and we did not want to modify that part, both to verify the equivalence of our problem reformulation and to maintain the SoTA detection performance. To keep the architecture symmetric, we add 6 BERTLayers trained from scratch, one corresponding to each DyConv module, and insert the VLFuse modules in between to perform cross-modal attention between vision and language, which further boosts performance.
  2. We do not have this in the GLIPv1 paper, but we do have a similar concept (the inter-image word-region contrastive loss) in the GLIPv2 paper, and adding this loss indeed gives even better performance. Please refer to the GLIPv2 paper on arXiv for more details (its link is attached to the repo).
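To make the role of that grounding loss concrete, here is a simplified sketch of the region-word alignment it is computed on. The function name, shapes, and plain BCE below are stand-ins for illustration (in practice a focal-style sigmoid loss is used), not the exact code in the repo:

```python
import torch
import torch.nn.functional as F


def region_word_alignment_loss(region_feats, word_feats, target):
    """Illustrative only. region_feats: (B, Nr, D), word_feats: (B, Nw, D),
    target: (B, Nr, Nw) binary region-word match matrix."""
    # Alignment logits: dot product between every region feature and every word feature.
    # These play the role of the classification logits in a standard detector.
    logits = torch.einsum('brd,bwd->brw', region_feats, word_feats)
    # Plain sigmoid BCE as a stand-in for the focal-style sigmoid loss used in practice.
    return F.binary_cross_entropy_with_logits(logits, target.float())
```

The deep fusion layers sit before this step, so the region and word features entering the dot product have already attended to each other.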

Let us know if you have further questions, thank you!

PanXiebit commented 2 years ago

@Haotian-Zhang Thanks for your response! It helps a lot.