Closed PanXiebit closed 2 years ago
@PanXiebit Thank you for the questions and support! Please allow me to answer your question below.
Let us know if you have further questions, thank you!
@Haotian-Zhang Thanks for your response! It helps me a lot.
Dear authors,
Thanks for presenting such great work.
In your paper, you conduct an ablation study on the early fusion module (GLIP-T(A) vs. GLIP-T(B)) and demonstrate that the deep fusion module brings clear improvements (zero-shot: 42.9 -> 44.9, fine-tuned: 52.9 -> 53.8).
I have several questions about this module:
In our reimplementation, I find that this 6-layer fusion module nearly doubles the amount of computation. Each fusion layer is composed of three sub-modules: VLFuse (bi-attention), DyConv (vision only), and BERTLayer (language only). Have you run a more detailed ablation? For example, keeping only VLFuse (bi-attention) and removing DyConv and BERTLayer?
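To make the proposed ablation concrete, here is a minimal numpy sketch of one fusion layer with the three sub-modules, plus a flag that keeps only the bi-attention. All names and the placeholder unimodal updates are illustrative assumptions, not GLIP's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bi_attention(v, l):
    """Toy bidirectional cross-attention (stand-in for VLFuse).
    v: (Nv, d) visual tokens, l: (Nl, d) language tokens.
    Each modality attends to the other, with residual connections."""
    d = v.shape[1]
    attn_vl = softmax(v @ l.T / np.sqrt(d))   # (Nv, Nl): vision attends to language
    attn_lv = softmax(l @ v.T / np.sqrt(d))   # (Nl, Nv): language attends to vision
    return v + attn_vl @ l, l + attn_lv @ v

def fusion_layer(v, l, ablate_unimodal=False):
    """One fusion layer: VLFuse, then vision-only / language-only blocks
    (DyConv / BERTLayer in GLIP; simple placeholders here).
    ablate_unimodal=True keeps only the bi-attention, i.e. the
    ablation suggested above."""
    v, l = bi_attention(v, l)
    if not ablate_unimodal:
        v = v + 0.1 * v      # placeholder for the DyConv (vision-only) update
        l = l + 0.1 * l      # placeholder for the BERTLayer (language-only) update
    return v, l

v, l = np.random.randn(5, 16), np.random.randn(7, 16)
for _ in range(6):           # the 6-layer fusion stack
    v, l = fusion_layer(v, l, ablate_unimodal=True)
```

Counting the per-layer cost of each sub-module in such a breakdown would show where the extra compute goes.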
In CoCa and ALBEF, contrastive learning is applied before fusion. Have you tried this align-before-fuse paradigm?
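For reference, the "align" step in ALBEF/CoCa is a symmetric image-text contrastive (InfoNCE) loss computed on unimodal embeddings before any fusion. A minimal numpy sketch (function name and the fixed temperature are my own; real implementations use a learnable temperature and momentum/queue tricks):

```python
import numpy as np

def info_nce(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    Rows of img_emb and txt_emb are assumed matched by index."""
    # L2-normalize before the dot product, as in ALBEF/CoCa
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # (B, B): diagonal = positive pairs
    idx = np.arange(len(img))
    def ce(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    # average the image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

In the align-before-fuse setup this loss supervises the two encoders, and the fusion module is applied only afterwards.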