microsoft / BridgeTower

Open-source code for the AAAI 2023 paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"
https://arxiv.org/abs/2206.08657
MIT License

A few questions about the implementation of BridgeTower #3

Closed jsg921019 closed 2 years ago

jsg921019 commented 2 years ago

Thanks for the insightful research; BridgeTower seems to show a promising new way to fuse text and images. We'd like to test this model in our local environment by tweaking the code from METER, but I am not clear on some details of the model.

  1. METER uses an 11-layer encoder for clip16 by default; does BridgeTower follow this setting as well?

  2. In the original CLIP code, the output embedding of each encoder block is not layer-normed (LayerNorm is applied at the start of the block instead; see the simplified block after this list). Does the BridgeLayer take the embedding before it is layer-normed as input, or should I make sure a layer-normed input goes into the BridgeLayer?

  3. The equations in your paper seem to share the linear projection and modal-type embeddings across all cross-modal layers. Am I understanding this right? Do they share the LayerNorm weights too?

  4. There isn't any mention of how Z_0^T and Z_0^V are initialized for the first bridge layer. Should it be V_7 @ W^T + V^(type) (so that x == y)?
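For reference, the pre-LN structure that question 2 refers to looks roughly like this (a simplified version of CLIP's ResidualAttentionBlock, not the exact source):

```python
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Simplified pre-LN block in the style of the original CLIP code."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.ln_2 = nn.LayerNorm(d_model)

    def forward(self, x):
        y = self.ln_1(x)  # pre-LN: norm applied at the START of the sub-block
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))  # again, norm before the MLP
        return x  # note: the block's output itself is never layer-normed
```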

Thanks in advance!

LooperXX commented 2 years ago

Hi @jsg921019,

  1. We use 12 layers by default. Our reimplementation experiments for METER all use 12 layers for a fair comparison.
  2. We use ln_post in the VisualTransformer module to layer-norm the output embedding of each layer. (We also tried layer-specific layer norms with ln_post as their initialization, but it didn't work.)
  3. Here we follow METER and use a shared modality-type embedding and linear projection for all the layer representations of the same modality. Each BridgeLayer (including its LayerNorm, which is post-style) is layer-specific and modality-specific.
  4. For the first layer of the cross-modal encoder, we initialize Z_0^T and Z_0^V to 0 (i.e., x = 0); a rough sketch combining points 3 and 4 follows below. We omitted some technical details due to the page limit and for simplicity. Thank you for your questions. If you have any further questions, you can reopen this issue and comment again.
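Putting points 3 and 4 together, a minimal sketch of one modality's bridge stack could look like the following (illustrative names and shapes, not the actual repo code):

```python
import torch
import torch.nn as nn

class BridgeStack(nn.Module):
    """Hypothetical sketch of the bridges for ONE modality.

    proj and type_embed are shared across all bridge layers of this modality,
    while each BridgeLayer owns only its own post-style LayerNorm
    (layer- and modality-specific).
    """

    def __init__(self, uni_dim: int, cross_dim: int, num_bridges: int):
        super().__init__()
        self.proj = nn.Linear(uni_dim, cross_dim)               # shared projection
        self.type_embed = nn.Parameter(torch.zeros(cross_dim))  # shared modality-type embedding
        self.bridge_lns = nn.ModuleList(
            nn.LayerNorm(cross_dim) for _ in range(num_bridges)
        )

    def bridge(self, uni_states, z_prev, layer_idx):
        # y: uni-modal representation after the shared projection + type embedding
        y = self.proj(uni_states) + self.type_embed
        # x: previous cross-modal output; 0 for the first bridge layer (point 4)
        x = torch.zeros_like(y) if z_prev is None else z_prev
        # BridgeLayer(x, y) = LayerNorm(x + y), with a layer-specific LayerNorm
        return self.bridge_lns[layer_idx](x + y)
```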
jsg921019 commented 2 years ago

@LooperXX I could not reopen the issue, maybe because I don't have permission?

Thank you for the reply, but I am confused about 2 and 3.

From what I understood from 2, the outputs from the ViT are normalized by ln_post, which is not layer-specific (ln_post is shared). But this seems to conflict with 3: the modality embedding and projection are shared, yet the BridgeLayer is layer-specific, including ln_post (so ln_post is not shared). Also, the word "including" confuses me because, from what I understand, the LayerNorm is the only parameter in a BridgeLayer.

I would be really grateful if you could help me clarify these. Thank you!

LooperXX commented 2 years ago

Hi @jsg921019, "LayerNorm (post too)" means we use a post-style LayerNorm (randomly initialized) in each BridgeLayer. The ln_post (initialized from CLIP-ViT; you can find it in clip_model.py in the CLIP/METER code) is applied to all uni-modal representations before the BridgeLayers.
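In other words, there are two different norms at play. A minimal illustration of how they differ in initialization and sharing (clip_vit is a placeholder for the pretrained CLIP-ViT module, and the sizes are made up):

```python
import torch.nn as nn

hidden, num_bridges = 768, 6  # hypothetical sizes

# The shared norm: a single ln_post, initialized from the pretrained CLIP-ViT,
# applied to every layer's uni-modal output before it enters a BridgeLayer.
ln_post = nn.LayerNorm(hidden)
# ln_post.load_state_dict(clip_vit.ln_post.state_dict())  # init from CLIP-ViT

# The bridge norms: one post-style LayerNorm per BridgeLayer, layer-specific,
# freshly initialized rather than copied from CLIP ("random initialized" above).
bridge_lns = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(num_bridges))
```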

jsg921019 commented 2 years ago

@LooperXX

Thank you for the answer. I think I understand what you intend. I hope you have a great day :)