Closed jsg921019 closed 2 years ago
Hi @jsg921019,
@LooperXX I could not reopen the issue; maybe I don't have permission?
Thank you for the reply, but I am confused about points 2 and 3.
From what I understood from 2, the outputs of the ViT are normalized by ln_post, which is not layer-specific (ln_post is shared). But this seems to conflict with 3, which says the modality embedding and projection are shared, while the BridgeLayer is layer-specific, including ln_post (i.e., ln_post is not shared). Also, the word "including" confuses me, because from what I understand, LayerNorm is the only parameter in a BridgeLayer.
I would be really grateful if you could help me clarify this. Thank you!
Hi @jsg921019, the "LayerNorm (post too)" means we use a post-style LayerNorm (randomly initialized) in each BridgeLayer. And ln_post (initialized from CLIP-ViT; you can find it in clip_modal.py in the CLIP/METER code) is applied to all uni-modal representations before the BridgeLayers.
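To make sure I read this correctly, here is a minimal NumPy sketch of that split (all names here are hypothetical, not the actual BridgeTower code): a single shared ln_post normalizes every uni-modal output first, and then each BridgeLayer applies its own randomly initialized post-style LayerNorm.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last (feature) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d = 4
rng = np.random.default_rng(0)

# ln_post: one SHARED LayerNorm; in the real model its weights would be
# initialized from CLIP-ViT rather than ones/zeros.
ln_post_g, ln_post_b = np.ones(d), np.zeros(d)

# Each BridgeLayer owns its OWN randomly initialized post-style LayerNorm.
num_bridge_layers = 3
bridge_lns = [(rng.normal(1.0, 0.02, d), np.zeros(d))
              for _ in range(num_bridge_layers)]

# Toy stand-ins for the visual outputs V_k of several ViT blocks.
vit_outputs = [rng.normal(size=(2, d)) for _ in range(num_bridge_layers)]

for k, v_k in enumerate(vit_outputs):
    v_k = layer_norm(v_k, ln_post_g, ln_post_b)  # shared ln_post first
    g, b = bridge_lns[k]                         # then the layer-specific LN
    z_k = layer_norm(v_k, g, b)                  # inside BridgeLayer k
```

This is only how I understand the answer above, so please correct me if the ordering is wrong.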
@LooperXX
Thank you for the answer. I think I understand what you intended. I hope you have a great day :)
Thanks for the insightful research! Bridge-Tower seems to show a promising new way to fuse text and images. We'd like to test this model in our local environment by tweaking the code from METER, but I am unclear on some details of the model.
METER uses an 11-layer encoder in clip16 by default; does Bridge-Tower follow this setting as well?
In the original CLIP code, the output embedding of each encoder block is not layer-normed (normalization instead happens at the start of the block). Does the BridgeLayer take the embeddings before layer norm as input, or should I ensure that layer-normed input goes into the BridgeLayer block?
The equation in your paper seems to share the linear projection and modal-type embeddings across all cross-modal layers. Am I understanding this correctly? Do they share the LayerNorm weights too?
There isn't any mention of the initialization of Z_0^T and Z_0^V for the first bridge layer. Should it be V_7 @ W^T + V^(type) (so that x == y)?
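To be concrete, here is my guess in code form (a toy NumPy sketch; all names and shapes are hypothetical, with W taken as the shared visual projection and V_type as the shared visual modal-type embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_v, d_z = 2, 4, 4

V7 = rng.normal(size=(seq, d_v))    # toy output of the 7th ViT block
W = rng.normal(size=(d_v, d_z))     # assumed shared visual linear projection
V_type = rng.normal(size=(d_z,))    # assumed shared visual modal-type embedding

# My guessed initialization: Z_0^V = V_7 @ W + V^type,
# so both BridgeLayer inputs coincide (x == y) at the first layer.
Z0_V = V7 @ W + V_type
```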
Thanks in advance!