microsoft / BridgeTower

Open source code for AAAI 2023 Paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"
https://arxiv.org/abs/2206.08657
MIT License

Pretraining Result of BridgeTower #4

Closed jsg921019 closed 2 years ago

jsg921019 commented 2 years ago

Hello, I have implemented BridgeTower architecture according to the paper and this issue based on METER github.

However, I was not able to get results that match the paper. Below are the validation epoch loss curves for BridgeTower (blue) and METER (orange), for MLM and ITM respectively.

[Figure: validation epoch loss curves, MLM (left) and ITM (right)]

The training curves for both models are similar, and even the downstream results on VQAv2 are similar:

| Model | VQAv2 test-dev |
| --- | --- |
| METER | 77.65 |
| BridgeTower | 77.64 |

This is how I implemented BridgeTower:

  1. For the ImageEncoder (CLIP) and TextEncoder (RoBERTa), change forward() so that it returns the last 6 intermediate outputs instead of only the last one, giving [V0, V1, V2, V3, V4, V5] and [T0, T1, T2, T3, T4, T5].
  2. For CLIP, these intermediate outputs are permuted from LND to NLD and normalized with self.ln_post.
  3. The newly added layers are BridgeLayers with 12 LayerNorms (6 for each modality).
  4. Starting with $Z^T_0 = Z^V_0 = 0$, compute $\tilde{Z}^V_l = \mathrm{LayerNorm}(Z^V_l + V_l W_V + V_{type})$ and $\tilde{Z}^T_l = \mathrm{LayerNorm}(Z^T_l + T_l W_T + T_{type})$, where the LayerNorm is different for each layer, but the projections $W_V$, $W_T$ and the type embeddings $V_{type}$, $T_{type}$ are shared.
  5. Then $Z^V_l, Z^T_l = \mathrm{Encoder}^Z_l(\tilde{Z}^V_l, \tilde{Z}^T_l)$, just as in METER.
  6. The lr for the new LayerNorms is 5 times the base lr, with no weight decay.
  7. The rest of the hyperparameters are the same as METER's.
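For reference, here is a minimal, runnable sketch of steps 1-2, assuming an OpenAI-CLIP-style vision tower with `transformer.resblocks` and `ln_post` (the function name and the stand-in model are my own, not from the released code):

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

def collect_clip_intermediates(visual, x):
    """Return the last 6 transformer-block outputs [V0..V5],
    permuted LND -> NLD and normalized with visual.ln_post (steps 1-2)."""
    outputs = []
    for block in visual.transformer.resblocks:
        x = block(x)              # x stays (L, N, D) inside CLIP
        outputs.append(x)
    # keep only the last 6 intermediate outputs, permute LND -> NLD
    return [visual.ln_post(h.permute(1, 0, 2)) for h in outputs[-6:]]

# stand-in "CLIP visual tower" so the sketch runs without the real model
blocks = nn.ModuleList(nn.Identity() for _ in range(8))
visual = SimpleNamespace(transformer=SimpleNamespace(resblocks=blocks),
                         ln_post=nn.LayerNorm(16))
feats = collect_clip_intermediates(visual, torch.randn(5, 2, 16))
```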
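Steps 3-5 can be sketched as follows; class and attribute names are my own, and `cross_layers` stands in for METER's co-attention blocks (any callable taking and returning a pair of tensors):

```python
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    """One bridge connection: Z~_l = LayerNorm(Z_l + X_l @ W + type_emb),
    with a per-layer LayerNorm; W and type_emb are passed in because they
    are shared across layers (step 4)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)   # different for each layer

    def forward(self, z_prev, x_l, proj, type_emb):
        return self.norm(z_prev + proj(x_l) + type_emb)


class BridgeTowerSketch(nn.Module):
    """Six bridge layers per modality feeding six cross-modal encoder
    layers, starting from Z_0^V = Z_0^T = 0 (steps 3-5)."""

    def __init__(self, hidden_size: int = 768, num_layers: int = 6):
        super().__init__()
        self.v_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
        self.t_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
        self.w_v = nn.Linear(hidden_size, hidden_size)      # shared W_V
        self.w_t = nn.Linear(hidden_size, hidden_size)      # shared W_T
        self.v_type = nn.Parameter(torch.zeros(hidden_size))  # shared V_type
        self.t_type = nn.Parameter(torch.zeros(hidden_size))  # shared T_type

    def forward(self, vis_feats, txt_feats, cross_layers):
        # vis_feats / txt_feats: lists [V_0..V_5], [T_0..T_5] (N, L, D)
        z_v = torch.zeros_like(vis_feats[0])
        z_t = torch.zeros_like(txt_feats[0])
        for l, layer in enumerate(cross_layers):
            z_v_in = self.v_bridges[l](z_v, vis_feats[l], self.w_v, self.v_type)
            z_t_in = self.t_bridges[l](z_t, txt_feats[l], self.w_t, self.t_type)
            z_v, z_t = layer(z_v_in, z_t_in)    # Encoder^Z_l, as in METER
        return z_v, z_t
```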
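Step 6 amounts to putting the new LayerNorm parameters into their own optimizer param group. A sketch, assuming the bridge parameters can be identified by a name substring (the `"bridges"` match and the demo model are my own conventions):

```python
import torch
import torch.nn as nn

def build_param_groups(model, base_lr=1e-5, weight_decay=0.01, bridge_lr_mult=5.0):
    """Give the new bridge LayerNorms 5x the base lr and no weight decay;
    everything else keeps the defaults (step 6)."""
    bridge_params, other_params = [], []
    for name, p in model.named_parameters():
        (bridge_params if "bridges" in name else other_params).append(p)
    return [
        {"params": other_params, "lr": base_lr, "weight_decay": weight_decay},
        {"params": bridge_params, "lr": base_lr * bridge_lr_mult, "weight_decay": 0.0},
    ]

# tiny demo model: "bridges" stands in for the new bridge LayerNorms
demo = nn.ModuleDict({"bridges": nn.LayerNorm(4), "proj": nn.Linear(4, 4)})
groups = build_param_groups(demo, base_lr=1e-5)
optimizer = torch.optim.AdamW(groups)
```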

Is there anything wrong, or anything I missed, in my implementation? Thanks in advance.

LooperXX commented 2 years ago

Hello, although I don't see any mistakes in your description, I notice that the mlm_val_loss in your implementation is higher than in our version (0.86~0.87). Our paper has released the pre-training and VQAv2 fine-tuning hyperparameters (Tables 10 & 11). Please check these settings, and wait for our code & checkpoint release.