microsoft / BridgeTower

Open source code for AAAI 2023 Paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"
https://arxiv.org/abs/2206.08657
MIT License

Pretraining Result of BridgeTower #4

Closed jsg921019 closed 2 years ago

jsg921019 commented 2 years ago

Hello, I have implemented BridgeTower architecture according to the paper and this issue based on METER github.

However, I was not able to get results that match the paper. Below are the validation epoch loss curves for BridgeTower (blue) and METER (orange), for MLM and ITM respectively.

[Figure: validation epoch loss curves, MLM (left) and ITM (right)]

The training curves for both models are similar, and even the downstream results on VQAv2 are similar:

| Model | VQAv2 test-dev |
| --- | --- |
| METER | 77.65 |
| BridgeTower | 77.64 |

This is how I implemented BridgeTower:

  1. For the ImageEncoder (CLIP) and TextEncoder (RoBERTa), change forward() so that it returns the last 6 intermediate outputs instead of only the last one, giving [V0, V1, V2, V3, V4, V5] and [T0, T1, T2, T3, T4, T5].
  2. For CLIP, these intermediate outputs are permuted from LND to NLD and normalized with self.ln_post.
  3. The newly added layers are BridgeLayers with 12 LayerNorms (6 for each modality).
  4. Starting with $Z^T_0 = Z^V_0 = 0$, compute $\tilde{Z}^V_l = \mathrm{LayerNorm}(Z^V_l + V_l W_V + V_{type})$ and $\tilde{Z}^T_l = \mathrm{LayerNorm}(Z^T_l + T_l W_T + T_{type})$, where the LayerNorm is different for each layer, but the projections $W_V$, $W_T$ and the type embeddings $V_{type}$, $T_{type}$ are shared.
  5. Then $Z^V_l, Z^T_l = \mathrm{Encoder}^Z_l(\tilde{Z}^V_l, \tilde{Z}^T_l)$, just as in METER.
  6. The lr for the new LayerNorms is 5 times the base lr, with no weight decay.
  7. The rest of the hyperparameters are the same as METER's.
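For reference, here is a minimal, runnable sketch of steps 1-2, assuming an OpenAI-CLIP-style vision tower with `transformer.resblocks` and `ln_post` (the function name and the stand-in model are my own, not from the released code):

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

def collect_clip_intermediates(visual, x):
    """Return the last 6 transformer-block outputs [V0..V5],
    permuted LND -> NLD and normalized with visual.ln_post (steps 1-2)."""
    outputs = []
    for block in visual.transformer.resblocks:
        x = block(x)              # x stays (L, N, D) inside CLIP
        outputs.append(x)
    # keep only the last 6 intermediate outputs, permute LND -> NLD
    return [visual.ln_post(h.permute(1, 0, 2)) for h in outputs[-6:]]

# stand-in "CLIP visual tower" so the sketch runs without the real model
blocks = nn.ModuleList(nn.Identity() for _ in range(8))
visual = SimpleNamespace(transformer=SimpleNamespace(resblocks=blocks),
                         ln_post=nn.LayerNorm(16))
feats = collect_clip_intermediates(visual, torch.randn(5, 2, 16))
```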
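Steps 3-5 can be sketched as follows; class and attribute names are my own, and `cross_layers` stands in for METER's co-attention blocks (any callable taking and returning a pair of tensors):

```python
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    """One bridge connection: Z~_l = LayerNorm(Z_l + X_l @ W + type_emb),
    with a per-layer LayerNorm; W and type_emb are passed in because they
    are shared across layers (step 4)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)   # different for each layer

    def forward(self, z_prev, x_l, proj, type_emb):
        return self.norm(z_prev + proj(x_l) + type_emb)


class BridgeTowerSketch(nn.Module):
    """Six bridge layers per modality feeding six cross-modal encoder
    layers, starting from Z_0^V = Z_0^T = 0 (steps 3-5)."""

    def __init__(self, hidden_size: int = 768, num_layers: int = 6):
        super().__init__()
        self.v_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
        self.t_bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
        self.w_v = nn.Linear(hidden_size, hidden_size)      # shared W_V
        self.w_t = nn.Linear(hidden_size, hidden_size)      # shared W_T
        self.v_type = nn.Parameter(torch.zeros(hidden_size))  # shared V_type
        self.t_type = nn.Parameter(torch.zeros(hidden_size))  # shared T_type

    def forward(self, vis_feats, txt_feats, cross_layers):
        # vis_feats / txt_feats: lists [V_0..V_5], [T_0..T_5] (N, L, D)
        z_v = torch.zeros_like(vis_feats[0])
        z_t = torch.zeros_like(txt_feats[0])
        for l, layer in enumerate(cross_layers):
            z_v_in = self.v_bridges[l](z_v, vis_feats[l], self.w_v, self.v_type)
            z_t_in = self.t_bridges[l](z_t, txt_feats[l], self.w_t, self.t_type)
            z_v, z_t = layer(z_v_in, z_t_in)    # Encoder^Z_l, as in METER
        return z_v, z_t
```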
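Step 6 amounts to putting the new LayerNorm parameters into their own optimizer param group. A sketch, assuming the bridge parameters can be identified by a name substring (the `"bridges"` match and the demo model are my own conventions):

```python
import torch
import torch.nn as nn

def build_param_groups(model, base_lr=1e-5, weight_decay=0.01, bridge_lr_mult=5.0):
    """Give the new bridge LayerNorms 5x the base lr and no weight decay;
    everything else keeps the defaults (step 6)."""
    bridge_params, other_params = [], []
    for name, p in model.named_parameters():
        (bridge_params if "bridges" in name else other_params).append(p)
    return [
        {"params": other_params, "lr": base_lr, "weight_decay": weight_decay},
        {"params": bridge_params, "lr": base_lr * bridge_lr_mult, "weight_decay": 0.0},
    ]

# tiny demo model: "bridges" stands in for the new bridge LayerNorms
demo = nn.ModuleDict({"bridges": nn.LayerNorm(4), "proj": nn.Linear(4, 4)})
groups = build_param_groups(demo, base_lr=1e-5)
optimizer = torch.optim.AdamW(groups)
```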

Is there anything wrong, or anything I missed, in my implementation? Thanks in advance.

LooperXX commented 2 years ago

Hello, although I don't see any mistakes in your description, I notice that the mlm_val_loss in your implementation is higher than in our version (0.86~0.87). Our paper has released the pre-training and VQAv2 fine-tuning hyperparameters (Tables 10 & 11). Please check these settings, and wait for our code & checkpoint release.