Missing keys when loading a fine-tuned VQA checkpoint

Hi, I'm using VQA.py to fine-tune on the VQA datasets. When I load the pre-trained model ALBEF.pth, it loads correctly with no missing keys. But when I load the checkpoint I fine-tuned on the VQA datasets, I get a lot of missing keys. The message is shown below:

_IncompatibleKeys(missing_keys=['text_encoder.embeddings.position_ids', 'text_encoder.embeddings.word_embeddings.weight', 'text_encoder.embeddings.position_embeddings.weight', 'text_encoder.embeddings.token_type_embeddings.weight', 'text_encoder.embeddings.LayerNorm.weight', 'text_encoder.embeddings.LayerNorm.bias', 'text_encoder.encoder.layer.0.attention.self.query.weight', 'text_encoder.encoder.layer.0.attention.self.query.bias', 'text_encoder.encoder.layer.0.attention.self.key.weight', 'text_encoder.encoder.layer.0.attention.self.key.bias', 'text_encoder.encoder.layer.0.attention.self.value.weight', 'text_encoder.encoder.layer.0.attention.self.value.bias', 'text_encoder.encoder.layer.0.attention.output.dense.weight', 'text_encoder.encoder.layer.0.attention.output.dense.bias', 'text_encoder.encoder.layer.0.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.0.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.0.intermediate.dense.weight', 'text_encoder.encoder.layer.0.intermediate.dense.bias', 'text_encoder.encoder.layer.0.output.dense.weight', 'text_encoder.encoder.layer.0.output.dense.bias', 'text_encoder.encoder.layer.0.output.LayerNorm.weight', 'text_encoder.encoder.layer.0.output.LayerNorm.bias', 'text_encoder.encoder.layer.1.attention.self.query.weight', 'text_encoder.encoder.layer.1.attention.self.query.bias', 'text_encoder.encoder.layer.1.attention.self.key.weight', 'text_encoder.encoder.layer.1.attention.self.key.bias', 'text_encoder.encoder.layer.1.attention.self.value.weight', 'text_encoder.encoder.layer.1.attention.self.value.bias', 'text_encoder.encoder.layer.1.attention.output.dense.weight', 'text_encoder.encoder.layer.1.attention.output.dense.bias', 'text_encoder.encoder.layer.1.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.1.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.1.intermediate.dense.weight', 'text_encoder.encoder.layer.1.intermediate.dense.bias', 
'text_encoder.encoder.layer.1.output.dense.weight', 'text_encoder.encoder.layer.1.output.dense.bias', 'text_encoder.encoder.layer.1.output.LayerNorm.weight', 'text_encoder.encoder.layer.1.output.LayerNorm.bias', 'text_encoder.encoder.layer.2.attention.self.query.weight', 'text_encoder.encoder.layer.2.attention.self.query.bias', 'text_encoder.encoder.layer.2.attention.self.key.weight', 'text_encoder.encoder.layer.2.attention.self.key.bias', 'text_encoder.encoder.layer.2.attention.self.value.weight', 'text_encoder.encoder.layer.2.attention.self.value.bias', 'text_encoder.encoder.layer.2.attention.output.dense.weight', 'text_encoder.encoder.layer.2.attention.output.dense.bias', 'text_encoder.encoder.layer.2.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.2.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.2.intermediate.dense.weight', 'text_encoder.encoder.layer.2.intermediate.dense.bias', 'text_encoder.encoder.layer.2.output.dense.weight', 'text_encoder.encoder.layer.2.output.dense.bias', 'text_encoder.encoder.layer.2.output.LayerNorm.weight', 'text_encoder.encoder.layer.2.output.LayerNorm.bias', 'text_encoder.encoder.layer.3.attention.self.query.weight', 'text_encoder.encoder.layer.3.attention.self.query.bias', 'text_encoder.encoder.layer.3.attention.self.key.weight', 'text_encoder.encoder.layer.3.attention.self.key.bias', 'text_encoder.encoder.layer.3.attention.self.value.weight', 'text_encoder.encoder.layer.3.attention.self.value.bias', 'text_encoder.encoder.layer.3.attention.output.dense.weight', 'text_encoder.encoder.layer.3.attention.output.dense.bias', 'text_encoder.encoder.layer.3.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.3.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.3.intermediate.dense.weight', 'text_encoder.encoder.layer.3.intermediate.dense.bias', 'text_encoder.encoder.layer.3.output.dense.weight', 'text_encoder.encoder.layer.3.output.dense.bias', 
'text_encoder.encoder.layer.3.output.LayerNorm.weight', 'text_encoder.encoder.layer.3.output.LayerNorm.bias', 'text_encoder.encoder.layer.4.attention.self.query.weight', 'text_encoder.encoder.layer.4.attention.self.query.bias', 'text_encoder.encoder.layer.4.attention.self.key.weight', 'text_encoder.encoder.layer.4.attention.self.key.bias', 'text_encoder.encoder.layer.4.attention.self.value.weight', 'text_encoder.encoder.layer.4.attention.self.value.bias', 'text_encoder.encoder.layer.4.attention.output.dense.weight', 'text_encoder.encoder.layer.4.attention.output.dense.bias', 'text_encoder.encoder.layer.4.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.4.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.4.intermediate.dense.weight', 'text_encoder.encoder.layer.4.intermediate.dense.bias', 'text_encoder.encoder.layer.4.output.dense.weight', 'text_encoder.encoder.layer.4.output.dense.bias', 'text_encoder.encoder.layer.4.output.LayerNorm.weight', 'text_encoder.encoder.layer.4.output.LayerNorm.bias', 'text_encoder.encoder.layer.5.attention.self.query.weight', 'text_encoder.encoder.layer.5.attention.self.query.bias', 'text_encoder.encoder.layer.5.attention.self.key.weight', 'text_encoder.encoder.layer.5.attention.self.key.bias', 'text_encoder.encoder.layer.5.attention.self.value.weight', 'text_encoder.encoder.layer.5.attention.self.value.bias', 'text_encoder.encoder.layer.5.attention.output.dense.weight', 'text_encoder.encoder.layer.5.attention.output.dense.bias', 'text_encoder.encoder.layer.5.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.5.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.5.intermediate.dense.weight', 'text_encoder.encoder.layer.5.intermediate.dense.bias', 'text_encoder.encoder.layer.5.output.dense.weight', 'text_encoder.encoder.layer.5.output.dense.bias', 'text_encoder.encoder.layer.5.output.LayerNorm.weight', 'text_encoder.encoder.layer.5.output.LayerNorm.bias', 
'text_encoder.encoder.layer.6.attention.self.query.weight', 'text_encoder.encoder.layer.6.attention.self.query.bias', 'text_encoder.encoder.layer.6.attention.self.key.weight', 'text_encoder.encoder.layer.6.attention.self.key.bias', 'text_encoder.encoder.layer.6.attention.self.value.weight', 'text_encoder.encoder.layer.6.attention.self.value.bias', 'text_encoder.encoder.layer.6.attention.output.dense.weight', 'text_encoder.encoder.layer.6.attention.output.dense.bias', 'text_encoder.encoder.layer.6.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.6.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.6.crossattention.self.query.weight', 'text_encoder.encoder.layer.6.crossattention.self.query.bias', 'text_encoder.encoder.layer.6.crossattention.self.key.weight', 'text_encoder.encoder.layer.6.crossattention.self.key.bias', 'text_encoder.encoder.layer.6.crossattention.self.value.weight', 'text_encoder.encoder.layer.6.crossattention.self.value.bias', 'text_encoder.encoder.layer.6.crossattention.output.dense.weight', 'text_encoder.encoder.layer.6.crossattention.output.dense.bias', 'text_encoder.encoder.layer.6.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.6.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.6.intermediate.dense.weight', 'text_encoder.encoder.layer.6.intermediate.dense.bias', 'text_encoder.encoder.layer.6.output.dense.weight', 'text_encoder.encoder.layer.6.output.dense.bias', 'text_encoder.encoder.layer.6.output.LayerNorm.weight', 'text_encoder.encoder.layer.6.output.LayerNorm.bias', 'text_encoder.encoder.layer.7.attention.self.query.weight', 'text_encoder.encoder.layer.7.attention.self.query.bias', 'text_encoder.encoder.layer.7.attention.self.key.weight', 'text_encoder.encoder.layer.7.attention.self.key.bias', 'text_encoder.encoder.layer.7.attention.self.value.weight', 'text_encoder.encoder.layer.7.attention.self.value.bias', 'text_encoder.encoder.layer.7.attention.output.dense.weight', 
'text_encoder.encoder.layer.7.attention.output.dense.bias', 'text_encoder.encoder.layer.7.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.7.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.7.crossattention.self.query.weight', 'text_encoder.encoder.layer.7.crossattention.self.query.bias', 'text_encoder.encoder.layer.7.crossattention.self.key.weight', 'text_encoder.encoder.layer.7.crossattention.self.key.bias', 'text_encoder.encoder.layer.7.crossattention.self.value.weight', 'text_encoder.encoder.layer.7.crossattention.self.value.bias', 'text_encoder.encoder.layer.7.crossattention.output.dense.weight', 'text_encoder.encoder.layer.7.crossattention.output.dense.bias', 'text_encoder.encoder.layer.7.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.7.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.7.intermediate.dense.weight', 'text_encoder.encoder.layer.7.intermediate.dense.bias', 'text_encoder.encoder.layer.7.output.dense.weight', 'text_encoder.encoder.layer.7.output.dense.bias', 'text_encoder.encoder.layer.7.output.LayerNorm.weight', 'text_encoder.encoder.layer.7.output.LayerNorm.bias', 'text_encoder.encoder.layer.8.attention.self.query.weight', 'text_encoder.encoder.layer.8.attention.self.query.bias', 'text_encoder.encoder.layer.8.attention.self.key.weight', 'text_encoder.encoder.layer.8.attention.self.key.bias', 'text_encoder.encoder.layer.8.attention.self.value.weight', 'text_encoder.encoder.layer.8.attention.self.value.bias', 'text_encoder.encoder.layer.8.attention.output.dense.weight', 'text_encoder.encoder.layer.8.attention.output.dense.bias', 'text_encoder.encoder.layer.8.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.8.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.8.crossattention.self.query.weight', 'text_encoder.encoder.layer.8.crossattention.self.query.bias', 'text_encoder.encoder.layer.8.crossattention.self.key.weight', 
'text_encoder.encoder.layer.8.crossattention.self.key.bias', 'text_encoder.encoder.layer.8.crossattention.self.value.weight', 'text_encoder.encoder.layer.8.crossattention.self.value.bias', 'text_encoder.encoder.layer.8.crossattention.output.dense.weight', 'text_encoder.encoder.layer.8.crossattention.output.dense.bias', 'text_encoder.encoder.layer.8.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.8.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.8.intermediate.dense.weight', 'text_encoder.encoder.layer.8.intermediate.dense.bias', 'text_encoder.encoder.layer.8.output.dense.weight', 'text_encoder.encoder.layer.8.output.dense.bias', 'text_encoder.encoder.layer.8.output.LayerNorm.weight', 'text_encoder.encoder.layer.8.output.LayerNorm.bias', 'text_encoder.encoder.layer.9.attention.self.query.weight', 'text_encoder.encoder.layer.9.attention.self.query.bias', 'text_encoder.encoder.layer.9.attention.self.key.weight', 'text_encoder.encoder.layer.9.attention.self.key.bias', 'text_encoder.encoder.layer.9.attention.self.value.weight', 'text_encoder.encoder.layer.9.attention.self.value.bias', 'text_encoder.encoder.layer.9.attention.output.dense.weight', 'text_encoder.encoder.layer.9.attention.output.dense.bias', 'text_encoder.encoder.layer.9.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.9.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.9.crossattention.self.query.weight', 'text_encoder.encoder.layer.9.crossattention.self.query.bias', 'text_encoder.encoder.layer.9.crossattention.self.key.weight', 'text_encoder.encoder.layer.9.crossattention.self.key.bias', 'text_encoder.encoder.layer.9.crossattention.self.value.weight', 'text_encoder.encoder.layer.9.crossattention.self.value.bias', 'text_encoder.encoder.layer.9.crossattention.output.dense.weight', 'text_encoder.encoder.layer.9.crossattention.output.dense.bias', 'text_encoder.encoder.layer.9.crossattention.output.LayerNorm.weight', 
'text_encoder.encoder.layer.9.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.9.intermediate.dense.weight', 'text_encoder.encoder.layer.9.intermediate.dense.bias', 'text_encoder.encoder.layer.9.output.dense.weight', 'text_encoder.encoder.layer.9.output.dense.bias', 'text_encoder.encoder.layer.9.output.LayerNorm.weight', 'text_encoder.encoder.layer.9.output.LayerNorm.bias', 'text_encoder.encoder.layer.10.attention.self.query.weight', 'text_encoder.encoder.layer.10.attention.self.query.bias', 'text_encoder.encoder.layer.10.attention.self.key.weight', 'text_encoder.encoder.layer.10.attention.self.key.bias', 'text_encoder.encoder.layer.10.attention.self.value.weight', 'text_encoder.encoder.layer.10.attention.self.value.bias', 'text_encoder.encoder.layer.10.attention.output.dense.weight', 'text_encoder.encoder.layer.10.attention.output.dense.bias', 'text_encoder.encoder.layer.10.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.10.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.10.crossattention.self.query.weight', 'text_encoder.encoder.layer.10.crossattention.self.query.bias', 'text_encoder.encoder.layer.10.crossattention.self.key.weight', 'text_encoder.encoder.layer.10.crossattention.self.key.bias', 'text_encoder.encoder.layer.10.crossattention.self.value.weight', 'text_encoder.encoder.layer.10.crossattention.self.value.bias', 'text_encoder.encoder.layer.10.crossattention.output.dense.weight', 'text_encoder.encoder.layer.10.crossattention.output.dense.bias', 'text_encoder.encoder.layer.10.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.10.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.10.intermediate.dense.weight', 'text_encoder.encoder.layer.10.intermediate.dense.bias', 'text_encoder.encoder.layer.10.output.dense.weight', 'text_encoder.encoder.layer.10.output.dense.bias', 'text_encoder.encoder.layer.10.output.LayerNorm.weight', 
'text_encoder.encoder.layer.10.output.LayerNorm.bias', 'text_encoder.encoder.layer.11.attention.self.query.weight', 'text_encoder.encoder.layer.11.attention.self.query.bias', 'text_encoder.encoder.layer.11.attention.self.key.weight', 'text_encoder.encoder.layer.11.attention.self.key.bias', 'text_encoder.encoder.layer.11.attention.self.value.weight', 'text_encoder.encoder.layer.11.attention.self.value.bias', 'text_encoder.encoder.layer.11.attention.output.dense.weight', 'text_encoder.encoder.layer.11.attention.output.dense.bias', 'text_encoder.encoder.layer.11.attention.output.LayerNorm.weight', 'text_encoder.encoder.layer.11.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.11.crossattention.self.query.weight', 'text_encoder.encoder.layer.11.crossattention.self.query.bias', 'text_encoder.encoder.layer.11.crossattention.self.key.weight', 'text_encoder.encoder.layer.11.crossattention.self.key.bias', 'text_encoder.encoder.layer.11.crossattention.self.value.weight', 'text_encoder.encoder.layer.11.crossattention.self.value.bias', 'text_encoder.encoder.layer.11.crossattention.output.dense.weight', 'text_encoder.encoder.layer.11.crossattention.output.dense.bias', 'text_encoder.encoder.layer.11.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.11.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.11.intermediate.dense.weight', 'text_encoder.encoder.layer.11.intermediate.dense.bias', 'text_encoder.encoder.layer.11.output.dense.weight', 'text_encoder.encoder.layer.11.output.dense.bias', 'text_encoder.encoder.layer.11.output.LayerNorm.weight', 'text_encoder.encoder.layer.11.output.LayerNorm.bias'], unexpected_keys=['visual_encoder_m.cls_token', 'visual_encoder_m.pos_embed', 'visual_encoder_m.patch_embed.proj.weight', 'visual_encoder_m.patch_embed.proj.bias', 'visual_encoder_m.blocks.0.norm1.weight', 'visual_encoder_m.blocks.0.norm1.bias', 'visual_encoder_m.blocks.0.attn.qkv.weight', 'visual_encoder_m.blocks.0.attn.qkv.bias', 
'visual_encoder_m.blocks.0.attn.proj.weight', 'visual_encoder_m.blocks.0.attn.proj.bias', 'visual_encoder_m.blocks.0.norm2.weight', 'visual_encoder_m.blocks.0.norm2.bias', 'visual_encoder_m.blocks.0.mlp.fc1.weight', 'visual_encoder_m.blocks.0.mlp.fc1.bias', 'visual_encoder_m.blocks.0.mlp.fc2.weight', 'visual_encoder_m.blocks.0.mlp.fc2.bias', 'visual_encoder_m.blocks.1.norm1.weight', 'visual_encoder_m.blocks.1.norm1.bias', 'visual_encoder_m.blocks.1.attn.qkv.weight', 'visual_encoder_m.blocks.1.attn.qkv.bias', 'visual_encoder_m.blocks.1.attn.proj.weight', 'visual_encoder_m.blocks.1.attn.proj.bias', 'visual_encoder_m.blocks.1.norm2.weight', 'visual_encoder_m.blocks.1.norm2.bias', 'visual_encoder_m.blocks.1.mlp.fc1.weight', 'visual_encoder_m.blocks.1.mlp.fc1.bias', 'visual_encoder_m.blocks.1.mlp.fc2.weight', 'visual_encoder_m.blocks.1.mlp.fc2.bias', 'visual_encoder_m.blocks.2.norm1.weight', 'visual_encoder_m.blocks.2.norm1.bias', 'visual_encoder_m.blocks.2.attn.qkv.weight', 'visual_encoder_m.blocks.2.attn.qkv.bias', 'visual_encoder_m.blocks.2.attn.proj.weight', 'visual_encoder_m.blocks.2.attn.proj.bias', 'visual_encoder_m.blocks.2.norm2.weight', 'visual_encoder_m.blocks.2.norm2.bias', 'visual_encoder_m.blocks.2.mlp.fc1.weight', 'visual_encoder_m.blocks.2.mlp.fc1.bias', 'visual_encoder_m.blocks.2.mlp.fc2.weight', 'visual_encoder_m.blocks.2.mlp.fc2.bias', 'visual_encoder_m.blocks.3.norm1.weight', 'visual_encoder_m.blocks.3.norm1.bias', 'visual_encoder_m.blocks.3.attn.qkv.weight', 'visual_encoder_m.blocks.3.attn.qkv.bias', 'visual_encoder_m.blocks.3.attn.proj.weight', 'visual_encoder_m.blocks.3.attn.proj.bias', 'visual_encoder_m.blocks.3.norm2.weight', 'visual_encoder_m.blocks.3.norm2.bias', 'visual_encoder_m.blocks.3.mlp.fc1.weight', 'visual_encoder_m.blocks.3.mlp.fc1.bias', 'visual_encoder_m.blocks.3.mlp.fc2.weight', 'visual_encoder_m.blocks.3.mlp.fc2.bias', 'visual_encoder_m.blocks.4.norm1.weight', 'visual_encoder_m.blocks.4.norm1.bias', 
'visual_encoder_m.blocks.4.attn.qkv.weight', 'visual_encoder_m.blocks.4.attn.qkv.bias', 'visual_encoder_m.blocks.4.attn.proj.weight', 'visual_encoder_m.blocks.4.attn.proj.bias', 'visual_encoder_m.blocks.4.norm2.weight', 'visual_encoder_m.blocks.4.norm2.bias', 'visual_encoder_m.blocks.4.mlp.fc1.weight', 'visual_encoder_m.blocks.4.mlp.fc1.bias', 'visual_encoder_m.blocks.4.mlp.fc2.weight', 'visual_encoder_m.blocks.4.mlp.fc2.bias', 'visual_encoder_m.blocks.5.norm1.weight', 'visual_encoder_m.blocks.5.norm1.bias', 'visual_encoder_m.blocks.5.attn.qkv.weight', 'visual_encoder_m.blocks.5.attn.qkv.bias', 'visual_encoder_m.blocks.5.attn.proj.weight', 'visual_encoder_m.blocks.5.attn.proj.bias', 'visual_encoder_m.blocks.5.norm2.weight', 'visual_encoder_m.blocks.5.norm2.bias', 'visual_encoder_m.blocks.5.mlp.fc1.weight', 'visual_encoder_m.blocks.5.mlp.fc1.bias', 'visual_encoder_m.blocks.5.mlp.fc2.weight', 'visual_encoder_m.blocks.5.mlp.fc2.bias', 'visual_encoder_m.blocks.6.norm1.weight', 'visual_encoder_m.blocks.6.norm1.bias', 'visual_encoder_m.blocks.6.attn.qkv.weight', 'visual_encoder_m.blocks.6.attn.qkv.bias', 'visual_encoder_m.blocks.6.attn.proj.weight', 'visual_encoder_m.blocks.6.attn.proj.bias', 'visual_encoder_m.blocks.6.norm2.weight', 'visual_encoder_m.blocks.6.norm2.bias', 'visual_encoder_m.blocks.6.mlp.fc1.weight', 'visual_encoder_m.blocks.6.mlp.fc1.bias', 'visual_encoder_m.blocks.6.mlp.fc2.weight', 'visual_encoder_m.blocks.6.mlp.fc2.bias', 'visual_encoder_m.blocks.7.norm1.weight', 'visual_encoder_m.blocks.7.norm1.bias', 'visual_encoder_m.blocks.7.attn.qkv.weight', 'visual_encoder_m.blocks.7.attn.qkv.bias', 'visual_encoder_m.blocks.7.attn.proj.weight', 'visual_encoder_m.blocks.7.attn.proj.bias', 'visual_encoder_m.blocks.7.norm2.weight', 'visual_encoder_m.blocks.7.norm2.bias', 'visual_encoder_m.blocks.7.mlp.fc1.weight', 'visual_encoder_m.blocks.7.mlp.fc1.bias', 'visual_encoder_m.blocks.7.mlp.fc2.weight', 'visual_encoder_m.blocks.7.mlp.fc2.bias', 
'visual_encoder_m.blocks.8.norm1.weight', 'visual_encoder_m.blocks.8.norm1.bias', 'visual_encoder_m.blocks.8.attn.qkv.weight', 'visual_encoder_m.blocks.8.attn.qkv.bias', 'visual_encoder_m.blocks.8.attn.proj.weight', 'visual_encoder_m.blocks.8.attn.proj.bias', 'visual_encoder_m.blocks.8.norm2.weight', 'visual_encoder_m.blocks.8.norm2.bias', 'visual_encoder_m.blocks.8.mlp.fc1.weight', 'visual_encoder_m.blocks.8.mlp.fc1.bias', 'visual_encoder_m.blocks.8.mlp.fc2.weight', 'visual_encoder_m.blocks.8.mlp.fc2.bias', 'visual_encoder_m.blocks.9.norm1.weight', 'visual_encoder_m.blocks.9.norm1.bias', 'visual_encoder_m.blocks.9.attn.qkv.weight', 'visual_encoder_m.blocks.9.attn.qkv.bias', 'visual_encoder_m.blocks.9.attn.proj.weight', 'visual_encoder_m.blocks.9.attn.proj.bias', 'visual_encoder_m.blocks.9.norm2.weight', 'visual_encoder_m.blocks.9.norm2.bias', 'visual_encoder_m.blocks.9.mlp.fc1.weight', 'visual_encoder_m.blocks.9.mlp.fc1.bias', 'visual_encoder_m.blocks.9.mlp.fc2.weight', 'visual_encoder_m.blocks.9.mlp.fc2.bias', 'visual_encoder_m.blocks.10.norm1.weight', 'visual_encoder_m.blocks.10.norm1.bias', 'visual_encoder_m.blocks.10.attn.qkv.weight', 'visual_encoder_m.blocks.10.attn.qkv.bias', 'visual_encoder_m.blocks.10.attn.proj.weight', 'visual_encoder_m.blocks.10.attn.proj.bias', 'visual_encoder_m.blocks.10.norm2.weight', 'visual_encoder_m.blocks.10.norm2.bias', 'visual_encoder_m.blocks.10.mlp.fc1.weight', 'visual_encoder_m.blocks.10.mlp.fc1.bias', 'visual_encoder_m.blocks.10.mlp.fc2.weight', 'visual_encoder_m.blocks.10.mlp.fc2.bias', 'visual_encoder_m.blocks.11.norm1.weight', 'visual_encoder_m.blocks.11.norm1.bias', 'visual_encoder_m.blocks.11.attn.qkv.weight', 'visual_encoder_m.blocks.11.attn.qkv.bias', 'visual_encoder_m.blocks.11.attn.proj.weight', 'visual_encoder_m.blocks.11.attn.proj.bias', 'visual_encoder_m.blocks.11.norm2.weight', 'visual_encoder_m.blocks.11.norm2.bias', 'visual_encoder_m.blocks.11.mlp.fc1.weight', 'visual_encoder_m.blocks.11.mlp.fc1.bias', 
'visual_encoder_m.blocks.11.mlp.fc2.weight', 'visual_encoder_m.blocks.11.mlp.fc2.bias', 'visual_encoder_m.norm.weight', 'visual_encoder_m.norm.bias', 'text_decoder_m.bert.embeddings.position_ids', 'text_decoder_m.bert.embeddings.word_embeddings.weight', 'text_decoder_m.bert.embeddings.position_embeddings.weight', 'text_decoder_m.bert.embeddings.token_type_embeddings.weight', 'text_decoder_m.bert.embeddings.LayerNorm.weight', 'text_decoder_m.bert.embeddings.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.0.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.0.attention.self.query.bias', 'text_decoder_m.bert.encoder.layer.0.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.0.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.0.attention.self.value.weight', 'text_decoder_m.bert.encoder.layer.0.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.0.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.0.attention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.0.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.0.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.0.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.0.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.0.intermediate.dense.bias', 
'text_decoder_m.bert.encoder.layer.0.output.dense.weight', 'text_decoder_m.bert.encoder.layer.0.output.dense.bias', 'text_decoder_m.bert.encoder.layer.0.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.0.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.1.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.1.attention.self.query.bias', 'text_decoder_m.bert.encoder.layer.1.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.1.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.1.attention.self.value.weight', 'text_decoder_m.bert.encoder.layer.1.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.1.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.1.attention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.1.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.1.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.1.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.1.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.1.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.1.intermediate.dense.bias', 'text_decoder_m.bert.encoder.layer.1.output.dense.weight', 'text_decoder_m.bert.encoder.layer.1.output.dense.bias', 'text_decoder_m.bert.encoder.layer.1.output.LayerNorm.weight', 
'text_decoder_m.bert.encoder.layer.1.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.2.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.2.attention.self.query.bias', 'text_decoder_m.bert.encoder.layer.2.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.2.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.2.attention.self.value.weight', 'text_decoder_m.bert.encoder.layer.2.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.2.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.2.attention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.2.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.2.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.2.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.2.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.2.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.2.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.2.intermediate.dense.bias', 'text_decoder_m.bert.encoder.layer.2.output.dense.weight', 'text_decoder_m.bert.encoder.layer.2.output.dense.bias', 'text_decoder_m.bert.encoder.layer.2.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.2.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.3.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.3.attention.self.query.bias', 
'text_decoder_m.bert.encoder.layer.3.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.3.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.3.attention.self.value.weight', 'text_decoder_m.bert.encoder.layer.3.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.3.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.3.attention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.3.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.3.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.3.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.3.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.3.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.3.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.3.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.3.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.3.intermediate.dense.bias', 'text_decoder_m.bert.encoder.layer.3.output.dense.weight', 'text_decoder_m.bert.encoder.layer.3.output.dense.bias', 'text_decoder_m.bert.encoder.layer.3.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.3.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.4.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.4.attention.self.query.bias', 'text_decoder_m.bert.encoder.layer.4.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.4.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.4.attention.self.value.weight', 
'text_decoder_m.bert.encoder.layer.4.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.4.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.4.attention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.4.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.4.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.4.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.4.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.4.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.4.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.4.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.4.intermediate.dense.bias', 'text_decoder_m.bert.encoder.layer.4.output.dense.weight', 'text_decoder_m.bert.encoder.layer.4.output.dense.bias', 'text_decoder_m.bert.encoder.layer.4.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.4.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.5.attention.self.query.weight', 'text_decoder_m.bert.encoder.layer.5.attention.self.query.bias', 'text_decoder_m.bert.encoder.layer.5.attention.self.key.weight', 'text_decoder_m.bert.encoder.layer.5.attention.self.key.bias', 'text_decoder_m.bert.encoder.layer.5.attention.self.value.weight', 'text_decoder_m.bert.encoder.layer.5.attention.self.value.bias', 'text_decoder_m.bert.encoder.layer.5.attention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.5.attention.output.dense.bias', 
'text_decoder_m.bert.encoder.layer.5.attention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.5.attention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.query.weight', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.query.bias', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.key.weight', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.key.bias', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.value.weight', 'text_decoder_m.bert.encoder.layer.5.crossattention.self.value.bias', 'text_decoder_m.bert.encoder.layer.5.crossattention.output.dense.weight', 'text_decoder_m.bert.encoder.layer.5.crossattention.output.dense.bias', 'text_decoder_m.bert.encoder.layer.5.crossattention.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.5.crossattention.output.LayerNorm.bias', 'text_decoder_m.bert.encoder.layer.5.intermediate.dense.weight', 'text_decoder_m.bert.encoder.layer.5.intermediate.dense.bias', 'text_decoder_m.bert.encoder.layer.5.output.dense.weight', 'text_decoder_m.bert.encoder.layer.5.output.dense.bias', 'text_decoder_m.bert.encoder.layer.5.output.LayerNorm.weight', 'text_decoder_m.bert.encoder.layer.5.output.LayerNorm.bias', 'text_decoder_m.cls.predictions.bias', 'text_decoder_m.cls.predictions.transform.dense.weight', 'text_decoder_m.cls.predictions.transform.dense.bias', 'text_decoder_m.cls.predictions.transform.LayerNorm.weight', 'text_decoder_m.cls.predictions.transform.LayerNorm.bias', 'text_decoder_m.cls.predictions.decoder.weight', 'text_decoder_m.cls.predictions.decoder.bias', 'text_decoder_m.embeddings.position_ids', 'text_decoder_m.embeddings.word_embeddings.weight', 'text_decoder_m.embeddings.position_embeddings.weight', 'text_decoder_m.embeddings.token_type_embeddings.weight', 'text_decoder_m.embeddings.LayerNorm.weight', 'text_decoder_m.embeddings.LayerNorm.bias', 'text_decoder_m.encoder.layer.6.0.self.query.weight', 
'text_decoder_m.encoder.layer.6.0.self.query.bias', 'text_decoder_m.encoder.layer.6.0.self.key.weight', 'text_decoder_m.encoder.layer.6.0.self.key.bias', 'text_decoder_m.encoder.layer.6.0.self.value.weight', 'text_decoder_m.encoder.layer.6.0.self.value.bias', 'text_decoder_m.encoder.layer.6.0.output.dense.weight', 'text_decoder_m.encoder.layer.6.0.output.dense.bias', 'text_decoder_m.encoder.layer.6.0.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.6.0.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.6.0.dense.weight', 'text_decoder_m.encoder.layer.6.0.dense.bias', 'text_decoder_m.encoder.layer.6.0.LayerNorm.weight', 'text_decoder_m.encoder.layer.6.0.LayerNorm.bias', 'text_decoder_m.encoder.layer.7.1.self.query.weight', 'text_decoder_m.encoder.layer.7.1.self.query.bias', 'text_decoder_m.encoder.layer.7.1.self.key.weight', 'text_decoder_m.encoder.layer.7.1.self.key.bias', 'text_decoder_m.encoder.layer.7.1.self.value.weight', 'text_decoder_m.encoder.layer.7.1.self.value.bias', 'text_decoder_m.encoder.layer.7.1.output.dense.weight', 'text_decoder_m.encoder.layer.7.1.output.dense.bias', 'text_decoder_m.encoder.layer.7.1.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.7.1.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.7.1.dense.weight', 'text_decoder_m.encoder.layer.7.1.dense.bias', 'text_decoder_m.encoder.layer.7.1.LayerNorm.weight', 'text_decoder_m.encoder.layer.7.1.LayerNorm.bias', 'text_decoder_m.encoder.layer.8.2.self.query.weight', 'text_decoder_m.encoder.layer.8.2.self.query.bias', 'text_decoder_m.encoder.layer.8.2.self.key.weight', 'text_decoder_m.encoder.layer.8.2.self.key.bias', 'text_decoder_m.encoder.layer.8.2.self.value.weight', 'text_decoder_m.encoder.layer.8.2.self.value.bias', 'text_decoder_m.encoder.layer.8.2.output.dense.weight', 'text_decoder_m.encoder.layer.8.2.output.dense.bias', 'text_decoder_m.encoder.layer.8.2.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.8.2.output.LayerNorm.bias', 
'text_decoder_m.encoder.layer.8.2.dense.weight', 'text_decoder_m.encoder.layer.8.2.dense.bias', 'text_decoder_m.encoder.layer.8.2.LayerNorm.weight', 'text_decoder_m.encoder.layer.8.2.LayerNorm.bias', 'text_decoder_m.encoder.layer.9.3.self.query.weight', 'text_decoder_m.encoder.layer.9.3.self.query.bias', 'text_decoder_m.encoder.layer.9.3.self.key.weight', 'text_decoder_m.encoder.layer.9.3.self.key.bias', 'text_decoder_m.encoder.layer.9.3.self.value.weight', 'text_decoder_m.encoder.layer.9.3.self.value.bias', 'text_decoder_m.encoder.layer.9.3.output.dense.weight', 'text_decoder_m.encoder.layer.9.3.output.dense.bias', 'text_decoder_m.encoder.layer.9.3.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.9.3.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.9.3.dense.weight', 'text_decoder_m.encoder.layer.9.3.dense.bias', 'text_decoder_m.encoder.layer.9.3.LayerNorm.weight', 'text_decoder_m.encoder.layer.9.3.LayerNorm.bias', 'text_decoder_m.encoder.layer.10.4.self.query.weight', 'text_decoder_m.encoder.layer.10.4.self.query.bias', 'text_decoder_m.encoder.layer.10.4.self.key.weight', 'text_decoder_m.encoder.layer.10.4.self.key.bias', 'text_decoder_m.encoder.layer.10.4.self.value.weight', 'text_decoder_m.encoder.layer.10.4.self.value.bias', 'text_decoder_m.encoder.layer.10.4.output.dense.weight', 'text_decoder_m.encoder.layer.10.4.output.dense.bias', 'text_decoder_m.encoder.layer.10.4.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.10.4.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.10.4.dense.weight', 'text_decoder_m.encoder.layer.10.4.dense.bias', 'text_decoder_m.encoder.layer.10.4.LayerNorm.weight', 'text_decoder_m.encoder.layer.10.4.LayerNorm.bias', 'text_decoder_m.encoder.layer.11.5.self.query.weight', 'text_decoder_m.encoder.layer.11.5.self.query.bias', 'text_decoder_m.encoder.layer.11.5.self.key.weight', 'text_decoder_m.encoder.layer.11.5.self.key.bias', 'text_decoder_m.encoder.layer.11.5.self.value.weight', 
'text_decoder_m.encoder.layer.11.5.self.value.bias', 'text_decoder_m.encoder.layer.11.5.output.dense.weight', 'text_decoder_m.encoder.layer.11.5.output.dense.bias', 'text_decoder_m.encoder.layer.11.5.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.11.5.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.11.5.dense.weight', 'text_decoder_m.encoder.layer.11.5.dense.bias', 'text_decoder_m.encoder.layer.11.5.LayerNorm.weight', 'text_decoder_m.encoder.layer.11.5.LayerNorm.bias', 'text_decoder_m.encoder.layer.0.attention.self.query.weight', 'text_decoder_m.encoder.layer.0.attention.self.query.bias', 'text_decoder_m.encoder.layer.0.attention.self.key.weight', 'text_decoder_m.encoder.layer.0.attention.self.key.bias', 'text_decoder_m.encoder.layer.0.attention.self.value.weight', 'text_decoder_m.encoder.layer.0.attention.self.value.bias', 'text_decoder_m.encoder.layer.0.attention.output.dense.weight', 'text_decoder_m.encoder.layer.0.attention.output.dense.bias', 'text_decoder_m.encoder.layer.0.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.0.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.0.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.0.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.0.crossattention.self.key.weight', 'text_decoder_m.encoder.layer.0.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.0.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.0.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.0.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.0.crossattention.output.dense.bias', 'text_decoder_m.encoder.layer.0.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.0.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.0.intermediate.dense.weight', 'text_decoder_m.encoder.layer.0.intermediate.dense.bias', 'text_decoder_m.encoder.layer.0.output.dense.weight', 
'text_decoder_m.encoder.layer.0.output.dense.bias', 'text_decoder_m.encoder.layer.0.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.0.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.1.attention.self.query.weight', 'text_decoder_m.encoder.layer.1.attention.self.query.bias', 'text_decoder_m.encoder.layer.1.attention.self.key.weight', 'text_decoder_m.encoder.layer.1.attention.self.key.bias', 'text_decoder_m.encoder.layer.1.attention.self.value.weight', 'text_decoder_m.encoder.layer.1.attention.self.value.bias', 'text_decoder_m.encoder.layer.1.attention.output.dense.weight', 'text_decoder_m.encoder.layer.1.attention.output.dense.bias', 'text_decoder_m.encoder.layer.1.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.1.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.1.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.1.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.1.crossattention.self.key.weight', 'text_decoder_m.encoder.layer.1.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.1.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.1.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.1.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.1.crossattention.output.dense.bias', 'text_decoder_m.encoder.layer.1.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.1.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.1.intermediate.dense.weight', 'text_decoder_m.encoder.layer.1.intermediate.dense.bias', 'text_decoder_m.encoder.layer.1.output.dense.weight', 'text_decoder_m.encoder.layer.1.output.dense.bias', 'text_decoder_m.encoder.layer.1.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.1.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.2.attention.self.query.weight', 'text_decoder_m.encoder.layer.2.attention.self.query.bias', 'text_decoder_m.encoder.layer.2.attention.self.key.weight', 
'text_decoder_m.encoder.layer.2.attention.self.key.bias', 'text_decoder_m.encoder.layer.2.attention.self.value.weight', 'text_decoder_m.encoder.layer.2.attention.self.value.bias', 'text_decoder_m.encoder.layer.2.attention.output.dense.weight', 'text_decoder_m.encoder.layer.2.attention.output.dense.bias', 'text_decoder_m.encoder.layer.2.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.2.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.2.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.2.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.2.crossattention.self.key.weight', 'text_decoder_m.encoder.layer.2.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.2.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.2.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.2.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.2.crossattention.output.dense.bias', 'text_decoder_m.encoder.layer.2.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.2.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.2.intermediate.dense.weight', 'text_decoder_m.encoder.layer.2.intermediate.dense.bias', 'text_decoder_m.encoder.layer.2.output.dense.weight', 'text_decoder_m.encoder.layer.2.output.dense.bias', 'text_decoder_m.encoder.layer.2.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.2.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.3.attention.self.query.weight', 'text_decoder_m.encoder.layer.3.attention.self.query.bias', 'text_decoder_m.encoder.layer.3.attention.self.key.weight', 'text_decoder_m.encoder.layer.3.attention.self.key.bias', 'text_decoder_m.encoder.layer.3.attention.self.value.weight', 'text_decoder_m.encoder.layer.3.attention.self.value.bias', 'text_decoder_m.encoder.layer.3.attention.output.dense.weight', 'text_decoder_m.encoder.layer.3.attention.output.dense.bias', 
'text_decoder_m.encoder.layer.3.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.3.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.3.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.3.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.3.crossattention.self.key.weight', 'text_decoder_m.encoder.layer.3.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.3.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.3.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.3.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.3.crossattention.output.dense.bias', 'text_decoder_m.encoder.layer.3.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.3.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.3.intermediate.dense.weight', 'text_decoder_m.encoder.layer.3.intermediate.dense.bias', 'text_decoder_m.encoder.layer.3.output.dense.weight', 'text_decoder_m.encoder.layer.3.output.dense.bias', 'text_decoder_m.encoder.layer.3.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.3.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.4.attention.self.query.weight', 'text_decoder_m.encoder.layer.4.attention.self.query.bias', 'text_decoder_m.encoder.layer.4.attention.self.key.weight', 'text_decoder_m.encoder.layer.4.attention.self.key.bias', 'text_decoder_m.encoder.layer.4.attention.self.value.weight', 'text_decoder_m.encoder.layer.4.attention.self.value.bias', 'text_decoder_m.encoder.layer.4.attention.output.dense.weight', 'text_decoder_m.encoder.layer.4.attention.output.dense.bias', 'text_decoder_m.encoder.layer.4.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.4.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.4.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.4.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.4.crossattention.self.key.weight', 
'text_decoder_m.encoder.layer.4.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.4.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.4.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.4.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.4.crossattention.output.dense.bias', 'text_decoder_m.encoder.layer.4.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.4.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.4.intermediate.dense.weight', 'text_decoder_m.encoder.layer.4.intermediate.dense.bias', 'text_decoder_m.encoder.layer.4.output.dense.weight', 'text_decoder_m.encoder.layer.4.output.dense.bias', 'text_decoder_m.encoder.layer.4.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.4.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.5.attention.self.query.weight', 'text_decoder_m.encoder.layer.5.attention.self.query.bias', 'text_decoder_m.encoder.layer.5.attention.self.key.weight', 'text_decoder_m.encoder.layer.5.attention.self.key.bias', 'text_decoder_m.encoder.layer.5.attention.self.value.weight', 'text_decoder_m.encoder.layer.5.attention.self.value.bias', 'text_decoder_m.encoder.layer.5.attention.output.dense.weight', 'text_decoder_m.encoder.layer.5.attention.output.dense.bias', 'text_decoder_m.encoder.layer.5.attention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.5.attention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.5.crossattention.self.query.weight', 'text_decoder_m.encoder.layer.5.crossattention.self.query.bias', 'text_decoder_m.encoder.layer.5.crossattention.self.key.weight', 'text_decoder_m.encoder.layer.5.crossattention.self.key.bias', 'text_decoder_m.encoder.layer.5.crossattention.self.value.weight', 'text_decoder_m.encoder.layer.5.crossattention.self.value.bias', 'text_decoder_m.encoder.layer.5.crossattention.output.dense.weight', 'text_decoder_m.encoder.layer.5.crossattention.output.dense.bias', 
'text_decoder_m.encoder.layer.5.crossattention.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.5.crossattention.output.LayerNorm.bias', 'text_decoder_m.encoder.layer.5.intermediate.dense.weight', 'text_decoder_m.encoder.layer.5.intermediate.dense.bias', 'text_decoder_m.encoder.layer.5.output.dense.weight', 'text_decoder_m.encoder.layer.5.output.dense.bias', 'text_decoder_m.encoder.layer.5.output.LayerNorm.weight', 'text_decoder_m.encoder.layer.5.output.LayerNorm.bias', 'text_decoder.embeddings.position_ids', 'text_decoder.embeddings.word_embeddings.weight', 'text_decoder.embeddings.position_embeddings.weight', 'text_decoder.embeddings.token_type_embeddings.weight', 'text_decoder.embeddings.LayerNorm.weight', 'text_decoder.embeddings.LayerNorm.bias', 'text_decoder.encoder.layer.6.0.self.query.weight', 'text_decoder.encoder.layer.6.0.self.query.bias', 'text_decoder.encoder.layer.6.0.self.key.weight', 'text_decoder.encoder.layer.6.0.self.key.bias', 'text_decoder.encoder.layer.6.0.self.value.weight', 'text_decoder.encoder.layer.6.0.self.value.bias', 'text_decoder.encoder.layer.6.0.output.dense.weight', 'text_decoder.encoder.layer.6.0.output.dense.bias', 'text_decoder.encoder.layer.6.0.output.LayerNorm.weight', 'text_decoder.encoder.layer.6.0.output.LayerNorm.bias', 'text_decoder.encoder.layer.6.0.dense.weight', 'text_decoder.encoder.layer.6.0.dense.bias', 'text_decoder.encoder.layer.6.0.LayerNorm.weight', 'text_decoder.encoder.layer.6.0.LayerNorm.bias', 'text_decoder.encoder.layer.7.1.self.query.weight', 'text_decoder.encoder.layer.7.1.self.query.bias', 'text_decoder.encoder.layer.7.1.self.key.weight', 'text_decoder.encoder.layer.7.1.self.key.bias', 'text_decoder.encoder.layer.7.1.self.value.weight', 'text_decoder.encoder.layer.7.1.self.value.bias', 'text_decoder.encoder.layer.7.1.output.dense.weight', 'text_decoder.encoder.layer.7.1.output.dense.bias', 'text_decoder.encoder.layer.7.1.output.LayerNorm.weight', 
'text_decoder.encoder.layer.7.1.output.LayerNorm.bias', 'text_decoder.encoder.layer.7.1.dense.weight', 'text_decoder.encoder.layer.7.1.dense.bias', 'text_decoder.encoder.layer.7.1.LayerNorm.weight', 'text_decoder.encoder.layer.7.1.LayerNorm.bias', 'text_decoder.encoder.layer.8.2.self.query.weight', 'text_decoder.encoder.layer.8.2.self.query.bias', 'text_decoder.encoder.layer.8.2.self.key.weight', 'text_decoder.encoder.layer.8.2.self.key.bias', 'text_decoder.encoder.layer.8.2.self.value.weight', 'text_decoder.encoder.layer.8.2.self.value.bias', 'text_decoder.encoder.layer.8.2.output.dense.weight', 'text_decoder.encoder.layer.8.2.output.dense.bias', 'text_decoder.encoder.layer.8.2.output.LayerNorm.weight', 'text_decoder.encoder.layer.8.2.output.LayerNorm.bias', 'text_decoder.encoder.layer.8.2.dense.weight', 'text_decoder.encoder.layer.8.2.dense.bias', 'text_decoder.encoder.layer.8.2.LayerNorm.weight', 'text_decoder.encoder.layer.8.2.LayerNorm.bias', 'text_decoder.encoder.layer.9.3.self.query.weight', 'text_decoder.encoder.layer.9.3.self.query.bias', 'text_decoder.encoder.layer.9.3.self.key.weight', 'text_decoder.encoder.layer.9.3.self.key.bias', 'text_decoder.encoder.layer.9.3.self.value.weight', 'text_decoder.encoder.layer.9.3.self.value.bias', 'text_decoder.encoder.layer.9.3.output.dense.weight', 'text_decoder.encoder.layer.9.3.output.dense.bias', 'text_decoder.encoder.layer.9.3.output.LayerNorm.weight', 'text_decoder.encoder.layer.9.3.output.LayerNorm.bias', 'text_decoder.encoder.layer.9.3.dense.weight', 'text_decoder.encoder.layer.9.3.dense.bias', 'text_decoder.encoder.layer.9.3.LayerNorm.weight', 'text_decoder.encoder.layer.9.3.LayerNorm.bias', 'text_decoder.encoder.layer.10.4.self.query.weight', 'text_decoder.encoder.layer.10.4.self.query.bias', 'text_decoder.encoder.layer.10.4.self.key.weight', 'text_decoder.encoder.layer.10.4.self.key.bias', 'text_decoder.encoder.layer.10.4.self.value.weight', 'text_decoder.encoder.layer.10.4.self.value.bias', 
'text_decoder.encoder.layer.10.4.output.dense.weight', 'text_decoder.encoder.layer.10.4.output.dense.bias', 'text_decoder.encoder.layer.10.4.output.LayerNorm.weight', 'text_decoder.encoder.layer.10.4.output.LayerNorm.bias', 'text_decoder.encoder.layer.10.4.dense.weight', 'text_decoder.encoder.layer.10.4.dense.bias', 'text_decoder.encoder.layer.10.4.LayerNorm.weight', 'text_decoder.encoder.layer.10.4.LayerNorm.bias', 'text_decoder.encoder.layer.11.5.self.query.weight', 'text_decoder.encoder.layer.11.5.self.query.bias', 'text_decoder.encoder.layer.11.5.self.key.weight', 'text_decoder.encoder.layer.11.5.self.key.bias', 'text_decoder.encoder.layer.11.5.self.value.weight', 'text_decoder.encoder.layer.11.5.self.value.bias', 'text_decoder.encoder.layer.11.5.output.dense.weight', 'text_decoder.encoder.layer.11.5.output.dense.bias', 'text_decoder.encoder.layer.11.5.output.LayerNorm.weight', 'text_decoder.encoder.layer.11.5.output.LayerNorm.bias', 'text_decoder.encoder.layer.11.5.dense.weight', 'text_decoder.encoder.layer.11.5.dense.bias', 'text_decoder.encoder.layer.11.5.LayerNorm.weight', 'text_decoder.encoder.layer.11.5.LayerNorm.bias', 'text_decoder.encoder.layer.0.attention.self.query.weight', 'text_decoder.encoder.layer.0.attention.self.query.bias', 'text_decoder.encoder.layer.0.attention.self.key.weight', 'text_decoder.encoder.layer.0.attention.self.key.bias', 'text_decoder.encoder.layer.0.attention.self.value.weight', 'text_decoder.encoder.layer.0.attention.self.value.bias', 'text_decoder.encoder.layer.0.attention.output.dense.weight', 'text_decoder.encoder.layer.0.attention.output.dense.bias', 'text_decoder.encoder.layer.0.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.0.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.0.crossattention.self.query.weight', 'text_decoder.encoder.layer.0.crossattention.self.query.bias', 'text_decoder.encoder.layer.0.crossattention.self.key.weight', 
'text_decoder.encoder.layer.0.crossattention.self.key.bias', 'text_decoder.encoder.layer.0.crossattention.self.value.weight', 'text_decoder.encoder.layer.0.crossattention.self.value.bias', 'text_decoder.encoder.layer.0.crossattention.output.dense.weight', 'text_decoder.encoder.layer.0.crossattention.output.dense.bias', 'text_decoder.encoder.layer.0.crossattention.output.LayerNorm.weight', 'text_decoder.encoder.layer.0.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.0.intermediate.dense.weight', 'text_decoder.encoder.layer.0.intermediate.dense.bias', 'text_decoder.encoder.layer.0.output.dense.weight', 'text_decoder.encoder.layer.0.output.dense.bias', 'text_decoder.encoder.layer.0.output.LayerNorm.weight', 'text_decoder.encoder.layer.0.output.LayerNorm.bias', 'text_decoder.encoder.layer.1.attention.self.query.weight', 'text_decoder.encoder.layer.1.attention.self.query.bias', 'text_decoder.encoder.layer.1.attention.self.key.weight', 'text_decoder.encoder.layer.1.attention.self.key.bias', 'text_decoder.encoder.layer.1.attention.self.value.weight', 'text_decoder.encoder.layer.1.attention.self.value.bias', 'text_decoder.encoder.layer.1.attention.output.dense.weight', 'text_decoder.encoder.layer.1.attention.output.dense.bias', 'text_decoder.encoder.layer.1.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.1.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.1.crossattention.self.query.weight', 'text_decoder.encoder.layer.1.crossattention.self.query.bias', 'text_decoder.encoder.layer.1.crossattention.self.key.weight', 'text_decoder.encoder.layer.1.crossattention.self.key.bias', 'text_decoder.encoder.layer.1.crossattention.self.value.weight', 'text_decoder.encoder.layer.1.crossattention.self.value.bias', 'text_decoder.encoder.layer.1.crossattention.output.dense.weight', 'text_decoder.encoder.layer.1.crossattention.output.dense.bias', 'text_decoder.encoder.layer.1.crossattention.output.LayerNorm.weight', 
'text_decoder.encoder.layer.1.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.1.intermediate.dense.weight', 'text_decoder.encoder.layer.1.intermediate.dense.bias', 'text_decoder.encoder.layer.1.output.dense.weight', 'text_decoder.encoder.layer.1.output.dense.bias', 'text_decoder.encoder.layer.1.output.LayerNorm.weight', 'text_decoder.encoder.layer.1.output.LayerNorm.bias', 'text_decoder.encoder.layer.2.attention.self.query.weight', 'text_decoder.encoder.layer.2.attention.self.query.bias', 'text_decoder.encoder.layer.2.attention.self.key.weight', 'text_decoder.encoder.layer.2.attention.self.key.bias', 'text_decoder.encoder.layer.2.attention.self.value.weight', 'text_decoder.encoder.layer.2.attention.self.value.bias', 'text_decoder.encoder.layer.2.attention.output.dense.weight', 'text_decoder.encoder.layer.2.attention.output.dense.bias', 'text_decoder.encoder.layer.2.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.2.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.2.crossattention.self.query.weight', 'text_decoder.encoder.layer.2.crossattention.self.query.bias', 'text_decoder.encoder.layer.2.crossattention.self.key.weight', 'text_decoder.encoder.layer.2.crossattention.self.key.bias', 'text_decoder.encoder.layer.2.crossattention.self.value.weight', 'text_decoder.encoder.layer.2.crossattention.self.value.bias', 'text_decoder.encoder.layer.2.crossattention.output.dense.weight', 'text_decoder.encoder.layer.2.crossattention.output.dense.bias', 'text_decoder.encoder.layer.2.crossattention.output.LayerNorm.weight', 'text_decoder.encoder.layer.2.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.2.intermediate.dense.weight', 'text_decoder.encoder.layer.2.intermediate.dense.bias', 'text_decoder.encoder.layer.2.output.dense.weight', 'text_decoder.encoder.layer.2.output.dense.bias', 'text_decoder.encoder.layer.2.output.LayerNorm.weight', 'text_decoder.encoder.layer.2.output.LayerNorm.bias', 
'text_decoder.encoder.layer.3.attention.self.query.weight', 'text_decoder.encoder.layer.3.attention.self.query.bias', 'text_decoder.encoder.layer.3.attention.self.key.weight', 'text_decoder.encoder.layer.3.attention.self.key.bias', 'text_decoder.encoder.layer.3.attention.self.value.weight', 'text_decoder.encoder.layer.3.attention.self.value.bias', 'text_decoder.encoder.layer.3.attention.output.dense.weight', 'text_decoder.encoder.layer.3.attention.output.dense.bias', 'text_decoder.encoder.layer.3.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.3.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.3.crossattention.self.query.weight', 'text_decoder.encoder.layer.3.crossattention.self.query.bias', 'text_decoder.encoder.layer.3.crossattention.self.key.weight', 'text_decoder.encoder.layer.3.crossattention.self.key.bias', 'text_decoder.encoder.layer.3.crossattention.self.value.weight', 'text_decoder.encoder.layer.3.crossattention.self.value.bias', 'text_decoder.encoder.layer.3.crossattention.output.dense.weight', 'text_decoder.encoder.layer.3.crossattention.output.dense.bias', 'text_decoder.encoder.layer.3.crossattention.output.LayerNorm.weight', 'text_decoder.encoder.layer.3.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.3.intermediate.dense.weight', 'text_decoder.encoder.layer.3.intermediate.dense.bias', 'text_decoder.encoder.layer.3.output.dense.weight', 'text_decoder.encoder.layer.3.output.dense.bias', 'text_decoder.encoder.layer.3.output.LayerNorm.weight', 'text_decoder.encoder.layer.3.output.LayerNorm.bias', 'text_decoder.encoder.layer.4.attention.self.query.weight', 'text_decoder.encoder.layer.4.attention.self.query.bias', 'text_decoder.encoder.layer.4.attention.self.key.weight', 'text_decoder.encoder.layer.4.attention.self.key.bias', 'text_decoder.encoder.layer.4.attention.self.value.weight', 'text_decoder.encoder.layer.4.attention.self.value.bias', 'text_decoder.encoder.layer.4.attention.output.dense.weight', 
'text_decoder.encoder.layer.4.attention.output.dense.bias', 'text_decoder.encoder.layer.4.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.4.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.4.crossattention.self.query.weight', 'text_decoder.encoder.layer.4.crossattention.self.query.bias', 'text_decoder.encoder.layer.4.crossattention.self.key.weight', 'text_decoder.encoder.layer.4.crossattention.self.key.bias', 'text_decoder.encoder.layer.4.crossattention.self.value.weight', 'text_decoder.encoder.layer.4.crossattention.self.value.bias', 'text_decoder.encoder.layer.4.crossattention.output.dense.weight', 'text_decoder.encoder.layer.4.crossattention.output.dense.bias', 'text_decoder.encoder.layer.4.crossattention.output.LayerNorm.weight', 'text_decoder.encoder.layer.4.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.4.intermediate.dense.weight', 'text_decoder.encoder.layer.4.intermediate.dense.bias', 'text_decoder.encoder.layer.4.output.dense.weight', 'text_decoder.encoder.layer.4.output.dense.bias', 'text_decoder.encoder.layer.4.output.LayerNorm.weight', 'text_decoder.encoder.layer.4.output.LayerNorm.bias', 'text_decoder.encoder.layer.5.attention.self.query.weight', 'text_decoder.encoder.layer.5.attention.self.query.bias', 'text_decoder.encoder.layer.5.attention.self.key.weight', 'text_decoder.encoder.layer.5.attention.self.key.bias', 'text_decoder.encoder.layer.5.attention.self.value.weight', 'text_decoder.encoder.layer.5.attention.self.value.bias', 'text_decoder.encoder.layer.5.attention.output.dense.weight', 'text_decoder.encoder.layer.5.attention.output.dense.bias', 'text_decoder.encoder.layer.5.attention.output.LayerNorm.weight', 'text_decoder.encoder.layer.5.attention.output.LayerNorm.bias', 'text_decoder.encoder.layer.5.crossattention.self.query.weight', 'text_decoder.encoder.layer.5.crossattention.self.query.bias', 'text_decoder.encoder.layer.5.crossattention.self.key.weight', 
'text_decoder.encoder.layer.5.crossattention.self.key.bias', 'text_decoder.encoder.layer.5.crossattention.self.value.weight', 'text_decoder.encoder.layer.5.crossattention.self.value.bias', 'text_decoder.encoder.layer.5.crossattention.output.dense.weight', 'text_decoder.encoder.layer.5.crossattention.output.dense.bias', 'text_decoder.encoder.layer.5.crossattention.output.LayerNorm.weight', 'text_decoder.encoder.layer.5.crossattention.output.LayerNorm.bias', 'text_decoder.encoder.layer.5.intermediate.dense.weight', 'text_decoder.encoder.layer.5.intermediate.dense.bias', 'text_decoder.encoder.layer.5.output.dense.weight', 'text_decoder.encoder.layer.5.output.dense.bias', 'text_decoder.encoder.layer.5.output.LayerNorm.weight', 'text_decoder.encoder.layer.5.output.LayerNorm.bias'])

When loading vqa.pth, there are also many missing keys. The model still continues to train, but is this normal, or is the model being loaded incorrectly?
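A missing key can mean either that the weight is really absent or that it was saved under a different name. One way to tell these apart is to compare the model's key set with the checkpoint's key set and group the differences by top-level module. This is a minimal sketch that assumes the fine-tuned checkpoint is a dict whose `'model'` entry holds the state dict; check your own save code to confirm. `summarize_key_mismatch` is a hypothetical helper and is not part of ALBEF.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_key_mismatch(
    model_keys: Iterable[str], ckpt_keys: Iterable[str], depth: int = 1
) -> Dict[str, Dict[str, int]]:
    """Count mismatched keys per leading dotted prefix.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not
    """
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)

    def prefix(key: str) -> str:
        # Keep only the first `depth` dotted components, e.g. 'text_decoder_m'.
        return ".".join(key.split(".")[:depth])

    missing = Counter(prefix(k) for k in model_set - ckpt_set)
    unexpected = Counter(prefix(k) for k in ckpt_set - model_set)
    return {"missing": dict(missing), "unexpected": dict(unexpected)}


# Real usage (assumption: weights are stored under ckpt["model"]):
#   ckpt = torch.load(path, map_location="cpu")
#   print(summarize_key_mismatch(model.state_dict().keys(), ckpt["model"].keys()))
```

Sometimes a module appears under `missing` and a similarly named prefix appears under `unexpected`. In that case the weights are probably present but renamed, not lost. One example is a stray `module.` prefix left over from saving a `DistributedDataParallel` wrapper. The `unexpected_keys` field that `load_state_dict(..., strict=False)` returns next to `missing_keys` gives the same information.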

By the way, I also want to ask about the VQA checkpoint you provide in the repo. It seems to me that the checkpoint doesn't contain a teacher model, but the training details in the paper say that distillation is used for downstream tasks. My understanding is that during fine-tuning on the VQAv2 dataset, distillation is still set to True, but the teacher model is not saved in the checkpoint. Is that correct?
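You can check directly whether a checkpoint carries teacher (momentum) weights by listing its top-level module names. This sketch assumes the momentum modules use a `_m` suffix, as the `text_decoder_m.*` keys in the log above suggest. `momentum_prefixes` is a hypothetical helper.

```python
from typing import Iterable, List


def momentum_prefixes(ckpt_keys: Iterable[str], suffix: str = "_m") -> List[str]:
    """Return sorted top-level module names ending in `suffix`.

    Assumption: teacher/momentum modules are named like 'text_decoder_m'.
    """
    tops = {key.split(".", 1)[0] for key in ckpt_keys}
    return sorted(name for name in tops if name.endswith(suffix))


# Real usage (assumption: weights are stored under ckpt["model"]):
#   ckpt = torch.load("vqa.pth", map_location="cpu")
#   print(momentum_prefixes(ckpt["model"].keys()))
```

An empty list means the checkpoint stores no weights under this naming convention.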

salesforce / ALBEF, issue #101