Cannot Load .pt model

System Info

transformers version: 4.42.3
Platform: Windows-10-10.0.22631-SP0
Python version: 3.11.0
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.2
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.3.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA GeForce RTX 4060 Laptop GPU

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Fine-tuned the model using the notebook at https://www.kaggle.com/code/chlorinecl/notebook4101d69eb6

Downloaded the resulting .pt file and tried to load it with:

import torch
from transformers import AutoProcessor, SeamlessM4TModel
new_model = torch.load("./expt4_m4tM.pt")
processor = AutoProcessor.from_pretrained("seamless-m4t-medium")
model_seam = SeamlessM4TModel.from_pretrained("seamless-m4t-medium")
model_seam.load_state_dict(new_model)
model_seam.save_pretrained("./new_seamless-m4t-medium")
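
Before calling load_state_dict, it can help to check what the .pt file actually contains, since load_state_dict expects a mapping whose keys match the model's own state_dict() keys. What the notebook saved is not shown in this report, so the following is only a minimal sketch of such a check. The helper names, the wrapper key names ("state_dict", "model_state_dict", "model") and the stand-in data are assumptions for illustration, not taken from the notebook or from transformers. The "module." prefix is the one nn.DataParallel adds to parameter names.

```python
def unwrap_checkpoint(obj, wrapper_keys=("state_dict", "model_state_dict", "model")):
    """Return the mapping that most likely holds the weights.

    wrapper_keys are guesses at common conventions, not names from the notebook.
    """
    # A whole pickled nn.Module (torch.save(model, path)) exposes state_dict().
    if callable(getattr(obj, "state_dict", None)):
        return obj.state_dict()
    # A training checkpoint often nests the weights under one wrapper key.
    if isinstance(obj, dict):
        for key in wrapper_keys:
            inner = obj.get(key)
            if isinstance(inner, dict):
                return inner
    return obj


def strip_prefix(state, prefix):
    """Drop a leading prefix such as "module." from every key that has it."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state.items()
    }


def diff_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected) in the sense load_state_dict reports them."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    return sorted(model_keys - ckpt_keys), sorted(ckpt_keys - model_keys)


# Stand-in data so the sketch runs without torch or the real checkpoint.
model_keys = ["shared.weight", "text_encoder.layer_norm.weight"]
fake_ckpt = {
    "epoch": 3,
    "model_state_dict": {
        "module.shared.weight": 0.0,
        "module.text_encoder.layer_norm.weight": 0.0,
    },
}

state = unwrap_checkpoint(fake_ckpt)
missing_before, unexpected_before = diff_keys(model_keys, state)
missing_after, unexpected_after = diff_keys(model_keys, strip_prefix(state, "module."))
print("before:", missing_before, unexpected_before)
print("after: ", missing_after, unexpected_after)
```

With the real objects, the same helpers would be applied to the result of torch.load("./expt4_m4tM.pt", map_location="cpu") and to model_seam.state_dict().keys(). If nearly every model key is reported as missing, the file probably does not hold the weights under the names SeamlessM4TModel expects.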

Running the script prints:

D:\projects\GNNNER\venv\Lib\site-packages\transformers\deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "D:\projects\GNNNER\convert_bin_to_pt.py", line 6, in <module>
    model_seam.load_state_dict(new_model)
  File "D:\projects\GNNNER\venv\Lib\site-packages\torch\nn\modules\module.py", line 2189, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SeamlessM4TModel:
Missing key(s) in state_dict: "shared.weight", "text_encoder.embed_tokens.weight", "text_encoder.layers.0.self_attn.k_proj.weight", "text_encoder.layers.0.self_attn.k_proj.bias", "text_encoder.layers.0.self_attn.v_proj.weight", "text_encoder.layers.0.self_attn.v_proj.bias", "text_encoder.layers.0.self_attn.q_proj.weight", "text_encoder.layers.0.self_attn.q_proj.bias", "text_encoder.layers.0.self_attn.out_proj.weight", "text_encoder.layers.0.self_attn.out_proj.bias", "text_encoder.layers.0.self_attn_layer_norm.weight", "text_encoder.layers.0.self_attn_layer_norm.bias", "text_encoder.layers.0.ffn.fc1.weight", "text_encoder.layers.0.ffn.fc1.bias", "text_encoder.layers.0.ffn.fc2.weight", "text_encoder.layers.0.ffn.fc2.bias", "text_encoder.layers.0.ffn_layer_norm.weight", "text_encoder.layers.0.ffn_layer_norm.bias", "text_encoder.layers.1.self_attn.k_proj.weight", "text_encoder.layers.1.self_attn.k_proj.bias", "text_encoder.layers.1.self_attn.v_proj.weight", "text_encoder.layers.1.self_attn.v_proj.bias", "text_encoder.layers.1.self_attn.q_proj.weight", "text_encoder.layers.1.self_attn.q_proj.bias", "text_encoder.layers.1.self_attn.out_proj.weight", "text_encoder.layers.1.self_attn.out_proj.bias", "text_encoder.layers.1.self_attn_layer_norm.weight", "text_encoder.layers.1.self_attn_layer_norm.bias", "text_encoder.layers.1.ffn.fc1.weight", "text_encoder.layers.1.ffn.fc1.bias", "text_encoder.layers.1.ffn.fc2.weight", "text_encoder.layers.1.ffn.fc2.bias", "text_encoder.layers.1.ffn_layer_norm.weight", "text_encoder.layers.1.ffn_layer_norm.bias", "text_encoder.layers.2.self_attn.k_proj.weight", "text_encoder.layers.2.self_attn.k_proj.bias", "text_encoder.layers.2.self_attn.v_proj.weight", "text_encoder.layers.2.self_attn.v_proj.bias", "text_encoder.layers.2.self_attn.q_proj.weight", "text_encoder.layers.2.self_attn.q_proj.bias", "text_encoder.layers.2.self_attn.out_proj.weight", "text_encoder.layers.2.self_attn.out_proj.bias", 
"text_encoder.layers.2.self_attn_layer_norm.weight", "text_encoder.layers.2.self_attn_layer_norm.bias", "text_encoder.layers.2.ffn.fc1.weight", "text_encoder.layers.2.ffn.fc1.bias", "text_encoder.layers.2.ffn.fc2.weight", "text_encoder.layers.2.ffn.fc2.bias", "text_encoder.layers.2.ffn_layer_norm.weight", "text_encoder.layers.2.ffn_layer_norm.bias", "text_encoder.layers.3.self_attn.k_proj.weight", "text_encoder.layers.3.self_attn.k_proj.bias", "text_encoder.layers.3.self_attn.v_proj.weight", "text_encoder.layers.3.self_attn.v_proj.bias", "text_encoder.layers.3.self_attn.q_proj.weight", "text_encoder.layers.3.self_attn.q_proj.bias", "text_encoder.layers.3.self_attn.out_proj.weight", "text_encoder.layers.3.self_attn.out_proj.bias", "text_encoder.layers.3.self_attn_layer_norm.weight", "text_encoder.layers.3.self_attn_layer_norm.bias", "text_encoder.layers.3.ffn.fc1.weight", "text_encoder.layers.3.ffn.fc1.bias", "text_encoder.layers.3.ffn.fc2.weight", "text_encoder.layers.3.ffn.fc2.bias", "text_encoder.layers.3.ffn_layer_norm.weight", "text_encoder.layers.3.ffn_layer_norm.bias", "text_encoder.layers.4.self_attn.k_proj.weight", "text_encoder.layers.4.self_attn.k_proj.bias", "text_encoder.layers.4.self_attn.v_proj.weight", "text_encoder.layers.4.self_attn.v_proj.bias", "text_encoder.layers.4.self_attn.q_proj.weight", "text_encoder.layers.4.self_attn.q_proj.bias", "text_encoder.layers.4.self_attn.out_proj.weight", "text_encoder.layers.4.self_attn.out_proj.bias", "text_encoder.layers.4.self_attn_layer_norm.weight", "text_encoder.layers.4.self_attn_layer_norm.bias", "text_encoder.layers.4.ffn.fc1.weight", "text_encoder.layers.4.ffn.fc1.bias", "text_encoder.layers.4.ffn.fc2.weight", "text_encoder.layers.4.ffn.fc2.bias", "text_encoder.layers.4.ffn_layer_norm.weight", "text_encoder.layers.4.ffn_layer_norm.bias", "text_encoder.layers.5.self_attn.k_proj.weight", "text_encoder.layers.5.self_attn.k_proj.bias", "text_encoder.layers.5.self_attn.v_proj.weight", 
"text_encoder.layers.5.self_attn.v_proj.bias", "text_encoder.layers.5.self_attn.q_proj.weight", "text_encoder.layers.5.self_attn.q_proj.bias", "text_encoder.layers.5.self_attn.out_proj.weight", "text_encoder.layers.5.self_attn.out_proj.bias", "text_encoder.layers.5.self_attn_layer_norm.weight", "text_encoder.layers.5.self_attn_layer_norm.bias", "text_encoder.layers.5.ffn.fc1.weight", "text_encoder.layers.5.ffn.fc1.bias", "text_encoder.layers.5.ffn.fc2.weight", "text_encoder.layers.5.ffn.fc2.bias", "text_encoder.layers.5.ffn_layer_norm.weight", "text_encoder.layers.5.ffn_layer_norm.bias", "text_encoder.layers.6.self_attn.k_proj.weight", "text_encoder.layers.6.self_attn.k_proj.bias", "text_encoder.layers.6.self_attn.v_proj.weight", "text_encoder.layers.6.self_attn.v_proj.bias", "text_encoder.layers.6.self_attn.q_proj.weight", "text_encoder.layers.6.self_attn.q_proj.bias", "text_encoder.layers.6.self_attn.out_proj.weight", "text_encoder.layers.6.self_attn.out_proj.bias", "text_encoder.layers.6.self_attn_layer_norm.weight", "text_encoder.layers.6.self_attn_layer_norm.bias", "text_encoder.layers.6.ffn.fc1.weight", "text_encoder.layers.6.ffn.fc1.bias", "text_encoder.layers.6.ffn.fc2.weight", "text_encoder.layers.6.ffn.fc2.bias", "text_encoder.layers.6.ffn_layer_norm.weight", "text_encoder.layers.6.ffn_layer_norm.bias", "text_encoder.layers.7.self_attn.k_proj.weight", "text_encoder.layers.7.self_attn.k_proj.bias", "text_encoder.layers.7.self_attn.v_proj.weight", "text_encoder.layers.7.self_attn.v_proj.bias", "text_encoder.layers.7.self_attn.q_proj.weight", "text_encoder.layers.7.self_attn.q_proj.bias", "text_encoder.layers.7.self_attn.out_proj.weight", "text_encoder.layers.7.self_attn.out_proj.bias", "text_encoder.layers.7.self_attn_layer_norm.weight", "text_encoder.layers.7.self_attn_layer_norm.bias", "text_encoder.layers.7.ffn.fc1.weight", "text_encoder.layers.7.ffn.fc1.bias", "text_encoder.layers.7.ffn.fc2.weight", "text_encoder.layers.7.ffn.fc2.bias", 
"text_encoder.layers.7.ffn_layer_norm.weight", "text_encoder.layers.7.ffn_layer_norm.bias", "text_encoder.layers.8.self_attn.k_proj.weight", "text_encoder.layers.8.self_attn.k_proj.bias", "text_encoder.layers.8.self_attn.v_proj.weight", "text_encoder.layers.8.self_attn.v_proj.bias", "text_encoder.layers.8.self_attn.q_proj.weight", "text_encoder.layers.8.self_attn.q_proj.bias", "text_encoder.layers.8.self_attn.out_proj.weight", "text_encoder.layers.8.self_attn.out_proj.bias", "text_encoder.layers.8.self_attn_layer_norm.weight", "text_encoder.layers.8.self_attn_layer_norm.bias", "text_encoder.layers.8.ffn.fc1.weight", "text_encoder.layers.8.ffn.fc1.bias", "text_encoder.layers.8.ffn.fc2.weight", "text_encoder.layers.8.ffn.fc2.bias", "text_encoder.layers.8.ffn_layer_norm.weight", "text_encoder.layers.8.ffn_layer_norm.bias", "text_encoder.layers.9.self_attn.k_proj.weight", "text_encoder.layers.9.self_attn.k_proj.bias", "text_encoder.layers.9.self_attn.v_proj.weight", "text_encoder.layers.9.self_attn.v_proj.bias", "text_encoder.layers.9.self_attn.q_proj.weight", "text_encoder.layers.9.self_attn.q_proj.bias", "text_encoder.layers.9.self_attn.out_proj.weight", "text_encoder.layers.9.self_attn.out_proj.bias", "text_encoder.layers.9.self_attn_layer_norm.weight", "text_encoder.layers.9.self_attn_layer_norm.bias", "text_encoder.layers.9.ffn.fc1.weight", "text_encoder.layers.9.ffn.fc1.bias", "text_encoder.layers.9.ffn.fc2.weight", "text_encoder.layers.9.ffn.fc2.bias", "text_encoder.layers.9.ffn_layer_norm.weight", "text_encoder.layers.9.ffn_layer_norm.bias", "text_encoder.layers.10.self_attn.k_proj.weight", "text_encoder.layers.10.self_attn.k_proj.bias", "text_encoder.layers.10.self_attn.v_proj.weight", "text_encoder.layers.10.self_attn.v_proj.bias", "text_encoder.layers.10.self_attn.q_proj.weight", "text_encoder.layers.10.self_attn.q_proj.bias", "text_encoder.layers.10.self_attn.out_proj.weight", "text_encoder.layers.10.self_attn.out_proj.bias", 
"text_encoder.layers.10.self_attn_layer_norm.weight", "text_encoder.layers.10.self_attn_layer_norm.bias", "text_encoder.layers.10.ffn.fc1.weight", "text_encoder.layers.10.ffn.fc1.bias", "text_encoder.layers.10.ffn.fc2.weight", "text_encoder.layers.10.ffn.fc2.bias", "text_encoder.layers.10.ffn_layer_norm.weight", "text_encoder.layers.10.ffn_layer_norm.bias", "text_encoder.layers.11.self_attn.k_proj.weight", "text_encoder.layers.11.self_attn.k_proj.bias", "text_encoder.layers.11.self_attn.v_proj.weight", "text_encoder.layers.11.self_attn.v_proj.bias", "text_encoder.layers.11.self_attn.q_proj.weight", "text_encoder.layers.11.self_attn.q_proj.bias", "text_encoder.layers.11.self_attn.out_proj.weight", "text_encoder.layers.11.self_attn.out_proj.bias", "text_encoder.layers.11.self_attn_layer_norm.weight", "text_encoder.layers.11.self_attn_layer_norm.bias", "text_encoder.layers.11.ffn.fc1.weight", "text_encoder.layers.11.ffn.fc1.bias", "text_encoder.layers.11.ffn.fc2.weight", "text_encoder.layers.11.ffn.fc2.bias", "text_encoder.layers.11.ffn_layer_norm.weight", "text_encoder.layers.11.ffn_layer_norm.bias", "text_encoder.layer_norm.weight", "text_encoder.layer_norm.bias", "speech_encoder.feature_projection.layer_norm.weight", "speech_encoder.feature_projection.layer_norm.bias", "speech_encoder.feature_projection.projection.weight", "speech_encoder.feature_projection.projection.bias", "speech_encoder.encoder.layers.0.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.0.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.0.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.0.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.0.ffn1.output_dense.weight", "speech_encoder.encoder.layers.0.ffn1.output_dense.bias", "speech_encoder.encoder.layers.0.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.0.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.0.self_attn.pos_bias_u", "speech_encoder.encoder.layers.0.self_attn.pos_bias_v", 
"speech_encoder.encoder.layers.0.self_attn.linear_q.weight", "speech_encoder.encoder.layers.0.self_attn.linear_q.bias", "speech_encoder.encoder.layers.0.self_attn.linear_k.weight", "speech_encoder.encoder.layers.0.self_attn.linear_k.bias", "speech_encoder.encoder.layers.0.self_attn.linear_v.weight", "speech_encoder.encoder.layers.0.self_attn.linear_v.bias", "speech_encoder.encoder.layers.0.self_attn.linear_out.weight", "speech_encoder.encoder.layers.0.self_attn.linear_out.bias", "speech_encoder.encoder.layers.0.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.0.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.0.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.0.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.0.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.0.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.0.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.0.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.0.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.0.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.0.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.0.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.0.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.0.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.0.ffn2.output_dense.weight", "speech_encoder.encoder.layers.0.ffn2.output_dense.bias", "speech_encoder.encoder.layers.0.final_layer_norm.weight", "speech_encoder.encoder.layers.0.final_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.1.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.1.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.1.ffn1.output_dense.weight", "speech_encoder.encoder.layers.1.ffn1.output_dense.bias", 
"speech_encoder.encoder.layers.1.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.1.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.1.self_attn.pos_bias_u", "speech_encoder.encoder.layers.1.self_attn.pos_bias_v", "speech_encoder.encoder.layers.1.self_attn.linear_q.weight", "speech_encoder.encoder.layers.1.self_attn.linear_q.bias", "speech_encoder.encoder.layers.1.self_attn.linear_k.weight", "speech_encoder.encoder.layers.1.self_attn.linear_k.bias", "speech_encoder.encoder.layers.1.self_attn.linear_v.weight", "speech_encoder.encoder.layers.1.self_attn.linear_v.bias", "speech_encoder.encoder.layers.1.self_attn.linear_out.weight", "speech_encoder.encoder.layers.1.self_attn.linear_out.bias", "speech_encoder.encoder.layers.1.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.1.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.1.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.1.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.1.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.1.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.1.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.1.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.1.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.1.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.1.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.1.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.1.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.1.ffn2.output_dense.weight", "speech_encoder.encoder.layers.1.ffn2.output_dense.bias", "speech_encoder.encoder.layers.1.final_layer_norm.weight", "speech_encoder.encoder.layers.1.final_layer_norm.bias", "speech_encoder.encoder.layers.2.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.2.ffn1_layer_norm.bias", 
"speech_encoder.encoder.layers.2.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.2.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.2.ffn1.output_dense.weight", "speech_encoder.encoder.layers.2.ffn1.output_dense.bias", "speech_encoder.encoder.layers.2.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.2.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.2.self_attn.pos_bias_u", "speech_encoder.encoder.layers.2.self_attn.pos_bias_v", "speech_encoder.encoder.layers.2.self_attn.linear_q.weight", "speech_encoder.encoder.layers.2.self_attn.linear_q.bias", "speech_encoder.encoder.layers.2.self_attn.linear_k.weight", "speech_encoder.encoder.layers.2.self_attn.linear_k.bias", "speech_encoder.encoder.layers.2.self_attn.linear_v.weight", "speech_encoder.encoder.layers.2.self_attn.linear_v.bias", "speech_encoder.encoder.layers.2.self_attn.linear_out.weight", "speech_encoder.encoder.layers.2.self_attn.linear_out.bias", "speech_encoder.encoder.layers.2.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.2.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.2.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.2.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.2.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.2.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.2.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.2.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.2.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.2.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.2.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.2.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.2.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.2.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.2.ffn2.output_dense.weight", "speech_encoder.encoder.layers.2.ffn2.output_dense.bias", 
"speech_encoder.encoder.layers.2.final_layer_norm.weight", "speech_encoder.encoder.layers.2.final_layer_norm.bias", "speech_encoder.encoder.layers.3.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.3.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.3.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.3.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.3.ffn1.output_dense.weight", "speech_encoder.encoder.layers.3.ffn1.output_dense.bias", "speech_encoder.encoder.layers.3.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.3.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.3.self_attn.pos_bias_u", "speech_encoder.encoder.layers.3.self_attn.pos_bias_v", "speech_encoder.encoder.layers.3.self_attn.linear_q.weight", "speech_encoder.encoder.layers.3.self_attn.linear_q.bias", "speech_encoder.encoder.layers.3.self_attn.linear_k.weight", "speech_encoder.encoder.layers.3.self_attn.linear_k.bias", "speech_encoder.encoder.layers.3.self_attn.linear_v.weight", "speech_encoder.encoder.layers.3.self_attn.linear_v.bias", "speech_encoder.encoder.layers.3.self_attn.linear_out.weight", "speech_encoder.encoder.layers.3.self_attn.linear_out.bias", "speech_encoder.encoder.layers.3.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.3.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.3.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.3.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.3.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.3.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.3.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.3.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.3.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.3.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.3.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.3.ffn2_layer_norm.bias", 
"speech_encoder.encoder.layers.3.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.3.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.3.ffn2.output_dense.weight", "speech_encoder.encoder.layers.3.ffn2.output_dense.bias", "speech_encoder.encoder.layers.3.final_layer_norm.weight", "speech_encoder.encoder.layers.3.final_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.4.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.4.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.4.ffn1.output_dense.weight", "speech_encoder.encoder.layers.4.ffn1.output_dense.bias", "speech_encoder.encoder.layers.4.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.4.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.4.self_attn.pos_bias_u", "speech_encoder.encoder.layers.4.self_attn.pos_bias_v", "speech_encoder.encoder.layers.4.self_attn.linear_q.weight", "speech_encoder.encoder.layers.4.self_attn.linear_q.bias", "speech_encoder.encoder.layers.4.self_attn.linear_k.weight", "speech_encoder.encoder.layers.4.self_attn.linear_k.bias", "speech_encoder.encoder.layers.4.self_attn.linear_v.weight", "speech_encoder.encoder.layers.4.self_attn.linear_v.bias", "speech_encoder.encoder.layers.4.self_attn.linear_out.weight", "speech_encoder.encoder.layers.4.self_attn.linear_out.bias", "speech_encoder.encoder.layers.4.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.4.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.4.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.4.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.4.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.4.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.4.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.4.conv_module.batch_norm.running_mean", 
"speech_encoder.encoder.layers.4.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.4.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.4.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.4.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.4.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.4.ffn2.output_dense.weight", "speech_encoder.encoder.layers.4.ffn2.output_dense.bias", "speech_encoder.encoder.layers.4.final_layer_norm.weight", "speech_encoder.encoder.layers.4.final_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.5.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.5.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.5.ffn1.output_dense.weight", "speech_encoder.encoder.layers.5.ffn1.output_dense.bias", "speech_encoder.encoder.layers.5.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.5.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.5.self_attn.pos_bias_u", "speech_encoder.encoder.layers.5.self_attn.pos_bias_v", "speech_encoder.encoder.layers.5.self_attn.linear_q.weight", "speech_encoder.encoder.layers.5.self_attn.linear_q.bias", "speech_encoder.encoder.layers.5.self_attn.linear_k.weight", "speech_encoder.encoder.layers.5.self_attn.linear_k.bias", "speech_encoder.encoder.layers.5.self_attn.linear_v.weight", "speech_encoder.encoder.layers.5.self_attn.linear_v.bias", "speech_encoder.encoder.layers.5.self_attn.linear_out.weight", "speech_encoder.encoder.layers.5.self_attn.linear_out.bias", "speech_encoder.encoder.layers.5.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.5.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.5.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.5.conv_module.pointwise_conv1.weight", 
"speech_encoder.encoder.layers.5.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.5.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.5.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.5.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.5.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.5.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.5.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.5.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.5.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.5.ffn2.output_dense.weight", "speech_encoder.encoder.layers.5.ffn2.output_dense.bias", "speech_encoder.encoder.layers.5.final_layer_norm.weight", "speech_encoder.encoder.layers.5.final_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.6.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.6.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.6.ffn1.output_dense.weight", "speech_encoder.encoder.layers.6.ffn1.output_dense.bias", "speech_encoder.encoder.layers.6.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.6.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.6.self_attn.pos_bias_u", "speech_encoder.encoder.layers.6.self_attn.pos_bias_v", "speech_encoder.encoder.layers.6.self_attn.linear_q.weight", "speech_encoder.encoder.layers.6.self_attn.linear_q.bias", "speech_encoder.encoder.layers.6.self_attn.linear_k.weight", "speech_encoder.encoder.layers.6.self_attn.linear_k.bias", "speech_encoder.encoder.layers.6.self_attn.linear_v.weight", "speech_encoder.encoder.layers.6.self_attn.linear_v.bias", "speech_encoder.encoder.layers.6.self_attn.linear_out.weight", "speech_encoder.encoder.layers.6.self_attn.linear_out.bias", 
"speech_encoder.encoder.layers.6.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.6.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.6.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.6.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.6.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.6.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.6.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.6.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.6.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.6.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.6.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.6.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.6.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.6.ffn2.output_dense.weight", "speech_encoder.encoder.layers.6.ffn2.output_dense.bias", "speech_encoder.encoder.layers.6.final_layer_norm.weight", "speech_encoder.encoder.layers.6.final_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.7.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.7.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.7.ffn1.output_dense.weight", "speech_encoder.encoder.layers.7.ffn1.output_dense.bias", "speech_encoder.encoder.layers.7.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.7.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.7.self_attn.pos_bias_u", "speech_encoder.encoder.layers.7.self_attn.pos_bias_v", "speech_encoder.encoder.layers.7.self_attn.linear_q.weight", "speech_encoder.encoder.layers.7.self_attn.linear_q.bias", "speech_encoder.encoder.layers.7.self_attn.linear_k.weight", "speech_encoder.encoder.layers.7.self_attn.linear_k.bias", 
"speech_encoder.encoder.layers.7.self_attn.linear_v.weight", "speech_encoder.encoder.layers.7.self_attn.linear_v.bias", "speech_encoder.encoder.layers.7.self_attn.linear_out.weight", "speech_encoder.encoder.layers.7.self_attn.linear_out.bias", "speech_encoder.encoder.layers.7.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.7.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.7.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.7.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.7.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.7.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.7.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.7.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.7.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.7.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.7.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.7.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.7.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.7.ffn2.output_dense.weight", "speech_encoder.encoder.layers.7.ffn2.output_dense.bias", "speech_encoder.encoder.layers.7.final_layer_norm.weight", "speech_encoder.encoder.layers.7.final_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.8.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.8.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.8.ffn1.output_dense.weight", "speech_encoder.encoder.layers.8.ffn1.output_dense.bias", "speech_encoder.encoder.layers.8.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.8.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.8.self_attn.pos_bias_u", "speech_encoder.encoder.layers.8.self_attn.pos_bias_v", 
"speech_encoder.encoder.layers.8.self_attn.linear_q.weight", "speech_encoder.encoder.layers.8.self_attn.linear_q.bias", "speech_encoder.encoder.layers.8.self_attn.linear_k.weight", "speech_encoder.encoder.layers.8.self_attn.linear_k.bias", "speech_encoder.encoder.layers.8.self_attn.linear_v.weight", "speech_encoder.encoder.layers.8.self_attn.linear_v.bias", "speech_encoder.encoder.layers.8.self_attn.linear_out.weight", "speech_encoder.encoder.layers.8.self_attn.linear_out.bias", "speech_encoder.encoder.layers.8.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.8.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.8.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.8.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.8.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.8.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.8.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.8.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.8.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.8.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.8.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.8.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.8.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.8.ffn2.output_dense.weight", "speech_encoder.encoder.layers.8.ffn2.output_dense.bias", "speech_encoder.encoder.layers.8.final_layer_norm.weight", "speech_encoder.encoder.layers.8.final_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.9.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.9.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.9.ffn1.output_dense.weight", "speech_encoder.encoder.layers.9.ffn1.output_dense.bias", 
"speech_encoder.encoder.layers.9.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.9.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.9.self_attn.pos_bias_u", "speech_encoder.encoder.layers.9.self_attn.pos_bias_v", "speech_encoder.encoder.layers.9.self_attn.linear_q.weight", "speech_encoder.encoder.layers.9.self_attn.linear_q.bias", "speech_encoder.encoder.layers.9.self_attn.linear_k.weight", "speech_encoder.encoder.layers.9.self_attn.linear_k.bias", "speech_encoder.encoder.layers.9.self_attn.linear_v.weight", "speech_encoder.encoder.layers.9.self_attn.linear_v.bias", "speech_encoder.encoder.layers.9.self_attn.linear_out.weight", "speech_encoder.encoder.layers.9.self_attn.linear_out.bias", "speech_encoder.encoder.layers.9.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.9.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.9.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.9.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.9.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.9.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.9.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.9.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.9.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.9.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.9.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.9.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.9.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.9.ffn2.output_dense.weight", "speech_encoder.encoder.layers.9.ffn2.output_dense.bias", "speech_encoder.encoder.layers.9.final_layer_norm.weight", "speech_encoder.encoder.layers.9.final_layer_norm.bias", "speech_encoder.encoder.layers.10.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.10.ffn1_layer_norm.bias", 
"speech_encoder.encoder.layers.10.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.10.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.10.ffn1.output_dense.weight", "speech_encoder.encoder.layers.10.ffn1.output_dense.bias", "speech_encoder.encoder.layers.10.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.10.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.10.self_attn.pos_bias_u", "speech_encoder.encoder.layers.10.self_attn.pos_bias_v", "speech_encoder.encoder.layers.10.self_attn.linear_q.weight", "speech_encoder.encoder.layers.10.self_attn.linear_q.bias", "speech_encoder.encoder.layers.10.self_attn.linear_k.weight", "speech_encoder.encoder.layers.10.self_attn.linear_k.bias", "speech_encoder.encoder.layers.10.self_attn.linear_v.weight", "speech_encoder.encoder.layers.10.self_attn.linear_v.bias", "speech_encoder.encoder.layers.10.self_attn.linear_out.weight", "speech_encoder.encoder.layers.10.self_attn.linear_out.bias", "speech_encoder.encoder.layers.10.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.10.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.10.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.10.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.10.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.10.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.10.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.10.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.10.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.10.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.10.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.10.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.10.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.10.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.10.ffn2.output_dense.weight", 
"speech_encoder.encoder.layers.10.ffn2.output_dense.bias", "speech_encoder.encoder.layers.10.final_layer_norm.weight", "speech_encoder.encoder.layers.10.final_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.11.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.11.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.11.ffn1.output_dense.weight", "speech_encoder.encoder.layers.11.ffn1.output_dense.bias", "speech_encoder.encoder.layers.11.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.11.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.11.self_attn.pos_bias_u", "speech_encoder.encoder.layers.11.self_attn.pos_bias_v", "speech_encoder.encoder.layers.11.self_attn.linear_q.weight", "speech_encoder.encoder.layers.11.self_attn.linear_q.bias", "speech_encoder.encoder.layers.11.self_attn.linear_k.weight", "speech_encoder.encoder.layers.11.self_attn.linear_k.bias", "speech_encoder.encoder.layers.11.self_attn.linear_v.weight", "speech_encoder.encoder.layers.11.self_attn.linear_v.bias", "speech_encoder.encoder.layers.11.self_attn.linear_out.weight", "speech_encoder.encoder.layers.11.self_attn.linear_out.bias", "speech_encoder.encoder.layers.11.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.11.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.11.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.11.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.11.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.11.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.11.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.11.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.11.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.11.conv_module.pointwise_conv2.weight", 
"speech_encoder.encoder.layers.11.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.11.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.11.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.11.ffn2.output_dense.weight", "speech_encoder.encoder.layers.11.ffn2.output_dense.bias", "speech_encoder.encoder.layers.11.final_layer_norm.weight", "speech_encoder.encoder.layers.11.final_layer_norm.bias", "speech_encoder.encoder.layer_norm.weight", "speech_encoder.encoder.layer_norm.bias", "speech_encoder.intermediate_ffn.intermediate_dense.weight", "speech_encoder.intermediate_ffn.intermediate_dense.bias", "speech_encoder.intermediate_ffn.output_dense.weight", "speech_encoder.intermediate_ffn.output_dense.bias", "speech_encoder.adapter.layers.0.residual_layer_norm.weight", "speech_encoder.adapter.layers.0.residual_layer_norm.bias", "speech_encoder.adapter.layers.0.residual_conv.weight", "speech_encoder.adapter.layers.0.residual_conv.bias", "speech_encoder.adapter.layers.0.self_attn_layer_norm.weight", "speech_encoder.adapter.layers.0.self_attn_layer_norm.bias", "speech_encoder.adapter.layers.0.self_attn_conv.weight", "speech_encoder.adapter.layers.0.self_attn_conv.bias", "speech_encoder.adapter.layers.0.self_attn.linear_q.weight", "speech_encoder.adapter.layers.0.self_attn.linear_q.bias", "speech_encoder.adapter.layers.0.self_attn.linear_k.weight", "speech_encoder.adapter.layers.0.self_attn.linear_k.bias", "speech_encoder.adapter.layers.0.self_attn.linear_v.weight", "speech_encoder.adapter.layers.0.self_attn.linear_v.bias", "speech_encoder.adapter.layers.0.self_attn.linear_out.weight", "speech_encoder.adapter.layers.0.self_attn.linear_out.bias", "speech_encoder.adapter.layers.0.ffn_layer_norm.weight", "speech_encoder.adapter.layers.0.ffn_layer_norm.bias", "speech_encoder.adapter.layers.0.ffn.intermediate_dense.weight", "speech_encoder.adapter.layers.0.ffn.intermediate_dense.bias", 
"speech_encoder.adapter.layers.0.ffn.output_dense.weight", "speech_encoder.adapter.layers.0.ffn.output_dense.bias", "speech_encoder.inner_layer_norm.weight", "speech_encoder.inner_layer_norm.bias", "text_decoder.embed_tokens.weight", "text_decoder.layers.0.self_attn.k_proj.weight", "text_decoder.layers.0.self_attn.k_proj.bias", "text_decoder.layers.0.self_attn.v_proj.weight", "text_decoder.layers.0.self_attn.v_proj.bias", "text_decoder.layers.0.self_attn.q_proj.weight", "text_decoder.layers.0.self_attn.q_proj.bias", "text_decoder.layers.0.self_attn.out_proj.weight", "text_decoder.layers.0.self_attn.out_proj.bias", "text_decoder.layers.0.self_attn_layer_norm.weight", "text_decoder.layers.0.self_attn_layer_norm.bias", "text_decoder.layers.0.cross_attention.k_proj.weight", "text_decoder.layers.0.cross_attention.k_proj.bias", "text_decoder.layers.0.cross_attention.v_proj.weight", "text_decoder.layers.0.cross_attention.v_proj.bias", "text_decoder.layers.0.cross_attention.q_proj.weight", "text_decoder.layers.0.cross_attention.q_proj.bias", "text_decoder.layers.0.cross_attention.out_proj.weight", "text_decoder.layers.0.cross_attention.out_proj.bias", "text_decoder.layers.0.cross_attention_layer_norm.weight", "text_decoder.layers.0.cross_attention_layer_norm.bias", "text_decoder.layers.0.ffn.fc1.weight", "text_decoder.layers.0.ffn.fc1.bias", "text_decoder.layers.0.ffn.fc2.weight", "text_decoder.layers.0.ffn.fc2.bias", "text_decoder.layers.0.ffn_layer_norm.weight", "text_decoder.layers.0.ffn_layer_norm.bias", "text_decoder.layers.1.self_attn.k_proj.weight", "text_decoder.layers.1.self_attn.k_proj.bias", "text_decoder.layers.1.self_attn.v_proj.weight", "text_decoder.layers.1.self_attn.v_proj.bias", "text_decoder.layers.1.self_attn.q_proj.weight", "text_decoder.layers.1.self_attn.q_proj.bias", "text_decoder.layers.1.self_attn.out_proj.weight", "text_decoder.layers.1.self_attn.out_proj.bias", "text_decoder.layers.1.self_attn_layer_norm.weight", 
"text_decoder.layers.1.self_attn_layer_norm.bias", "text_decoder.layers.1.cross_attention.k_proj.weight", "text_decoder.layers.1.cross_attention.k_proj.bias", "text_decoder.layers.1.cross_attention.v_proj.weight", "text_decoder.layers.1.cross_attention.v_proj.bias", "text_decoder.layers.1.cross_attention.q_proj.weight", "text_decoder.layers.1.cross_attention.q_proj.bias", "text_decoder.layers.1.cross_attention.out_proj.weight", "text_decoder.layers.1.cross_attention.out_proj.bias", "text_decoder.layers.1.cross_attention_layer_norm.weight", "text_decoder.layers.1.cross_attention_layer_norm.bias", "text_decoder.layers.1.ffn.fc1.weight", "text_decoder.layers.1.ffn.fc1.bias", "text_decoder.layers.1.ffn.fc2.weight", "text_decoder.layers.1.ffn.fc2.bias", "text_decoder.layers.1.ffn_layer_norm.weight", "text_decoder.layers.1.ffn_layer_norm.bias", "text_decoder.layers.2.self_attn.k_proj.weight", "text_decoder.layers.2.self_attn.k_proj.bias", "text_decoder.layers.2.self_attn.v_proj.weight", "text_decoder.layers.2.self_attn.v_proj.bias", "text_decoder.layers.2.self_attn.q_proj.weight", "text_decoder.layers.2.self_attn.q_proj.bias", "text_decoder.layers.2.self_attn.out_proj.weight", "text_decoder.layers.2.self_attn.out_proj.bias", "text_decoder.layers.2.self_attn_layer_norm.weight", "text_decoder.layers.2.self_attn_layer_norm.bias", "text_decoder.layers.2.cross_attention.k_proj.weight", "text_decoder.layers.2.cross_attention.k_proj.bias", "text_decoder.layers.2.cross_attention.v_proj.weight", "text_decoder.layers.2.cross_attention.v_proj.bias", "text_decoder.layers.2.cross_attention.q_proj.weight", "text_decoder.layers.2.cross_attention.q_proj.bias", "text_decoder.layers.2.cross_attention.out_proj.weight", "text_decoder.layers.2.cross_attention.out_proj.bias", "text_decoder.layers.2.cross_attention_layer_norm.weight", "text_decoder.layers.2.cross_attention_layer_norm.bias", "text_decoder.layers.2.ffn.fc1.weight", "text_decoder.layers.2.ffn.fc1.bias", 
"text_decoder.layers.2.ffn.fc2.weight", "text_decoder.layers.2.ffn.fc2.bias", "text_decoder.layers.2.ffn_layer_norm.weight", "text_decoder.layers.2.ffn_layer_norm.bias", "text_decoder.layers.3.self_attn.k_proj.weight", "text_decoder.layers.3.self_attn.k_proj.bias", "text_decoder.layers.3.self_attn.v_proj.weight", "text_decoder.layers.3.self_attn.v_proj.bias", "text_decoder.layers.3.self_attn.q_proj.weight", "text_decoder.layers.3.self_attn.q_proj.bias", "text_decoder.layers.3.self_attn.out_proj.weight", "text_decoder.layers.3.self_attn.out_proj.bias", "text_decoder.layers.3.self_attn_layer_norm.weight", "text_decoder.layers.3.self_attn_layer_norm.bias", "text_decoder.layers.3.cross_attention.k_proj.weight", "text_decoder.layers.3.cross_attention.k_proj.bias", "text_decoder.layers.3.cross_attention.v_proj.weight", "text_decoder.layers.3.cross_attention.v_proj.bias", "text_decoder.layers.3.cross_attention.q_proj.weight", "text_decoder.layers.3.cross_attention.q_proj.bias", "text_decoder.layers.3.cross_attention.out_proj.weight", "text_decoder.layers.3.cross_attention.out_proj.bias", "text_decoder.layers.3.cross_attention_layer_norm.weight", "text_decoder.layers.3.cross_attention_layer_norm.bias", "text_decoder.layers.3.ffn.fc1.weight", "text_decoder.layers.3.ffn.fc1.bias", "text_decoder.layers.3.ffn.fc2.weight", "text_decoder.layers.3.ffn.fc2.bias", "text_decoder.layers.3.ffn_layer_norm.weight", "text_decoder.layers.3.ffn_layer_norm.bias", "text_decoder.layers.4.self_attn.k_proj.weight", "text_decoder.layers.4.self_attn.k_proj.bias", "text_decoder.layers.4.self_attn.v_proj.weight", "text_decoder.layers.4.self_attn.v_proj.bias", "text_decoder.layers.4.self_attn.q_proj.weight", "text_decoder.layers.4.self_attn.q_proj.bias", "text_decoder.layers.4.self_attn.out_proj.weight", "text_decoder.layers.4.self_attn.out_proj.bias", "text_decoder.layers.4.self_attn_layer_norm.weight", "text_decoder.layers.4.self_attn_layer_norm.bias", 
"text_decoder.layers.4.cross_attention.k_proj.weight", "text_decoder.layers.4.cross_attention.k_proj.bias", "text_decoder.layers.4.cross_attention.v_proj.weight", "text_decoder.layers.4.cross_attention.v_proj.bias", "text_decoder.layers.4.cross_attention.q_proj.weight", "text_decoder.layers.4.cross_attention.q_proj.bias", "text_decoder.layers.4.cross_attention.out_proj.weight", "text_decoder.layers.4.cross_attention.out_proj.bias", "text_decoder.layers.4.cross_attention_layer_norm.weight", "text_decoder.layers.4.cross_attention_layer_norm.bias", "text_decoder.layers.4.ffn.fc1.weight", "text_decoder.layers.4.ffn.fc1.bias", "text_decoder.layers.4.ffn.fc2.weight", "text_decoder.layers.4.ffn.fc2.bias", "text_decoder.layers.4.ffn_layer_norm.weight", "text_decoder.layers.4.ffn_layer_norm.bias", "text_decoder.layers.5.self_attn.k_proj.weight", "text_decoder.layers.5.self_attn.k_proj.bias", "text_decoder.layers.5.self_attn.v_proj.weight", "text_decoder.layers.5.self_attn.v_proj.bias", "text_decoder.layers.5.self_attn.q_proj.weight", "text_decoder.layers.5.self_attn.q_proj.bias", "text_decoder.layers.5.self_attn.out_proj.weight", "text_decoder.layers.5.self_attn.out_proj.bias", "text_decoder.layers.5.self_attn_layer_norm.weight", "text_decoder.layers.5.self_attn_layer_norm.bias", "text_decoder.layers.5.cross_attention.k_proj.weight", "text_decoder.layers.5.cross_attention.k_proj.bias", "text_decoder.layers.5.cross_attention.v_proj.weight", "text_decoder.layers.5.cross_attention.v_proj.bias", "text_decoder.layers.5.cross_attention.q_proj.weight", "text_decoder.layers.5.cross_attention.q_proj.bias", "text_decoder.layers.5.cross_attention.out_proj.weight", "text_decoder.layers.5.cross_attention.out_proj.bias", "text_decoder.layers.5.cross_attention_layer_norm.weight", "text_decoder.layers.5.cross_attention_layer_norm.bias", "text_decoder.layers.5.ffn.fc1.weight", "text_decoder.layers.5.ffn.fc1.bias", "text_decoder.layers.5.ffn.fc2.weight", "text_decoder.layers.5.ffn.fc2.bias", 
"text_decoder.layers.5.ffn_layer_norm.weight", "text_decoder.layers.5.ffn_layer_norm.bias", "text_decoder.layers.6.self_attn.k_proj.weight", "text_decoder.layers.6.self_attn.k_proj.bias", "text_decoder.layers.6.self_attn.v_proj.weight", "text_decoder.layers.6.self_attn.v_proj.bias", "text_decoder.layers.6.self_attn.q_proj.weight", "text_decoder.layers.6.self_attn.q_proj.bias", "text_decoder.layers.6.self_attn.out_proj.weight", "text_decoder.layers.6.self_attn.out_proj.bias", "text_decoder.layers.6.self_attn_layer_norm.weight", "text_decoder.layers.6.self_attn_layer_norm.bias", "text_decoder.layers.6.cross_attention.k_proj.weight", "text_decoder.layers.6.cross_attention.k_proj.bias", "text_decoder.layers.6.cross_attention.v_proj.weight", "text_decoder.layers.6.cross_attention.v_proj.bias", "text_decoder.layers.6.cross_attention.q_proj.weight", "text_decoder.layers.6.cross_attention.q_proj.bias", "text_decoder.layers.6.cross_attention.out_proj.weight", "text_decoder.layers.6.cross_attention.out_proj.bias", "text_decoder.layers.6.cross_attention_layer_norm.weight", "text_decoder.layers.6.cross_attention_layer_norm.bias", "text_decoder.layers.6.ffn.fc1.weight", "text_decoder.layers.6.ffn.fc1.bias", "text_decoder.layers.6.ffn.fc2.weight", "text_decoder.layers.6.ffn.fc2.bias", "text_decoder.layers.6.ffn_layer_norm.weight", "text_decoder.layers.6.ffn_layer_norm.bias", "text_decoder.layers.7.self_attn.k_proj.weight", "text_decoder.layers.7.self_attn.k_proj.bias", "text_decoder.layers.7.self_attn.v_proj.weight", "text_decoder.layers.7.self_attn.v_proj.bias", "text_decoder.layers.7.self_attn.q_proj.weight", "text_decoder.layers.7.self_attn.q_proj.bias", "text_decoder.layers.7.self_attn.out_proj.weight", "text_decoder.layers.7.self_attn.out_proj.bias", "text_decoder.layers.7.self_attn_layer_norm.weight", "text_decoder.layers.7.self_attn_layer_norm.bias", "text_decoder.layers.7.cross_attention.k_proj.weight", "text_decoder.layers.7.cross_attention.k_proj.bias", 
"text_decoder.layers.7.cross_attention.v_proj.weight", "text_decoder.layers.7.cross_attention.v_proj.bias", "text_decoder.layers.7.cross_attention.q_proj.weight", "text_decoder.layers.7.cross_attention.q_proj.bias", "text_decoder.layers.7.cross_attention.out_proj.weight", "text_decoder.layers.7.cross_attention.out_proj.bias", "text_decoder.layers.7.cross_attention_layer_norm.weight", "text_decoder.layers.7.cross_attention_layer_norm.bias", "text_decoder.layers.7.ffn.fc1.weight", "text_decoder.layers.7.ffn.fc1.bias", "text_decoder.layers.7.ffn.fc2.weight", "text_decoder.layers.7.ffn.fc2.bias", "text_decoder.layers.7.ffn_layer_norm.weight", "text_decoder.layers.7.ffn_layer_norm.bias", "text_decoder.layers.8.self_attn.k_proj.weight", "text_decoder.layers.8.self_attn.k_proj.bias", "text_decoder.layers.8.self_attn.v_proj.weight", "text_decoder.layers.8.self_attn.v_proj.bias", "text_decoder.layers.8.self_attn.q_proj.weight", "text_decoder.layers.8.self_attn.q_proj.bias", "text_decoder.layers.8.self_attn.out_proj.weight", "text_decoder.layers.8.self_attn.out_proj.bias", "text_decoder.layers.8.self_attn_layer_norm.weight", "text_decoder.layers.8.self_attn_layer_norm.bias", "text_decoder.layers.8.cross_attention.k_proj.weight", "text_decoder.layers.8.cross_attention.k_proj.bias", "text_decoder.layers.8.cross_attention.v_proj.weight", "text_decoder.layers.8.cross_attention.v_proj.bias", "text_decoder.layers.8.cross_attention.q_proj.weight", "text_decoder.layers.8.cross_attention.q_proj.bias", "text_decoder.layers.8.cross_attention.out_proj.weight", "text_decoder.layers.8.cross_attention.out_proj.bias", "text_decoder.layers.8.cross_attention_layer_norm.weight", "text_decoder.layers.8.cross_attention_layer_norm.bias", "text_decoder.layers.8.ffn.fc1.weight", "text_decoder.layers.8.ffn.fc1.bias", "text_decoder.layers.8.ffn.fc2.weight", "text_decoder.layers.8.ffn.fc2.bias", "text_decoder.layers.8.ffn_layer_norm.weight", "text_decoder.layers.8.ffn_layer_norm.bias", 
"text_decoder.layers.9.self_attn.k_proj.weight", "text_decoder.layers.9.self_attn.k_proj.bias", "text_decoder.layers.9.self_attn.v_proj.weight", "text_decoder.layers.9.self_attn.v_proj.bias", "text_decoder.layers.9.self_attn.q_proj.weight", "text_decoder.layers.9.self_attn.q_proj.bias", "text_decoder.layers.9.self_attn.out_proj.weight", "text_decoder.layers.9.self_attn.out_proj.bias", "text_decoder.layers.9.self_attn_layer_norm.weight", "text_decoder.layers.9.self_attn_layer_norm.bias", "text_decoder.layers.9.cross_attention.k_proj.weight", "text_decoder.layers.9.cross_attention.k_proj.bias", "text_decoder.layers.9.cross_attention.v_proj.weight", "text_decoder.layers.9.cross_attention.v_proj.bias", "text_decoder.layers.9.cross_attention.q_proj.weight", "text_decoder.layers.9.cross_attention.q_proj.bias", "text_decoder.layers.9.cross_attention.out_proj.weight", "text_decoder.layers.9.cross_attention.out_proj.bias", "text_decoder.layers.9.cross_attention_layer_norm.weight", "text_decoder.layers.9.cross_attention_layer_norm.bias", "text_decoder.layers.9.ffn.fc1.weight", "text_decoder.layers.9.ffn.fc1.bias", "text_decoder.layers.9.ffn.fc2.weight", "text_decoder.layers.9.ffn.fc2.bias", "text_decoder.layers.9.ffn_layer_norm.weight", "text_decoder.layers.9.ffn_layer_norm.bias", "text_decoder.layers.10.self_attn.k_proj.weight", "text_decoder.layers.10.self_attn.k_proj.bias", "text_decoder.layers.10.self_attn.v_proj.weight", "text_decoder.layers.10.self_attn.v_proj.bias", "text_decoder.layers.10.self_attn.q_proj.weight", "text_decoder.layers.10.self_attn.q_proj.bias", "text_decoder.layers.10.self_attn.out_proj.weight", "text_decoder.layers.10.self_attn.out_proj.bias", "text_decoder.layers.10.self_attn_layer_norm.weight", "text_decoder.layers.10.self_attn_layer_norm.bias", "text_decoder.layers.10.cross_attention.k_proj.weight", "text_decoder.layers.10.cross_attention.k_proj.bias", "text_decoder.layers.10.cross_attention.v_proj.weight", 
"text_decoder.layers.10.cross_attention.v_proj.bias", "text_decoder.layers.10.cross_attention.q_proj.weight", "text_decoder.layers.10.cross_attention.q_proj.bias", "text_decoder.layers.10.cross_attention.out_proj.weight", "text_decoder.layers.10.cross_attention.out_proj.bias", "text_decoder.layers.10.cross_attention_layer_norm.weight", "text_decoder.layers.10.cross_attention_layer_norm.bias", "text_decoder.layers.10.ffn.fc1.weight", "text_decoder.layers.10.ffn.fc1.bias", "text_decoder.layers.10.ffn.fc2.weight", "text_decoder.layers.10.ffn.fc2.bias", "text_decoder.layers.10.ffn_layer_norm.weight", "text_decoder.layers.10.ffn_layer_norm.bias", "text_decoder.layers.11.self_attn.k_proj.weight", "text_decoder.layers.11.self_attn.k_proj.bias", "text_decoder.layers.11.self_attn.v_proj.weight", "text_decoder.layers.11.self_attn.v_proj.bias", "text_decoder.layers.11.self_attn.q_proj.weight", "text_decoder.layers.11.self_attn.q_proj.bias", "text_decoder.layers.11.self_attn.out_proj.weight", "text_decoder.layers.11.self_attn.out_proj.bias", "text_decoder.layers.11.self_attn_layer_norm.weight", "text_decoder.layers.11.self_attn_layer_norm.bias", "text_decoder.layers.11.cross_attention.k_proj.weight", "text_decoder.layers.11.cross_attention.k_proj.bias", "text_decoder.layers.11.cross_attention.v_proj.weight", "text_decoder.layers.11.cross_attention.v_proj.bias", "text_decoder.layers.11.cross_attention.q_proj.weight", "text_decoder.layers.11.cross_attention.q_proj.bias", "text_decoder.layers.11.cross_attention.out_proj.weight", "text_decoder.layers.11.cross_attention.out_proj.bias", "text_decoder.layers.11.cross_attention_layer_norm.weight", "text_decoder.layers.11.cross_attention_layer_norm.bias", "text_decoder.layers.11.ffn.fc1.weight", "text_decoder.layers.11.ffn.fc1.bias", "text_decoder.layers.11.ffn.fc2.weight", "text_decoder.layers.11.ffn.fc2.bias", "text_decoder.layers.11.ffn_layer_norm.weight", "text_decoder.layers.11.ffn_layer_norm.bias", 
"text_decoder.layer_norm.weight", "text_decoder.layer_norm.bias", "lm_head.weight", "t2u_model.model.encoder.layers.0.self_attn.k_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.k_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.v_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.v_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.q_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.q_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.out_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.out_proj.bias", "t2u_model.model.encoder.layers.0.self_attn_layer_norm.weight", "t2u_model.model.encoder.layers.0.self_attn_layer_norm.bias", "t2u_model.model.encoder.layers.0.ffn.fc1.weight", "t2u_model.model.encoder.layers.0.ffn.fc1.bias", "t2u_model.model.encoder.layers.0.ffn.fc2.weight", "t2u_model.model.encoder.layers.0.ffn.fc2.bias", "t2u_model.model.encoder.layers.0.ffn_layer_norm.weight", "t2u_model.model.encoder.layers.0.ffn_layer_norm.bias", "t2u_model.model.encoder.layers.1.self_attn.k_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.k_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.v_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.v_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.q_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.q_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.out_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.out_proj.bias", "t2u_model.model.encoder.layers.1.self_attn_layer_norm.weight", "t2u_model.model.encoder.layers.1.self_attn_layer_norm.bias", "t2u_model.model.encoder.layers.1.ffn.fc1.weight", "t2u_model.model.encoder.layers.1.ffn.fc1.bias", "t2u_model.model.encoder.layers.1.ffn.fc2.weight", "t2u_model.model.encoder.layers.1.ffn.fc2.bias", "t2u_model.model.encoder.layers.1.ffn_layer_norm.weight", "t2u_model.model.encoder.layers.1.ffn_layer_norm.bias", 
...
Unexpected key(s) in state_dict: "model_name", "model".

Expected behavior

Expected to load the new fine-tuned model and then save it to a new model file.
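
The "Unexpected key(s) in state_dict: "model_name", "model"" line suggests that the .pt file may hold a wrapper dictionary rather than a bare state dict, with the weights nested under "model". The sketch below is one way to check that, under stated assumptions: `unwrap_state_dict` and `compare_keys` are hypothetical helpers written for illustration, not part of torch or transformers, and the toy dictionary stands in for whatever `torch.load` returns. The log does not show whether the nested keys use the same names as the Hugging Face model.

```python
from collections.abc import Mapping


def unwrap_state_dict(checkpoint, wrapper_key="model"):
    """Return the nested mapping under `wrapper_key` if present, else the input unchanged."""
    if isinstance(checkpoint, Mapping) and isinstance(checkpoint.get(wrapper_key), Mapping):
        return checkpoint[wrapper_key]
    return checkpoint


def compare_keys(expected_keys, state_dict):
    """Return (missing, unexpected) key lists, similar in spirit to the load_state_dict report."""
    expected = set(expected_keys)
    found = set(state_dict)
    return sorted(expected - found), sorted(found - expected)


# Toy stand-in for the object returned by torch.load(...); real values would be tensors.
checkpoint = {
    "model_name": "example",
    "model": {"lm_head.weight": 0, "text_decoder.layer_norm.bias": 0},
}
# Toy stand-in for model_seam.state_dict().keys().
expected = [
    "lm_head.weight",
    "text_decoder.layer_norm.weight",
    "text_decoder.layer_norm.bias",
]

state_dict = unwrap_state_dict(checkpoint)
missing, unexpected = compare_keys(expected, state_dict)
print("missing:", missing)
print("unexpected:", unexpected)
```

In the script from the Reproduction section, the equivalent step would be passing `unwrap_state_dict(new_model)` to `model_seam.load_state_dict`. If many keys are still reported as missing after unwrapping, the checkpoint was probably saved with a different parameter naming scheme and would need a key conversion step first.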

huggingface / transformers

Cannot Load .pt model #31829
