huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot Load .pt model #31829

Open ivanhe123 opened 2 weeks ago

ivanhe123 commented 2 weeks ago

System Info

Python version 3.11

Who can help?

No response

Reproduction

  1. Fine-tuned the model using https://www.kaggle.com/code/chlorinecl/notebook4101d69eb6
  2. Downloaded the .pt model and loaded it using:
    import torch
    from transformers import AutoProcessor, SeamlessM4TModel
    new_model = torch.load("./expt4_m4tM.pt")
    processor = AutoProcessor.from_pretrained("seamless-m4t-medium")
    model_seam = SeamlessM4TModel.from_pretrained("seamless-m4t-medium")
    model_seam.load_state_dict(new_model)
    model_seam.save_pretrained("./new_seamless-m4t-medium")
  3. The script fails with the following output:
    D:\projects\GNNNER\venv\Lib\site-packages\transformers\deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
      warnings.warn(
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Traceback (most recent call last):
      File "D:\projects\GNNNER\convert_bin_to_pt.py", line 6, in <module>
        model_seam.load_state_dict(new_model)
      File "D:\projects\GNNNER\venv\Lib\site-packages\torch\nn\modules\module.py", line 2189, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for SeamlessM4TModel:
    Missing key(s) in state_dict: "shared.weight", "text_encoder.embed_tokens.weight", "text_encoder.layers.0.self_attn.k_proj.weight", "text_encoder.layers.0.self_attn.k_proj.bias", "text_encoder.layers.0.self_attn.v_proj.weight", "text_encoder.layers.0.self_attn.v_proj.bias", "text_encoder.layers.0.self_attn.q_proj.weight", "text_encoder.layers.0.self_attn.q_proj.bias", "text_encoder.layers.0.self_attn.out_proj.weight", "text_encoder.layers.0.self_attn.out_proj.bias", "text_encoder.layers.0.self_attn_layer_norm.weight", "text_encoder.layers.0.self_attn_layer_norm.bias", "text_encoder.layers.0.ffn.fc1.weight", "text_encoder.layers.0.ffn.fc1.bias", "text_encoder.layers.0.ffn.fc2.weight", "text_encoder.layers.0.ffn.fc2.bias", "text_encoder.layers.0.ffn_layer_norm.weight", "text_encoder.layers.0.ffn_layer_norm.bias", "text_encoder.layers.1.self_attn.k_proj.weight", "text_encoder.layers.1.self_attn.k_proj.bias", "text_encoder.layers.1.self_attn.v_proj.weight", "text_encoder.layers.1.self_attn.v_proj.bias", "text_encoder.layers.1.self_attn.q_proj.weight", "text_encoder.layers.1.self_attn.q_proj.bias", "text_encoder.layers.1.self_attn.out_proj.weight", "text_encoder.layers.1.self_attn.out_proj.bias", "text_encoder.layers.1.self_attn_layer_norm.weight", "text_encoder.layers.1.self_attn_layer_norm.bias", "text_encoder.layers.1.ffn.fc1.weight", "text_encoder.layers.1.ffn.fc1.bias", "text_encoder.layers.1.ffn.fc2.weight", "text_encoder.layers.1.ffn.fc2.bias", "text_encoder.layers.1.ffn_layer_norm.weight", "text_encoder.layers.1.ffn_layer_norm.bias", "text_encoder.layers.2.self_attn.k_proj.weight", "text_encoder.layers.2.self_attn.k_proj.bias", "text_encoder.layers.2.self_attn.v_proj.weight", "text_encoder.layers.2.self_attn.v_proj.bias", "text_encoder.layers.2.self_attn.q_proj.weight", "text_encoder.layers.2.self_attn.q_proj.bias", "text_encoder.layers.2.self_attn.out_proj.weight", "text_encoder.layers.2.self_attn.out_proj.bias", 
"text_encoder.layers.2.self_attn_layer_norm.weight", "text_encoder.layers.2.self_attn_layer_norm.bias", "text_encoder.layers.2.ffn.fc1.weight", "text_encoder.layers.2.ffn.fc1.bias", "text_encoder.layers.2.ffn.fc2.weight", "text_encoder.layers.2.ffn.fc2.bias", "text_encoder.layers.2.ffn_layer_norm.weight", "text_encoder.layers.2.ffn_layer_norm.bias", "text_encoder.layers.3.self_attn.k_proj.weight", "text_encoder.layers.3.self_attn.k_proj.bias", "text_encoder.layers.3.self_attn.v_proj.weight", "text_encoder.layers.3.self_attn.v_proj.bias", "text_encoder.layers.3.self_attn.q_proj.weight", "text_encoder.layers.3.self_attn.q_proj.bias", "text_encoder.layers.3.self_attn.out_proj.weight", "text_encoder.layers.3.self_attn.out_proj.bias", "text_encoder.layers.3.self_attn_layer_norm.weight", "text_encoder.layers.3.self_attn_layer_norm.bias", "text_encoder.layers.3.ffn.fc1.weight", "text_encoder.layers.3.ffn.fc1.bias", "text_encoder.layers.3.ffn.fc2.weight", "text_encoder.layers.3.ffn.fc2.bias", "text_encoder.layers.3.ffn_layer_norm.weight", "text_encoder.layers.3.ffn_layer_norm.bias", "text_encoder.layers.4.self_attn.k_proj.weight", "text_encoder.layers.4.self_attn.k_proj.bias", "text_encoder.layers.4.self_attn.v_proj.weight", "text_encoder.layers.4.self_attn.v_proj.bias", "text_encoder.layers.4.self_attn.q_proj.weight", "text_encoder.layers.4.self_attn.q_proj.bias", "text_encoder.layers.4.self_attn.out_proj.weight", "text_encoder.layers.4.self_attn.out_proj.bias", "text_encoder.layers.4.self_attn_layer_norm.weight", "text_encoder.layers.4.self_attn_layer_norm.bias", "text_encoder.layers.4.ffn.fc1.weight", "text_encoder.layers.4.ffn.fc1.bias", "text_encoder.layers.4.ffn.fc2.weight", "text_encoder.layers.4.ffn.fc2.bias", "text_encoder.layers.4.ffn_layer_norm.weight", "text_encoder.layers.4.ffn_layer_norm.bias", "text_encoder.layers.5.self_attn.k_proj.weight", "text_encoder.layers.5.self_attn.k_proj.bias", "text_encoder.layers.5.self_attn.v_proj.weight", 
"text_encoder.layers.5.self_attn.v_proj.bias", "text_encoder.layers.5.self_attn.q_proj.weight", "text_encoder.layers.5.self_attn.q_proj.bias", "text_encoder.layers.5.self_attn.out_proj.weight", "text_encoder.layers.5.self_attn.out_proj.bias", "text_encoder.layers.5.self_attn_layer_norm.weight", "text_encoder.layers.5.self_attn_layer_norm.bias", "text_encoder.layers.5.ffn.fc1.weight", "text_encoder.layers.5.ffn.fc1.bias", "text_encoder.layers.5.ffn.fc2.weight", "text_encoder.layers.5.ffn.fc2.bias", "text_encoder.layers.5.ffn_layer_norm.weight", "text_encoder.layers.5.ffn_layer_norm.bias", "text_encoder.layers.6.self_attn.k_proj.weight", "text_encoder.layers.6.self_attn.k_proj.bias", "text_encoder.layers.6.self_attn.v_proj.weight", "text_encoder.layers.6.self_attn.v_proj.bias", "text_encoder.layers.6.self_attn.q_proj.weight", "text_encoder.layers.6.self_attn.q_proj.bias", "text_encoder.layers.6.self_attn.out_proj.weight", "text_encoder.layers.6.self_attn.out_proj.bias", "text_encoder.layers.6.self_attn_layer_norm.weight", "text_encoder.layers.6.self_attn_layer_norm.bias", "text_encoder.layers.6.ffn.fc1.weight", "text_encoder.layers.6.ffn.fc1.bias", "text_encoder.layers.6.ffn.fc2.weight", "text_encoder.layers.6.ffn.fc2.bias", "text_encoder.layers.6.ffn_layer_norm.weight", "text_encoder.layers.6.ffn_layer_norm.bias", "text_encoder.layers.7.self_attn.k_proj.weight", "text_encoder.layers.7.self_attn.k_proj.bias", "text_encoder.layers.7.self_attn.v_proj.weight", "text_encoder.layers.7.self_attn.v_proj.bias", "text_encoder.layers.7.self_attn.q_proj.weight", "text_encoder.layers.7.self_attn.q_proj.bias", "text_encoder.layers.7.self_attn.out_proj.weight", "text_encoder.layers.7.self_attn.out_proj.bias", "text_encoder.layers.7.self_attn_layer_norm.weight", "text_encoder.layers.7.self_attn_layer_norm.bias", "text_encoder.layers.7.ffn.fc1.weight", "text_encoder.layers.7.ffn.fc1.bias", "text_encoder.layers.7.ffn.fc2.weight", "text_encoder.layers.7.ffn.fc2.bias", 
"text_encoder.layers.7.ffn_layer_norm.weight", "text_encoder.layers.7.ffn_layer_norm.bias", "text_encoder.layers.8.self_attn.k_proj.weight", "text_encoder.layers.8.self_attn.k_proj.bias", "text_encoder.layers.8.self_attn.v_proj.weight", "text_encoder.layers.8.self_attn.v_proj.bias", "text_encoder.layers.8.self_attn.q_proj.weight", "text_encoder.layers.8.self_attn.q_proj.bias", "text_encoder.layers.8.self_attn.out_proj.weight", "text_encoder.layers.8.self_attn.out_proj.bias", "text_encoder.layers.8.self_attn_layer_norm.weight", "text_encoder.layers.8.self_attn_layer_norm.bias", "text_encoder.layers.8.ffn.fc1.weight", "text_encoder.layers.8.ffn.fc1.bias", "text_encoder.layers.8.ffn.fc2.weight", "text_encoder.layers.8.ffn.fc2.bias", "text_encoder.layers.8.ffn_layer_norm.weight", "text_encoder.layers.8.ffn_layer_norm.bias", "text_encoder.layers.9.self_attn.k_proj.weight", "text_encoder.layers.9.self_attn.k_proj.bias", "text_encoder.layers.9.self_attn.v_proj.weight", "text_encoder.layers.9.self_attn.v_proj.bias", "text_encoder.layers.9.self_attn.q_proj.weight", "text_encoder.layers.9.self_attn.q_proj.bias", "text_encoder.layers.9.self_attn.out_proj.weight", "text_encoder.layers.9.self_attn.out_proj.bias", "text_encoder.layers.9.self_attn_layer_norm.weight", "text_encoder.layers.9.self_attn_layer_norm.bias", "text_encoder.layers.9.ffn.fc1.weight", "text_encoder.layers.9.ffn.fc1.bias", "text_encoder.layers.9.ffn.fc2.weight", "text_encoder.layers.9.ffn.fc2.bias", "text_encoder.layers.9.ffn_layer_norm.weight", "text_encoder.layers.9.ffn_layer_norm.bias", "text_encoder.layers.10.self_attn.k_proj.weight", "text_encoder.layers.10.self_attn.k_proj.bias", "text_encoder.layers.10.self_attn.v_proj.weight", "text_encoder.layers.10.self_attn.v_proj.bias", "text_encoder.layers.10.self_attn.q_proj.weight", "text_encoder.layers.10.self_attn.q_proj.bias", "text_encoder.layers.10.self_attn.out_proj.weight", "text_encoder.layers.10.self_attn.out_proj.bias", 
"text_encoder.layers.10.self_attn_layer_norm.weight", "text_encoder.layers.10.self_attn_layer_norm.bias", "text_encoder.layers.10.ffn.fc1.weight", "text_encoder.layers.10.ffn.fc1.bias", "text_encoder.layers.10.ffn.fc2.weight", "text_encoder.layers.10.ffn.fc2.bias", "text_encoder.layers.10.ffn_layer_norm.weight", "text_encoder.layers.10.ffn_layer_norm.bias", "text_encoder.layers.11.self_attn.k_proj.weight", "text_encoder.layers.11.self_attn.k_proj.bias", "text_encoder.layers.11.self_attn.v_proj.weight", "text_encoder.layers.11.self_attn.v_proj.bias", "text_encoder.layers.11.self_attn.q_proj.weight", "text_encoder.layers.11.self_attn.q_proj.bias", "text_encoder.layers.11.self_attn.out_proj.weight", "text_encoder.layers.11.self_attn.out_proj.bias", "text_encoder.layers.11.self_attn_layer_norm.weight", "text_encoder.layers.11.self_attn_layer_norm.bias", "text_encoder.layers.11.ffn.fc1.weight", "text_encoder.layers.11.ffn.fc1.bias", "text_encoder.layers.11.ffn.fc2.weight", "text_encoder.layers.11.ffn.fc2.bias", "text_encoder.layers.11.ffn_layer_norm.weight", "text_encoder.layers.11.ffn_layer_norm.bias", "text_encoder.layer_norm.weight", "text_encoder.layer_norm.bias", "speech_encoder.feature_projection.layer_norm.weight", "speech_encoder.feature_projection.layer_norm.bias", "speech_encoder.feature_projection.projection.weight", "speech_encoder.feature_projection.projection.bias", "speech_encoder.encoder.layers.0.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.0.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.0.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.0.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.0.ffn1.output_dense.weight", "speech_encoder.encoder.layers.0.ffn1.output_dense.bias", "speech_encoder.encoder.layers.0.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.0.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.0.self_attn.pos_bias_u", "speech_encoder.encoder.layers.0.self_attn.pos_bias_v", 
"speech_encoder.encoder.layers.0.self_attn.linear_q.weight", "speech_encoder.encoder.layers.0.self_attn.linear_q.bias", "speech_encoder.encoder.layers.0.self_attn.linear_k.weight", "speech_encoder.encoder.layers.0.self_attn.linear_k.bias", "speech_encoder.encoder.layers.0.self_attn.linear_v.weight", "speech_encoder.encoder.layers.0.self_attn.linear_v.bias", "speech_encoder.encoder.layers.0.self_attn.linear_out.weight", "speech_encoder.encoder.layers.0.self_attn.linear_out.bias", "speech_encoder.encoder.layers.0.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.0.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.0.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.0.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.0.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.0.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.0.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.0.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.0.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.0.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.0.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.0.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.0.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.0.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.0.ffn2.output_dense.weight", "speech_encoder.encoder.layers.0.ffn2.output_dense.bias", "speech_encoder.encoder.layers.0.final_layer_norm.weight", "speech_encoder.encoder.layers.0.final_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.1.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.1.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.1.ffn1.output_dense.weight", "speech_encoder.encoder.layers.1.ffn1.output_dense.bias", 
"speech_encoder.encoder.layers.1.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.1.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.1.self_attn.pos_bias_u", "speech_encoder.encoder.layers.1.self_attn.pos_bias_v", "speech_encoder.encoder.layers.1.self_attn.linear_q.weight", "speech_encoder.encoder.layers.1.self_attn.linear_q.bias", "speech_encoder.encoder.layers.1.self_attn.linear_k.weight", "speech_encoder.encoder.layers.1.self_attn.linear_k.bias", "speech_encoder.encoder.layers.1.self_attn.linear_v.weight", "speech_encoder.encoder.layers.1.self_attn.linear_v.bias", "speech_encoder.encoder.layers.1.self_attn.linear_out.weight", "speech_encoder.encoder.layers.1.self_attn.linear_out.bias", "speech_encoder.encoder.layers.1.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.1.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.1.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.1.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.1.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.1.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.1.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.1.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.1.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.1.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.1.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.1.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.1.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.1.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.1.ffn2.output_dense.weight", "speech_encoder.encoder.layers.1.ffn2.output_dense.bias", "speech_encoder.encoder.layers.1.final_layer_norm.weight", "speech_encoder.encoder.layers.1.final_layer_norm.bias", "speech_encoder.encoder.layers.2.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.2.ffn1_layer_norm.bias", 
"speech_encoder.encoder.layers.2.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.2.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.2.ffn1.output_dense.weight", "speech_encoder.encoder.layers.2.ffn1.output_dense.bias", "speech_encoder.encoder.layers.2.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.2.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.2.self_attn.pos_bias_u", "speech_encoder.encoder.layers.2.self_attn.pos_bias_v", "speech_encoder.encoder.layers.2.self_attn.linear_q.weight", "speech_encoder.encoder.layers.2.self_attn.linear_q.bias", "speech_encoder.encoder.layers.2.self_attn.linear_k.weight", "speech_encoder.encoder.layers.2.self_attn.linear_k.bias", "speech_encoder.encoder.layers.2.self_attn.linear_v.weight", "speech_encoder.encoder.layers.2.self_attn.linear_v.bias", "speech_encoder.encoder.layers.2.self_attn.linear_out.weight", "speech_encoder.encoder.layers.2.self_attn.linear_out.bias", "speech_encoder.encoder.layers.2.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.2.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.2.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.2.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.2.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.2.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.2.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.2.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.2.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.2.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.2.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.2.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.2.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.2.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.2.ffn2.output_dense.weight", "speech_encoder.encoder.layers.2.ffn2.output_dense.bias", 
"speech_encoder.encoder.layers.2.final_layer_norm.weight", "speech_encoder.encoder.layers.2.final_layer_norm.bias", "speech_encoder.encoder.layers.3.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.3.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.3.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.3.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.3.ffn1.output_dense.weight", "speech_encoder.encoder.layers.3.ffn1.output_dense.bias", "speech_encoder.encoder.layers.3.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.3.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.3.self_attn.pos_bias_u", "speech_encoder.encoder.layers.3.self_attn.pos_bias_v", "speech_encoder.encoder.layers.3.self_attn.linear_q.weight", "speech_encoder.encoder.layers.3.self_attn.linear_q.bias", "speech_encoder.encoder.layers.3.self_attn.linear_k.weight", "speech_encoder.encoder.layers.3.self_attn.linear_k.bias", "speech_encoder.encoder.layers.3.self_attn.linear_v.weight", "speech_encoder.encoder.layers.3.self_attn.linear_v.bias", "speech_encoder.encoder.layers.3.self_attn.linear_out.weight", "speech_encoder.encoder.layers.3.self_attn.linear_out.bias", "speech_encoder.encoder.layers.3.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.3.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.3.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.3.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.3.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.3.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.3.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.3.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.3.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.3.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.3.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.3.ffn2_layer_norm.bias", 
"speech_encoder.encoder.layers.3.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.3.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.3.ffn2.output_dense.weight", "speech_encoder.encoder.layers.3.ffn2.output_dense.bias", "speech_encoder.encoder.layers.3.final_layer_norm.weight", "speech_encoder.encoder.layers.3.final_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.4.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.4.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.4.ffn1.output_dense.weight", "speech_encoder.encoder.layers.4.ffn1.output_dense.bias", "speech_encoder.encoder.layers.4.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.4.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.4.self_attn.pos_bias_u", "speech_encoder.encoder.layers.4.self_attn.pos_bias_v", "speech_encoder.encoder.layers.4.self_attn.linear_q.weight", "speech_encoder.encoder.layers.4.self_attn.linear_q.bias", "speech_encoder.encoder.layers.4.self_attn.linear_k.weight", "speech_encoder.encoder.layers.4.self_attn.linear_k.bias", "speech_encoder.encoder.layers.4.self_attn.linear_v.weight", "speech_encoder.encoder.layers.4.self_attn.linear_v.bias", "speech_encoder.encoder.layers.4.self_attn.linear_out.weight", "speech_encoder.encoder.layers.4.self_attn.linear_out.bias", "speech_encoder.encoder.layers.4.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.4.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.4.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.4.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.4.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.4.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.4.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.4.conv_module.batch_norm.running_mean", 
"speech_encoder.encoder.layers.4.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.4.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.4.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.4.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.4.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.4.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.4.ffn2.output_dense.weight", "speech_encoder.encoder.layers.4.ffn2.output_dense.bias", "speech_encoder.encoder.layers.4.final_layer_norm.weight", "speech_encoder.encoder.layers.4.final_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.5.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.5.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.5.ffn1.output_dense.weight", "speech_encoder.encoder.layers.5.ffn1.output_dense.bias", "speech_encoder.encoder.layers.5.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.5.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.5.self_attn.pos_bias_u", "speech_encoder.encoder.layers.5.self_attn.pos_bias_v", "speech_encoder.encoder.layers.5.self_attn.linear_q.weight", "speech_encoder.encoder.layers.5.self_attn.linear_q.bias", "speech_encoder.encoder.layers.5.self_attn.linear_k.weight", "speech_encoder.encoder.layers.5.self_attn.linear_k.bias", "speech_encoder.encoder.layers.5.self_attn.linear_v.weight", "speech_encoder.encoder.layers.5.self_attn.linear_v.bias", "speech_encoder.encoder.layers.5.self_attn.linear_out.weight", "speech_encoder.encoder.layers.5.self_attn.linear_out.bias", "speech_encoder.encoder.layers.5.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.5.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.5.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.5.conv_module.pointwise_conv1.weight", 
"speech_encoder.encoder.layers.5.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.5.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.5.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.5.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.5.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.5.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.5.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.5.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.5.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.5.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.5.ffn2.output_dense.weight", "speech_encoder.encoder.layers.5.ffn2.output_dense.bias", "speech_encoder.encoder.layers.5.final_layer_norm.weight", "speech_encoder.encoder.layers.5.final_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.6.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.6.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.6.ffn1.output_dense.weight", "speech_encoder.encoder.layers.6.ffn1.output_dense.bias", "speech_encoder.encoder.layers.6.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.6.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.6.self_attn.pos_bias_u", "speech_encoder.encoder.layers.6.self_attn.pos_bias_v", "speech_encoder.encoder.layers.6.self_attn.linear_q.weight", "speech_encoder.encoder.layers.6.self_attn.linear_q.bias", "speech_encoder.encoder.layers.6.self_attn.linear_k.weight", "speech_encoder.encoder.layers.6.self_attn.linear_k.bias", "speech_encoder.encoder.layers.6.self_attn.linear_v.weight", "speech_encoder.encoder.layers.6.self_attn.linear_v.bias", "speech_encoder.encoder.layers.6.self_attn.linear_out.weight", "speech_encoder.encoder.layers.6.self_attn.linear_out.bias", 
"speech_encoder.encoder.layers.6.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.6.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.6.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.6.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.6.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.6.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.6.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.6.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.6.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.6.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.6.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.6.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.6.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.6.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.6.ffn2.output_dense.weight", "speech_encoder.encoder.layers.6.ffn2.output_dense.bias", "speech_encoder.encoder.layers.6.final_layer_norm.weight", "speech_encoder.encoder.layers.6.final_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.7.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.7.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.7.ffn1.output_dense.weight", "speech_encoder.encoder.layers.7.ffn1.output_dense.bias", "speech_encoder.encoder.layers.7.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.7.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.7.self_attn.pos_bias_u", "speech_encoder.encoder.layers.7.self_attn.pos_bias_v", "speech_encoder.encoder.layers.7.self_attn.linear_q.weight", "speech_encoder.encoder.layers.7.self_attn.linear_q.bias", "speech_encoder.encoder.layers.7.self_attn.linear_k.weight", "speech_encoder.encoder.layers.7.self_attn.linear_k.bias", 
"speech_encoder.encoder.layers.7.self_attn.linear_v.weight", "speech_encoder.encoder.layers.7.self_attn.linear_v.bias", "speech_encoder.encoder.layers.7.self_attn.linear_out.weight", "speech_encoder.encoder.layers.7.self_attn.linear_out.bias", "speech_encoder.encoder.layers.7.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.7.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.7.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.7.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.7.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.7.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.7.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.7.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.7.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.7.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.7.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.7.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.7.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.7.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.7.ffn2.output_dense.weight", "speech_encoder.encoder.layers.7.ffn2.output_dense.bias", "speech_encoder.encoder.layers.7.final_layer_norm.weight", "speech_encoder.encoder.layers.7.final_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.8.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.8.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.8.ffn1.output_dense.weight", "speech_encoder.encoder.layers.8.ffn1.output_dense.bias", "speech_encoder.encoder.layers.8.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.8.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.8.self_attn.pos_bias_u", "speech_encoder.encoder.layers.8.self_attn.pos_bias_v", 
"speech_encoder.encoder.layers.8.self_attn.linear_q.weight", "speech_encoder.encoder.layers.8.self_attn.linear_q.bias", "speech_encoder.encoder.layers.8.self_attn.linear_k.weight", "speech_encoder.encoder.layers.8.self_attn.linear_k.bias", "speech_encoder.encoder.layers.8.self_attn.linear_v.weight", "speech_encoder.encoder.layers.8.self_attn.linear_v.bias", "speech_encoder.encoder.layers.8.self_attn.linear_out.weight", "speech_encoder.encoder.layers.8.self_attn.linear_out.bias", "speech_encoder.encoder.layers.8.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.8.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.8.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.8.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.8.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.8.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.8.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.8.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.8.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.8.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.8.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.8.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.8.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.8.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.8.ffn2.output_dense.weight", "speech_encoder.encoder.layers.8.ffn2.output_dense.bias", "speech_encoder.encoder.layers.8.final_layer_norm.weight", "speech_encoder.encoder.layers.8.final_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.9.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.9.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.9.ffn1.output_dense.weight", "speech_encoder.encoder.layers.9.ffn1.output_dense.bias", 
"speech_encoder.encoder.layers.9.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.9.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.9.self_attn.pos_bias_u", "speech_encoder.encoder.layers.9.self_attn.pos_bias_v", "speech_encoder.encoder.layers.9.self_attn.linear_q.weight", "speech_encoder.encoder.layers.9.self_attn.linear_q.bias", "speech_encoder.encoder.layers.9.self_attn.linear_k.weight", "speech_encoder.encoder.layers.9.self_attn.linear_k.bias", "speech_encoder.encoder.layers.9.self_attn.linear_v.weight", "speech_encoder.encoder.layers.9.self_attn.linear_v.bias", "speech_encoder.encoder.layers.9.self_attn.linear_out.weight", "speech_encoder.encoder.layers.9.self_attn.linear_out.bias", "speech_encoder.encoder.layers.9.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.9.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.9.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.9.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.9.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.9.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.9.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.9.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.9.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.9.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.9.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.9.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.9.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.9.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.9.ffn2.output_dense.weight", "speech_encoder.encoder.layers.9.ffn2.output_dense.bias", "speech_encoder.encoder.layers.9.final_layer_norm.weight", "speech_encoder.encoder.layers.9.final_layer_norm.bias", "speech_encoder.encoder.layers.10.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.10.ffn1_layer_norm.bias", 
"speech_encoder.encoder.layers.10.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.10.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.10.ffn1.output_dense.weight", "speech_encoder.encoder.layers.10.ffn1.output_dense.bias", "speech_encoder.encoder.layers.10.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.10.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.10.self_attn.pos_bias_u", "speech_encoder.encoder.layers.10.self_attn.pos_bias_v", "speech_encoder.encoder.layers.10.self_attn.linear_q.weight", "speech_encoder.encoder.layers.10.self_attn.linear_q.bias", "speech_encoder.encoder.layers.10.self_attn.linear_k.weight", "speech_encoder.encoder.layers.10.self_attn.linear_k.bias", "speech_encoder.encoder.layers.10.self_attn.linear_v.weight", "speech_encoder.encoder.layers.10.self_attn.linear_v.bias", "speech_encoder.encoder.layers.10.self_attn.linear_out.weight", "speech_encoder.encoder.layers.10.self_attn.linear_out.bias", "speech_encoder.encoder.layers.10.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.10.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.10.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.10.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.10.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.10.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.10.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.10.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.10.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.10.conv_module.pointwise_conv2.weight", "speech_encoder.encoder.layers.10.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.10.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.10.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.10.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.10.ffn2.output_dense.weight", 
"speech_encoder.encoder.layers.10.ffn2.output_dense.bias", "speech_encoder.encoder.layers.10.final_layer_norm.weight", "speech_encoder.encoder.layers.10.final_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn1_layer_norm.weight", "speech_encoder.encoder.layers.11.ffn1_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn1.intermediate_dense.weight", "speech_encoder.encoder.layers.11.ffn1.intermediate_dense.bias", "speech_encoder.encoder.layers.11.ffn1.output_dense.weight", "speech_encoder.encoder.layers.11.ffn1.output_dense.bias", "speech_encoder.encoder.layers.11.self_attn_layer_norm.weight", "speech_encoder.encoder.layers.11.self_attn_layer_norm.bias", "speech_encoder.encoder.layers.11.self_attn.pos_bias_u", "speech_encoder.encoder.layers.11.self_attn.pos_bias_v", "speech_encoder.encoder.layers.11.self_attn.linear_q.weight", "speech_encoder.encoder.layers.11.self_attn.linear_q.bias", "speech_encoder.encoder.layers.11.self_attn.linear_k.weight", "speech_encoder.encoder.layers.11.self_attn.linear_k.bias", "speech_encoder.encoder.layers.11.self_attn.linear_v.weight", "speech_encoder.encoder.layers.11.self_attn.linear_v.bias", "speech_encoder.encoder.layers.11.self_attn.linear_out.weight", "speech_encoder.encoder.layers.11.self_attn.linear_out.bias", "speech_encoder.encoder.layers.11.self_attn.linear_pos.weight", "speech_encoder.encoder.layers.11.conv_module.layer_norm.weight", "speech_encoder.encoder.layers.11.conv_module.layer_norm.bias", "speech_encoder.encoder.layers.11.conv_module.pointwise_conv1.weight", "speech_encoder.encoder.layers.11.conv_module.depthwise_conv.weight", "speech_encoder.encoder.layers.11.conv_module.batch_norm.weight", "speech_encoder.encoder.layers.11.conv_module.batch_norm.bias", "speech_encoder.encoder.layers.11.conv_module.batch_norm.running_mean", "speech_encoder.encoder.layers.11.conv_module.batch_norm.running_var", "speech_encoder.encoder.layers.11.conv_module.pointwise_conv2.weight", 
"speech_encoder.encoder.layers.11.ffn2_layer_norm.weight", "speech_encoder.encoder.layers.11.ffn2_layer_norm.bias", "speech_encoder.encoder.layers.11.ffn2.intermediate_dense.weight", "speech_encoder.encoder.layers.11.ffn2.intermediate_dense.bias", "speech_encoder.encoder.layers.11.ffn2.output_dense.weight", "speech_encoder.encoder.layers.11.ffn2.output_dense.bias", "speech_encoder.encoder.layers.11.final_layer_norm.weight", "speech_encoder.encoder.layers.11.final_layer_norm.bias", "speech_encoder.encoder.layer_norm.weight", "speech_encoder.encoder.layer_norm.bias", "speech_encoder.intermediate_ffn.intermediate_dense.weight", "speech_encoder.intermediate_ffn.intermediate_dense.bias", "speech_encoder.intermediate_ffn.output_dense.weight", "speech_encoder.intermediate_ffn.output_dense.bias", "speech_encoder.adapter.layers.0.residual_layer_norm.weight", "speech_encoder.adapter.layers.0.residual_layer_norm.bias", "speech_encoder.adapter.layers.0.residual_conv.weight", "speech_encoder.adapter.layers.0.residual_conv.bias", "speech_encoder.adapter.layers.0.self_attn_layer_norm.weight", "speech_encoder.adapter.layers.0.self_attn_layer_norm.bias", "speech_encoder.adapter.layers.0.self_attn_conv.weight", "speech_encoder.adapter.layers.0.self_attn_conv.bias", "speech_encoder.adapter.layers.0.self_attn.linear_q.weight", "speech_encoder.adapter.layers.0.self_attn.linear_q.bias", "speech_encoder.adapter.layers.0.self_attn.linear_k.weight", "speech_encoder.adapter.layers.0.self_attn.linear_k.bias", "speech_encoder.adapter.layers.0.self_attn.linear_v.weight", "speech_encoder.adapter.layers.0.self_attn.linear_v.bias", "speech_encoder.adapter.layers.0.self_attn.linear_out.weight", "speech_encoder.adapter.layers.0.self_attn.linear_out.bias", "speech_encoder.adapter.layers.0.ffn_layer_norm.weight", "speech_encoder.adapter.layers.0.ffn_layer_norm.bias", "speech_encoder.adapter.layers.0.ffn.intermediate_dense.weight", "speech_encoder.adapter.layers.0.ffn.intermediate_dense.bias", 
"speech_encoder.adapter.layers.0.ffn.output_dense.weight", "speech_encoder.adapter.layers.0.ffn.output_dense.bias", "speech_encoder.inner_layer_norm.weight", "speech_encoder.inner_layer_norm.bias", "text_decoder.embed_tokens.weight", "text_decoder.layers.0.self_attn.k_proj.weight", "text_decoder.layers.0.self_attn.k_proj.bias", "text_decoder.layers.0.self_attn.v_proj.weight", "text_decoder.layers.0.self_attn.v_proj.bias", "text_decoder.layers.0.self_attn.q_proj.weight", "text_decoder.layers.0.self_attn.q_proj.bias", "text_decoder.layers.0.self_attn.out_proj.weight", "text_decoder.layers.0.self_attn.out_proj.bias", "text_decoder.layers.0.self_attn_layer_norm.weight", "text_decoder.layers.0.self_attn_layer_norm.bias", "text_decoder.layers.0.cross_attention.k_proj.weight", "text_decoder.layers.0.cross_attention.k_proj.bias", "text_decoder.layers.0.cross_attention.v_proj.weight", "text_decoder.layers.0.cross_attention.v_proj.bias", "text_decoder.layers.0.cross_attention.q_proj.weight", "text_decoder.layers.0.cross_attention.q_proj.bias", "text_decoder.layers.0.cross_attention.out_proj.weight", "text_decoder.layers.0.cross_attention.out_proj.bias", "text_decoder.layers.0.cross_attention_layer_norm.weight", "text_decoder.layers.0.cross_attention_layer_norm.bias", "text_decoder.layers.0.ffn.fc1.weight", "text_decoder.layers.0.ffn.fc1.bias", "text_decoder.layers.0.ffn.fc2.weight", "text_decoder.layers.0.ffn.fc2.bias", "text_decoder.layers.0.ffn_layer_norm.weight", "text_decoder.layers.0.ffn_layer_norm.bias", "text_decoder.layers.1.self_attn.k_proj.weight", "text_decoder.layers.1.self_attn.k_proj.bias", "text_decoder.layers.1.self_attn.v_proj.weight", "text_decoder.layers.1.self_attn.v_proj.bias", "text_decoder.layers.1.self_attn.q_proj.weight", "text_decoder.layers.1.self_attn.q_proj.bias", "text_decoder.layers.1.self_attn.out_proj.weight", "text_decoder.layers.1.self_attn.out_proj.bias", "text_decoder.layers.1.self_attn_layer_norm.weight", 
"text_decoder.layers.1.self_attn_layer_norm.bias", "text_decoder.layers.1.cross_attention.k_proj.weight", "text_decoder.layers.1.cross_attention.k_proj.bias", "text_decoder.layers.1.cross_attention.v_proj.weight", "text_decoder.layers.1.cross_attention.v_proj.bias", "text_decoder.layers.1.cross_attention.q_proj.weight", "text_decoder.layers.1.cross_attention.q_proj.bias", "text_decoder.layers.1.cross_attention.out_proj.weight", "text_decoder.layers.1.cross_attention.out_proj.bias", "text_decoder.layers.1.cross_attention_layer_norm.weight", "text_decoder.layers.1.cross_attention_layer_norm.bias", "text_decoder.layers.1.ffn.fc1.weight", "text_decoder.layers.1.ffn.fc1.bias", "text_decoder.layers.1.ffn.fc2.weight", "text_decoder.layers.1.ffn.fc2.bias", "text_decoder.layers.1.ffn_layer_norm.weight", "text_decoder.layers.1.ffn_layer_norm.bias", "text_decoder.layers.2.self_attn.k_proj.weight", "text_decoder.layers.2.self_attn.k_proj.bias", "text_decoder.layers.2.self_attn.v_proj.weight", "text_decoder.layers.2.self_attn.v_proj.bias", "text_decoder.layers.2.self_attn.q_proj.weight", "text_decoder.layers.2.self_attn.q_proj.bias", "text_decoder.layers.2.self_attn.out_proj.weight", "text_decoder.layers.2.self_attn.out_proj.bias", "text_decoder.layers.2.self_attn_layer_norm.weight", "text_decoder.layers.2.self_attn_layer_norm.bias", "text_decoder.layers.2.cross_attention.k_proj.weight", "text_decoder.layers.2.cross_attention.k_proj.bias", "text_decoder.layers.2.cross_attention.v_proj.weight", "text_decoder.layers.2.cross_attention.v_proj.bias", "text_decoder.layers.2.cross_attention.q_proj.weight", "text_decoder.layers.2.cross_attention.q_proj.bias", "text_decoder.layers.2.cross_attention.out_proj.weight", "text_decoder.layers.2.cross_attention.out_proj.bias", "text_decoder.layers.2.cross_attention_layer_norm.weight", "text_decoder.layers.2.cross_attention_layer_norm.bias", "text_decoder.layers.2.ffn.fc1.weight", "text_decoder.layers.2.ffn.fc1.bias", 
"text_decoder.layers.2.ffn.fc2.weight", "text_decoder.layers.2.ffn.fc2.bias", "text_decoder.layers.2.ffn_layer_norm.weight", "text_decoder.layers.2.ffn_layer_norm.bias", "text_decoder.layers.3.self_attn.k_proj.weight", "text_decoder.layers.3.self_attn.k_proj.bias", "text_decoder.layers.3.self_attn.v_proj.weight", "text_decoder.layers.3.self_attn.v_proj.bias", "text_decoder.layers.3.self_attn.q_proj.weight", "text_decoder.layers.3.self_attn.q_proj.bias", "text_decoder.layers.3.self_attn.out_proj.weight", "text_decoder.layers.3.self_attn.out_proj.bias", "text_decoder.layers.3.self_attn_layer_norm.weight", "text_decoder.layers.3.self_attn_layer_norm.bias", "text_decoder.layers.3.cross_attention.k_proj.weight", "text_decoder.layers.3.cross_attention.k_proj.bias", "text_decoder.layers.3.cross_attention.v_proj.weight", "text_decoder.layers.3.cross_attention.v_proj.bias", "text_decoder.layers.3.cross_attention.q_proj.weight", "text_decoder.layers.3.cross_attention.q_proj.bias", "text_decoder.layers.3.cross_attention.out_proj.weight", "text_decoder.layers.3.cross_attention.out_proj.bias", "text_decoder.layers.3.cross_attention_layer_norm.weight", "text_decoder.layers.3.cross_attention_layer_norm.bias", "text_decoder.layers.3.ffn.fc1.weight", "text_decoder.layers.3.ffn.fc1.bias", "text_decoder.layers.3.ffn.fc2.weight", "text_decoder.layers.3.ffn.fc2.bias", "text_decoder.layers.3.ffn_layer_norm.weight", "text_decoder.layers.3.ffn_layer_norm.bias", "text_decoder.layers.4.self_attn.k_proj.weight", "text_decoder.layers.4.self_attn.k_proj.bias", "text_decoder.layers.4.self_attn.v_proj.weight", "text_decoder.layers.4.self_attn.v_proj.bias", "text_decoder.layers.4.self_attn.q_proj.weight", "text_decoder.layers.4.self_attn.q_proj.bias", "text_decoder.layers.4.self_attn.out_proj.weight", "text_decoder.layers.4.self_attn.out_proj.bias", "text_decoder.layers.4.self_attn_layer_norm.weight", "text_decoder.layers.4.self_attn_layer_norm.bias", 
"text_decoder.layers.4.cross_attention.k_proj.weight", "text_decoder.layers.4.cross_attention.k_proj.bias", "text_decoder.layers.4.cross_attention.v_proj.weight", "text_decoder.layers.4.cross_attention.v_proj.bias", "text_decoder.layers.4.cross_attention.q_proj.weight", "text_decoder.layers.4.cross_attention.q_proj.bias", "text_decoder.layers.4.cross_attention.out_proj.weight", "text_decoder.layers.4.cross_attention.out_proj.bias", "text_decoder.layers.4.cross_attention_layer_norm.weight", "text_decoder.layers.4.cross_attention_layer_norm.bias", "text_decoder.layers.4.ffn.fc1.weight", "text_decoder.layers.4.ffn.fc1.bias", "text_decoder.layers.4.ffn.fc2.weight", "text_decoder.layers.4.ffn.fc2.bias", "text_decoder.layers.4.ffn_layer_norm.weight", "text_decoder.layers.4.ffn_layer_norm.bias", "text_decoder.layers.5.self_attn.k_proj.weight", "text_decoder.layers.5.self_attn.k_proj.bias", "text_decoder.layers.5.self_attn.v_proj.weight", "text_decoder.layers.5.self_attn.v_proj.bias", "text_decoder.layers.5.self_attn.q_proj.weight", "text_decoder.layers.5.self_attn.q_proj.bias", "text_decoder.layers.5.self_attn.out_proj.weight", "text_decoder.layers.5.self_attn.out_proj.bias", "text_decoder.layers.5.self_attn_layer_norm.weight", "text_decoder.layers.5.self_attn_layer_norm.bias", "text_decoder.layers.5.cross_attention.k_proj.weight", "text_decoder.layers.5.cross_attention.k_proj.bias", "text_decoder.layers.5.cross_attention.v_proj.weight", "text_decoder.layers.5.cross_attention.v_proj.bias", "text_decoder.layers.5.cross_attention.q_proj.weight", "text_decoder.layers.5.cross_attention.q_proj.bias", "text_decoder.layers.5.cross_attention.out_proj.weight", "text_decoder.layers.5.cross_attention.out_proj.bias", "text_decoder.layers.5.cross_attention_layer_norm.weight", "text_decoder.layers.5.cross_attention_layer_norm.bias", "text_decoder.layers.5.ffn.fc1.weight", "text_decoder.layers.5.ffn.fc1.bias", "text_decoder.layers.5.ffn.fc2.weight", "text_decoder.layers.5.ffn.fc2.bias", 
"text_decoder.layers.5.ffn_layer_norm.weight", "text_decoder.layers.5.ffn_layer_norm.bias", "text_decoder.layers.6.self_attn.k_proj.weight", "text_decoder.layers.6.self_attn.k_proj.bias", "text_decoder.layers.6.self_attn.v_proj.weight", "text_decoder.layers.6.self_attn.v_proj.bias", "text_decoder.layers.6.self_attn.q_proj.weight", "text_decoder.layers.6.self_attn.q_proj.bias", "text_decoder.layers.6.self_attn.out_proj.weight", "text_decoder.layers.6.self_attn.out_proj.bias", "text_decoder.layers.6.self_attn_layer_norm.weight", "text_decoder.layers.6.self_attn_layer_norm.bias", "text_decoder.layers.6.cross_attention.k_proj.weight", "text_decoder.layers.6.cross_attention.k_proj.bias", "text_decoder.layers.6.cross_attention.v_proj.weight", "text_decoder.layers.6.cross_attention.v_proj.bias", "text_decoder.layers.6.cross_attention.q_proj.weight", "text_decoder.layers.6.cross_attention.q_proj.bias", "text_decoder.layers.6.cross_attention.out_proj.weight", "text_decoder.layers.6.cross_attention.out_proj.bias", "text_decoder.layers.6.cross_attention_layer_norm.weight", "text_decoder.layers.6.cross_attention_layer_norm.bias", "text_decoder.layers.6.ffn.fc1.weight", "text_decoder.layers.6.ffn.fc1.bias", "text_decoder.layers.6.ffn.fc2.weight", "text_decoder.layers.6.ffn.fc2.bias", "text_decoder.layers.6.ffn_layer_norm.weight", "text_decoder.layers.6.ffn_layer_norm.bias", "text_decoder.layers.7.self_attn.k_proj.weight", "text_decoder.layers.7.self_attn.k_proj.bias", "text_decoder.layers.7.self_attn.v_proj.weight", "text_decoder.layers.7.self_attn.v_proj.bias", "text_decoder.layers.7.self_attn.q_proj.weight", "text_decoder.layers.7.self_attn.q_proj.bias", "text_decoder.layers.7.self_attn.out_proj.weight", "text_decoder.layers.7.self_attn.out_proj.bias", "text_decoder.layers.7.self_attn_layer_norm.weight", "text_decoder.layers.7.self_attn_layer_norm.bias", "text_decoder.layers.7.cross_attention.k_proj.weight", "text_decoder.layers.7.cross_attention.k_proj.bias", 
"text_decoder.layers.7.cross_attention.v_proj.weight", "text_decoder.layers.7.cross_attention.v_proj.bias", "text_decoder.layers.7.cross_attention.q_proj.weight", "text_decoder.layers.7.cross_attention.q_proj.bias", "text_decoder.layers.7.cross_attention.out_proj.weight", "text_decoder.layers.7.cross_attention.out_proj.bias", "text_decoder.layers.7.cross_attention_layer_norm.weight", "text_decoder.layers.7.cross_attention_layer_norm.bias", "text_decoder.layers.7.ffn.fc1.weight", "text_decoder.layers.7.ffn.fc1.bias", "text_decoder.layers.7.ffn.fc2.weight", "text_decoder.layers.7.ffn.fc2.bias", "text_decoder.layers.7.ffn_layer_norm.weight", "text_decoder.layers.7.ffn_layer_norm.bias", "text_decoder.layers.8.self_attn.k_proj.weight", "text_decoder.layers.8.self_attn.k_proj.bias", "text_decoder.layers.8.self_attn.v_proj.weight", "text_decoder.layers.8.self_attn.v_proj.bias", "text_decoder.layers.8.self_attn.q_proj.weight", "text_decoder.layers.8.self_attn.q_proj.bias", "text_decoder.layers.8.self_attn.out_proj.weight", "text_decoder.layers.8.self_attn.out_proj.bias", "text_decoder.layers.8.self_attn_layer_norm.weight", "text_decoder.layers.8.self_attn_layer_norm.bias", "text_decoder.layers.8.cross_attention.k_proj.weight", "text_decoder.layers.8.cross_attention.k_proj.bias", "text_decoder.layers.8.cross_attention.v_proj.weight", "text_decoder.layers.8.cross_attention.v_proj.bias", "text_decoder.layers.8.cross_attention.q_proj.weight", "text_decoder.layers.8.cross_attention.q_proj.bias", "text_decoder.layers.8.cross_attention.out_proj.weight", "text_decoder.layers.8.cross_attention.out_proj.bias", "text_decoder.layers.8.cross_attention_layer_norm.weight", "text_decoder.layers.8.cross_attention_layer_norm.bias", "text_decoder.layers.8.ffn.fc1.weight", "text_decoder.layers.8.ffn.fc1.bias", "text_decoder.layers.8.ffn.fc2.weight", "text_decoder.layers.8.ffn.fc2.bias", "text_decoder.layers.8.ffn_layer_norm.weight", "text_decoder.layers.8.ffn_layer_norm.bias", 
"text_decoder.layers.9.self_attn.k_proj.weight", "text_decoder.layers.9.self_attn.k_proj.bias", "text_decoder.layers.9.self_attn.v_proj.weight", "text_decoder.layers.9.self_attn.v_proj.bias", "text_decoder.layers.9.self_attn.q_proj.weight", "text_decoder.layers.9.self_attn.q_proj.bias", "text_decoder.layers.9.self_attn.out_proj.weight", "text_decoder.layers.9.self_attn.out_proj.bias", "text_decoder.layers.9.self_attn_layer_norm.weight", "text_decoder.layers.9.self_attn_layer_norm.bias", "text_decoder.layers.9.cross_attention.k_proj.weight", "text_decoder.layers.9.cross_attention.k_proj.bias", "text_decoder.layers.9.cross_attention.v_proj.weight", "text_decoder.layers.9.cross_attention.v_proj.bias", "text_decoder.layers.9.cross_attention.q_proj.weight", "text_decoder.layers.9.cross_attention.q_proj.bias", "text_decoder.layers.9.cross_attention.out_proj.weight", "text_decoder.layers.9.cross_attention.out_proj.bias", "text_decoder.layers.9.cross_attention_layer_norm.weight", "text_decoder.layers.9.cross_attention_layer_norm.bias", "text_decoder.layers.9.ffn.fc1.weight", "text_decoder.layers.9.ffn.fc1.bias", "text_decoder.layers.9.ffn.fc2.weight", "text_decoder.layers.9.ffn.fc2.bias", "text_decoder.layers.9.ffn_layer_norm.weight", "text_decoder.layers.9.ffn_layer_norm.bias", "text_decoder.layers.10.self_attn.k_proj.weight", "text_decoder.layers.10.self_attn.k_proj.bias", "text_decoder.layers.10.self_attn.v_proj.weight", "text_decoder.layers.10.self_attn.v_proj.bias", "text_decoder.layers.10.self_attn.q_proj.weight", "text_decoder.layers.10.self_attn.q_proj.bias", "text_decoder.layers.10.self_attn.out_proj.weight", "text_decoder.layers.10.self_attn.out_proj.bias", "text_decoder.layers.10.self_attn_layer_norm.weight", "text_decoder.layers.10.self_attn_layer_norm.bias", "text_decoder.layers.10.cross_attention.k_proj.weight", "text_decoder.layers.10.cross_attention.k_proj.bias", "text_decoder.layers.10.cross_attention.v_proj.weight", 
"text_decoder.layers.10.cross_attention.v_proj.bias", "text_decoder.layers.10.cross_attention.q_proj.weight", "text_decoder.layers.10.cross_attention.q_proj.bias", "text_decoder.layers.10.cross_attention.out_proj.weight", "text_decoder.layers.10.cross_attention.out_proj.bias", "text_decoder.layers.10.cross_attention_layer_norm.weight", "text_decoder.layers.10.cross_attention_layer_norm.bias", "text_decoder.layers.10.ffn.fc1.weight", "text_decoder.layers.10.ffn.fc1.bias", "text_decoder.layers.10.ffn.fc2.weight", "text_decoder.layers.10.ffn.fc2.bias", "text_decoder.layers.10.ffn_layer_norm.weight", "text_decoder.layers.10.ffn_layer_norm.bias", "text_decoder.layers.11.self_attn.k_proj.weight", "text_decoder.layers.11.self_attn.k_proj.bias", "text_decoder.layers.11.self_attn.v_proj.weight", "text_decoder.layers.11.self_attn.v_proj.bias", "text_decoder.layers.11.self_attn.q_proj.weight", "text_decoder.layers.11.self_attn.q_proj.bias", "text_decoder.layers.11.self_attn.out_proj.weight", "text_decoder.layers.11.self_attn.out_proj.bias", "text_decoder.layers.11.self_attn_layer_norm.weight", "text_decoder.layers.11.self_attn_layer_norm.bias", "text_decoder.layers.11.cross_attention.k_proj.weight", "text_decoder.layers.11.cross_attention.k_proj.bias", "text_decoder.layers.11.cross_attention.v_proj.weight", "text_decoder.layers.11.cross_attention.v_proj.bias", "text_decoder.layers.11.cross_attention.q_proj.weight", "text_decoder.layers.11.cross_attention.q_proj.bias", "text_decoder.layers.11.cross_attention.out_proj.weight", "text_decoder.layers.11.cross_attention.out_proj.bias", "text_decoder.layers.11.cross_attention_layer_norm.weight", "text_decoder.layers.11.cross_attention_layer_norm.bias", "text_decoder.layers.11.ffn.fc1.weight", "text_decoder.layers.11.ffn.fc1.bias", "text_decoder.layers.11.ffn.fc2.weight", "text_decoder.layers.11.ffn.fc2.bias", "text_decoder.layers.11.ffn_layer_norm.weight", "text_decoder.layers.11.ffn_layer_norm.bias", 
"text_decoder.layer_norm.weight", "text_decoder.layer_norm.bias", "lm_head.weight", "t2u_model.model.encoder.layers.0.self_attn.k_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.k_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.v_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.v_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.q_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.q_proj.bias", "t2u_model.model.encoder.layers.0.self_attn.out_proj.weight", "t2u_model.model.encoder.layers.0.self_attn.out_proj.bias", "t2u_model.model.encoder.layers.0.self_attn_layer_norm.weight", "t2u_model.model.encoder.layers.0.self_attn_layer_norm.bias", "t2u_model.model.encoder.layers.0.ffn.fc1.weight", "t2u_model.model.encoder.layers.0.ffn.fc1.bias", "t2u_model.model.encoder.layers.0.ffn.fc2.weight", "t2u_model.model.encoder.layers.0.ffn.fc2.bias", "t2u_model.model.encoder.layers.0.ffn_layer_norm.weight", "t2u_model.model.encoder.layers.0.ffn_layer_norm.bias", "t2u_model.model.encoder.layers.1.self_attn.k_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.k_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.v_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.v_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.q_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.q_proj.bias", "t2u_model.model.encoder.layers.1.self_attn.out_proj.weight", "t2u_model.model.encoder.layers.1.self_attn.out_proj.bias", "t2u_model.model.encoder.layers.1.self_attn_layer_norm.weight", "t2u_model.model.encoder.layers.1.self_attn_layer_norm.bias", "t2u_model.model.encoder.layers.1.ffn.fc1.weight", "t2u_model.model.encoder.layers.1.ffn.fc1.bias", "t2u_model.model.encoder.layers.1.ffn.fc2.weight", "t2u_model.model.encoder.layers.1.ffn.fc2.bias", "t2u_model.model.encoder.layers.1.ffn_layer_norm.weight", "t2u_model.model.encoder.layers.1.ffn_layer_norm.bias", 
    ...
    Unexpected key(s) in state_dict: "model_name", "model". 

Expected behavior

Expected to load the new fine-tuned model and then save it to a new model file.

amyeroberts commented 2 weeks ago

Hi @ivanhe123, thanks for opening this issue!

From the error, it looks like the keys in the state dict new_model do not match the keys that SeamlessM4TModel expects. You can list the expected keys by calling model_seam.state_dict().keys().
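
One way to make that comparison concrete is to diff the two key sets directly. The following is a minimal sketch: diff_state_dict_keys is a hypothetical helper (not part of transformers or PyTorch), and the toy dictionaries only stand in for the real model and checkpoint.

```python
def diff_state_dict_keys(expected_keys, checkpoint):
    """Return (missing, unexpected) key lists, mirroring what load_state_dict reports."""
    expected = set(expected_keys)
    found = set(checkpoint.keys())
    missing = sorted(expected - found)      # keys the model wants but the file lacks
    unexpected = sorted(found - expected)   # keys in the file the model does not know
    return missing, unexpected


# Toy stand-ins (assumptions for illustration, not the real checkpoint contents).
expected_keys = ["shared.weight", "lm_head.weight"]
checkpoint = {"model_name": "demo", "model": {"shared.weight": 0, "lm_head.weight": 0}}

missing, unexpected = diff_state_dict_keys(expected_keys, checkpoint)
print("missing:", missing)
print("unexpected:", unexpected)
```

With the real objects, the call would be diff_state_dict_keys(model_seam.state_dict().keys(), new_model). If every model key is reported missing and only a couple of top-level keys such as "model_name" and "model" are unexpected, as in the traceback above, the weights are probably nested one level down in the file.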

Note that you don't need to load a pretrained checkpoint first and then overwrite its weights. You can initialize a model with the same architecture and empty weights from the config alone:

import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, SeamlessM4TModel

config = AutoConfig.from_pretrained("facebook/hf-seamless-m4t-medium")

with init_empty_weights():
    model = SeamlessM4TModel(config)

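# Load the fine-tuned checkpoint on CPU
new_model = torch.load("./expt4_m4tM.pt", map_location="cpu")
# Assumption: the error reports unexpected keys "model_name" and "model",
# so the weights are probably nested under "model"
# assign=True (PyTorch >= 2.1) is needed because the model holds empty (meta) weights
model.load_state_dict(new_model["model"], assign=True)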
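
The "Unexpected key(s)" line in the traceback ("model_name", "model") suggests the .pt file is a wrapper dictionary rather than a bare state dict. As a rough sketch under that assumption, a small helper can unwrap it before loading. unwrap_state_dict and the wrapper key names it tries are hypothetical, chosen for illustration, and the toy checkpoint stands in for the real file.

```python
from collections.abc import Mapping


def unwrap_state_dict(checkpoint, wrapper_keys=("model", "state_dict")):
    """Return the nested weights mapping if the checkpoint wraps one, else the checkpoint."""
    for key in wrapper_keys:
        inner = checkpoint.get(key)
        if isinstance(inner, Mapping):
            return inner
    return checkpoint


# Toy stand-in for torch.load("./expt4_m4tM.pt") (contents are assumed).
wrapped = {"model_name": "demo", "model": {"shared.weight": 0, "lm_head.weight": 0}}
state_dict = unwrap_state_dict(wrapped)
print(sorted(state_dict.keys()))

# A bare state dict passes through unchanged.
bare = {"shared.weight": 0}
assert unwrap_state_dict(bare) is bare
```

With the real file, the result of unwrap_state_dict(torch.load("./expt4_m4tM.pt", map_location="cpu")) would be passed to load_state_dict. Because the model above is created under init_empty_weights, passing assign=True (available in PyTorch 2.1 and later) is likely needed so that the checkpoint tensors replace the empty parameters. If the inner keys still differ from model.state_dict().keys(), they would need to be renamed before loading.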