microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

Is "unispeech_sat.th" wrong ? #14

Closed: Damien-Da closed this issue 2 years ago

Damien-Da commented 2 years ago

Hello,

I think the "unispeech_sat.th" is wrong. I have just cloned the repository and tried the speaker verification with Unispeech-SAT and when I launch the example \:

```
python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt
```

I have an error (end of the traceback):

```
File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model
    assert model is not None, (
AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}.
Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat'])
Requested model type: bc_m_hubert
```

And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

Could you check it or help me? :-)
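
For reference, the observation can be confirmed by unpickling the config file directly. A minimal diagnostic sketch, assuming (as the traceback suggests) that unispeech_sat.th deserializes to a dict-like config:

```python
# Inspect the model name stored in the config file; fairseq fails because
# this '_name' is not in its model registry.
import torch

cfg = torch.load("unispeech_sat.th", map_location="cpu")
print(cfg["_name"])  # 'bc_m_hubert' in the broken file; fairseq expects 'unispeech_sat'
```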

czy97 commented 2 years ago

> I think the "unispeech_sat.th" is wrong. [...] Could you check it or help me? :-)

Sorry for the mistake. We changed the model name in the code but did not update it in "unispeech_sat.th". We have now updated the name in "unispeech_sat.th" as well: https://github.com/microsoft/UniSpeech/commit/2816e682dcec4661384cfe0c01f4121641c9954c.
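
For anyone still holding an old copy of the file, the same rename can be applied locally. A hedged sketch, again assuming the file unpickles to a dict-like config:

```python
# Rewrite the stale model name so it matches a type registered in
# fairseq.models (see the 'Available models' list in the traceback above).
import torch

cfg = torch.load("unispeech_sat.th", map_location="cpu")
cfg["_name"] = "unispeech_sat"
torch.save(cfg, "unispeech_sat.th")
```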

Damien-Da commented 2 years ago

Thank you, it works :-) But the similarity score given is 0.9981, while it should be 0.0317... I will investigate.

czy97 commented 2 years ago

> Thank you, it works :-) But the similarity score given is 0.9981, while it should be 0.0317... I will investigate.

I guess you only randomly initialized the model and did not use the pre-trained model from here: https://drive.google.com/file/d/10o6NHZsPXJn2k8n57e8Z_FkKh3V4TC3g/view.

czy97 commented 2 years ago

The "unispeech_sat.th" only stores the config to initialize the Unispeech_SAT model, not the model parameters checkpoint.

Damien-Da commented 2 years ago

Thank you, it works well! :-)

leijue222 commented 2 years ago

I ran verification.py and got this error:

```
from fairseq.models.wav2vec import Wav2VecModel
ImportError: cannot import name 'Wav2VecModel' from 'fairseq.models.wav2vec'
```

By the way, does this repository only contain the inference recipe verification.py? What about training code?
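
This ImportError usually means the fairseq that Python imports is not the copy this repo expects, since newer fairseq releases reorganized the wav2vec module. A hedged diagnostic sketch:

```python
# Check which fairseq copy is on sys.path and what its wav2vec module
# actually exports; Wav2VecModel may be absent in a mismatched install.
import fairseq
import fairseq.models.wav2vec as w2v

print(fairseq.__file__)
print([name for name in dir(w2v) if "Wav2Vec" in name])
```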

omarelejla commented 1 year ago

> I think the "unispeech_sat.th" is wrong. [...] Could you check it or help me? :-)

Also, I am working on a speaker verification example.

When I used the hubert model, my code worked well.

But I am facing the same problem mentioned here with the unispeech model, and I do not understand what mistake @czy97 solved by updating the code. I downloaded the repo recently and tried to use the unispeech_sat_large_finetune model, with the initial SAT config named unispeech_sat.th and the pre-trained model named unispeech_sat_large_finetune.pth.

I also used the UniSpeech-SAT fairseq, as mentioned on GitHub, for the unispeech_sat model.

@czy97 @Damien-Da

omarelejla commented 1 year ago

I ran the verification.py file in the downstream folder instead of the one in the src folder, and it works fine for both models (unispeech_sat_large_finetune.pth and hubert_large_finetune.pth), but the code is slow: around 10 seconds to produce a similarity score for two waves.

Any suggestions?

czy97 commented 1 year ago

Hi @omarelejla. Some checkpoints were mistakenly deleted recently, and old checkpoints were re-uploaded in their place. As explained above, we updated the code earlier, and the model name in the checkpoint has to match the code; the old checkpoints, however, were not updated. @Sanyuan-Chen uploaded these old checkpoints. @Sanyuan-Chen, can you help update the model name in them? Thanks.

omarelejla commented 1 year ago

@czy97 Thanks for your fast reply. I solved the problem by using verification.py in the downstreams folder.

Can you guide me on how to reduce the time the code takes to do verification? It currently takes around 10 seconds.

czy97 commented 1 year ago

@omarelejla Actually, the pre-trained model is very large, and if your audio is long, inference will take some time. If you just want to do verification quickly, you can use a smaller model without a large pre-trained model as the backend.
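
A rough way to see where the 10 seconds go, since loading the large checkpoint is usually the dominant one-time cost. A hedged sketch with illustrative file names:

```python
import time

import torch

t0 = time.time()
ckpt = torch.load("UniSpeech-SAT-Large.pt", map_location="cpu")
print(f"checkpoint load took {time.time() - t0:.1f}s")
# If loading dominates, keep the model resident in memory and score many
# trial pairs per process instead of relaunching verification.py per pair.
```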

omarelejla commented 1 year ago

@czy97 @Damien-Da Hi, I applied UniSpeech to the VoxCeleb1 dataset but did not get the same EER mentioned on GitHub. Is it because I did not do the score normalization mentioned in the paper "Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification"? If yes, how can I do the mentioned adaptive s-normalization in Python, please?

czy97 commented 1 year ago

You can refer to the article "Comparison of speaker recognition approaches for real applications". Besides, there are released speaker verification toolkits that support score normalization, such as wespeaker.

omarelejla commented 1 year ago

@czy97 Sorry, I have spent a lot of time trying to understand how to apply AS-norm to my similarities. I applied UniSpeech on VoxCeleb1 (the trial list contains 37,611 pairs) and calculated a similarity for each pair. How do I normalize these similarities to get the same results reported on GitHub and in the paper?

This is the equation below; what exactly does each term mean, please? In other words, μ is the mean of what? σ is the standard deviation of what? What are N1 and N2?

$$ s_{\text{norm}}(e, t) = \frac{1}{2}\left(\frac{s(e, t) - \mu(N_1)}{\sigma(N_1)} + \frac{s(e, t) - \mu(N_2)}{\sigma(N_2)}\right) $$

czy97 commented 1 year ago

@omarelejla There are indeed a lot of concepts you need to know, and I'm sorry I can't explain them all clearly in a few paragraphs. As I recommended above, maybe you can refer to some code, like the code in wespeaker.

Besides, N1 denotes the similarity scores between the enrollment utterance and the imposter cohort set, and N2 is the analogous set of scores between the test utterance and the cohort. You can think of this imposter cohort set as the speaker embeddings extracted from the training set.
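
Putting the pieces together, a minimal AS-norm sketch consistent with the equation and the explanation above; this is an assumed implementation (the function name and `top_k` value are illustrative), not code from this repo:

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=300):
    """Adaptive symmetric score normalization of one trial score s(e, t).

    enroll_cohort_scores: similarities between the enrollment utterance and
        the imposter cohort (these form N1).
    test_cohort_scores: similarities between the test utterance and the
        imposter cohort (these form N2).
    """
    n1 = np.sort(enroll_cohort_scores)[-top_k:]  # 'adaptive': keep the top-k closest imposters
    n2 = np.sort(test_cohort_scores)[-top_k:]
    return 0.5 * ((score - n1.mean()) / n1.std()
                  + (score - n2.mean()) / n2.std())
```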