microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

How to fine-tune SID on the pretrained model? #42

Closed haha010508 closed 1 year ago

haha010508 commented 1 year ago

I want to run the SID pretrained model, but I get an error like this:

```
generate_class.py: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
```

Thinking I had to fine-tune SID first, I ran the SID fine-tuning script and got the same invalid-choice error from fairseq-train, after which it printed:

```
SID finetuning finished
```

So, how do I run the model correctly? Thanks!

mechanicalsea commented 1 year ago

It seems USER_DIR was not given. Set USER_DIR to the speecht5 code directory so that fairseq adds the 'speecht5' task to its task list.
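
For a quick sanity check that fairseq actually picks the task up, something like the following should print True. This is a minimal sketch: the path is an assumption for your checkout, and `import_user_module`/`TASK_REGISTRY` are fairseq internals that can move between versions.

```python
# Sketch: confirm the 'speecht5' task registers once USER_DIR is importable.
from argparse import Namespace
from fairseq import utils
from fairseq.tasks import TASK_REGISTRY

# Path is an assumption; point it at the SpeechT5/speecht5 code directory.
utils.import_user_module(Namespace(user_dir="/project/SpeechT5/SpeechT5/speecht5"))
print("speecht5" in TASK_REGISTRY)  # expect: True
```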

haha010508 commented 1 year ago

This is my code:

```bash
CHECKPOINT_PATH=/project/SpeechT5/SpeechT5/pretrained_models/speecht5_sid.pt
DATA_ROOT=/project/SpeechT5/SpeechT5/manifest
SUBSET=test
USER_DIR=/project/SpeechT5/SpeechT5/speecht5
RESULTS_PATH=/project/SpeechT5/SpeechT5/experimental/s2c/results

mkdir -p ${RESULTS_PATH}

python scripts/generate_class.py ${DATA_ROOT} \
  --gen-subset ${SUBSET} \
  --user-dir ${USER_DIR} \
  --log-format json \
  --task speecht5 \
  --t5-task s2c \
  --path ${CHECKPOINT_PATH} \
  --results-path ${RESULTS_PATH} \
  --batch-size 1 \
  --max-speech-positions 8000 \
  --sample-rate 16000 | tee -a ${RESULTS_PATH}/generate-class.txt
```

And if I debug the code, I get this error:

```
python -m ipdb scripts/generate_class.py ...
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
```

And if I run the code, I get this error: `soundfile.LibsndfileError: <exception str() failed>`. Does this mean a wav file is missing? If so, how do I specify the file path? I have already downloaded VoxCeleb1.

I don't know why I cannot debug the code, or why debugging and running produce different errors. Thanks very much!

mechanicalsea commented 1 year ago

This import error looks like "Multi-GPU training doesn't work when --user-dir is specified". Move or symlink USER_DIR into the fairseq/examples directory and use that as USER_DIR. The issue is tracked at https://github.com/facebookresearch/fairseq/issues/4875.
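
For example, a minimal sketch of that workaround; the fairseq checkout path is an assumption:

```python
# Sketch: link the SpeechT5 task code into fairseq/examples and point
# --user-dir at the link. The fairseq path is hypothetical.
import os

src = "/project/SpeechT5/SpeechT5/speecht5"    # SpeechT5 task code
dst = "/path/to/fairseq/examples/speecht5"     # inside the fairseq checkout
if not os.path.lexists(dst):
    os.symlink(src, dst)
# then run with: --user-dir /path/to/fairseq/examples/speecht5
```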

haha010508 commented 1 year ago

> This import error looks like "Multi-GPU training doesn't work when --user-dir is specified". Move or symlink USER_DIR into the fairseq/examples directory and use that as USER_DIR. The issue is tracked at facebookresearch/fairseq#4875.

Thanks for your reply. I tried it, but I got the same error.

haha010508 commented 1 year ago

I found that this is a bug: `from fairseq import metrics, search, tokenizer, utils` raises `ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)`, yet the metrics module lives in fairseq/logging.

mechanicalsea commented 1 year ago

> I found that this is a bug: `from fairseq import metrics, search, tokenizer, utils` raises `ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)`, yet the metrics module lives in fairseq/logging.

It seems to be an issue caused by the torch version; it occurred when I reimplemented SpeechT5 in a new environment. Could you provide some details of your environment? By the way, I usually run SpeechT5 with torch 1.10.x.

haha010508 commented 1 year ago

The issue is caused by fairseq: you need to move metrics.py and meters.py from fairseq/logging to the fairseq folder, and then the error disappears. My torch version is 2.0.0, but I did not install it myself; it was pulled in by fairseq or ESPnet.
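
For reference, a minimal sketch of that workaround; the fairseq checkout path is an assumption, and copying rather than moving keeps the original package layout intact:

```python
# Sketch of the workaround above: copy the logging modules up one level so
# `from fairseq import metrics` resolves. FAIRSEQ_PKG is a hypothetical path
# to the inner fairseq package directory of a source checkout.
import shutil
from pathlib import Path

FAIRSEQ_PKG = Path("/path/to/fairseq/fairseq")
for name in ("metrics.py", "meters.py"):
    shutil.copy(FAIRSEQ_PKG / "logging" / name, FAIRSEQ_PKG / name)
```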

haha010508 commented 1 year ago

By the way, did you use VoxCeleb1-O to evaluate speaker performance, and what is the EER score? Usually we enroll some speakers from the dataset; at test time we extract an embedding, compute its cosine similarity with the enrolled speakers' embeddings, and decide whether they are the same speaker. Speaker verification does not use the speaker-classification method. So, compared with ECAPA-TDNN, how does the SpeechT5 SID model perform?

mechanicalsea commented 1 year ago

> By the way, did you use VoxCeleb1-O to evaluate speaker performance, and what is the EER score? Usually we enroll some speakers from the dataset; at test time we extract an embedding, compute its cosine similarity with the enrolled speakers' embeddings, and decide whether they are the same speaker. Speaker verification does not use the speaker-classification method. So, compared with ECAPA-TDNN, how does the SpeechT5 SID model perform?

For SID, the fine-tuned SpeechT5 achieves 96.46% accuracy. The ECAPA-TDNN paper did not report VoxCeleb1 SID results, which makes a direct comparison with SpeechT5 difficult. For ASV (which reports EER), we did not fine-tune SpeechT5 on that task. If we want to compare SpeechT5 and ECAPA-TDNN, we first have to extract a speaker embedding from SpeechT5. Generally speaking, we can treat the hidden state fed into the decoder's classifier as the speaker embedding, which makes a comparison with ECAPA-TDNN possible. Alternatively, we could build a speaker model like Transformer variant (a) to obtain speaker embeddings.
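
To make the verification protocol concrete, here is a minimal sketch of cosine-similarity ASV scoring in PyTorch. It assumes fixed-size speaker embeddings are already extracted (e.g. the 768-dim hidden state discussed below); the decision threshold is a tuning assumption, not a value from the paper.

```python
# Minimal sketch of cosine-similarity speaker verification (ASV), given
# fixed-size speaker embeddings (e.g. 768-dim vectors from SpeechT5).
import torch
import torch.nn.functional as F

def asv_score(enrolled: torch.Tensor, test: torch.Tensor) -> float:
    """Cosine similarity between an enrolled and a test embedding."""
    return F.cosine_similarity(enrolled.unsqueeze(0), test.unsqueeze(0)).item()

def same_speaker(enrolled: torch.Tensor, test: torch.Tensor,
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the similarity exceeds the decision threshold."""
    return asv_score(enrolled, test) >= threshold

# Usage with random stand-ins for two utterance embeddings:
emb_a, emb_b = torch.randn(768), torch.randn(768)
print(asv_score(emb_a, emb_b), same_speaker(emb_a, emb_b))
```

EER would then be obtained by sweeping the threshold over a trial list until the false-accept and false-reject rates are equal.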

haha010508 commented 1 year ago

So we can get the speaker embedding from this line: https://github.com/microsoft/SpeechT5/blob/7134e960999bc20d1d80650f7361f35d5fd8d38a/SpeechT5/speecht5/models/speecht5.py#L1183, right? A 768-dim vector?

mechanicalsea commented 1 year ago

> So we can get the speaker embedding from this line:
> https://github.com/microsoft/SpeechT5/blob/7134e960999bc20d1d80650f7361f35d5fd8d38a/SpeechT5/speecht5/models/speecht5.py#L1183
> Right? A 768-dim vector?

Yes.