microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

how to pre-train on a custom dataset ? #1

Closed StephennFernandes closed 2 years ago

StephennFernandes commented 2 years ago

Hey there, I am looking forward to pre-training SpeechT5 on a custom dataset, preferably multilingual datasets. Could I please get some references, documentation, etc. as a starting point? Thanks.

Ajyy commented 2 years ago

Hi,

Please follow the steps here to prepare the data.

You may also need to extract HuBERT labels for your custom dataset and prepare the speaker embeddings (using the multilingual speaker model). After that, you can follow our scripts to train the model.
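For the data preparation step, the manifest is just a .tsv file with the audio root directory on the first line and one `<relative path>\t<num samples>` entry per file. A minimal sketch of building it (fairseq's wav2vec_manifest.py does essentially the same thing; the paths here are placeholders and soundfile is assumed to be installed):

```python
import os
import soundfile as sf

root = "/data/my_corpus"                          # hypothetical root of the raw .wav files
with open("train.tsv", "w") as out:
    out.write(root + "\n")                        # first line: the audio root directory
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            n_frames = sf.info(path).frames       # number of samples in the file
            rel = os.path.relpath(path, root)
            out.write(f"{rel}\t{n_frames}\n")     # one "<relative path>\t<num samples>" per line
```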

StephennFernandes commented 2 years ago

@Ajyy, I see that you have uploaded SpeechT5 to your Hugging Face account. Is it functional yet?

StephennFernandes commented 2 years ago

@Ajyy, by any chance would you be releasing SpeechT5 on Hugging Face?

StephennFernandes commented 1 year ago

@Ajyy

Hi, can you point me to any resources or links for extracting HuBERT labels on my custom dataset and preparing the speaker embeddings? FYI, I have a multilingual dataset.

Ajyy commented 1 year ago

Hi, you can find more details here about extracting HuBERT labels on your dataset, and here for more information about how to extract the speaker embeddings.

Sorry for the late reply.
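As a rough illustration of the feature-dumping part of the label extraction (the fairseq recipe uses its own dump scripts; the sketch below does the equivalent with the Hugging Face HubertModel, and the checkpoint name, layer index, and file paths are only examples):

```python
import torch
import soundfile as sf
from transformers import HubertModel

LAYER = 6  # which transformer layer to dump (layer choice is discussed later in this thread)
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = sf.read("sample.wav")                        # expects 16 kHz mono audio
input_values = torch.tensor(wav, dtype=torch.float32).unsqueeze(0)
with torch.no_grad():
    out = model(input_values, output_hidden_states=True)
feats = out.hidden_states[LAYER].squeeze(0)            # (num_frames, hidden_dim) features
torch.save(feats, "sample.layer6.pt")                  # later clustered with k-means to get labels
```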

StephennFernandes commented 1 year ago

Hey @Ajyy, thanks a ton for replying back! It means a lot.

StephennFernandes commented 1 year ago

@Ajyy

I have noticed that there is no multilingual HuBERT model available for HuBERT label extraction. So should I use the existing English HuBERT model? Would this affect my multilingual SpeechT5 model's performance during pretraining and finetuning?

Ajyy commented 1 year ago

Hi, I think you can try using mHuBERT for multilingual SpeechT5. The English HuBERT model may affect the performance.

StephennFernandes commented 1 year ago

@Ajyy thanks a ton, I am using mHuBERT now for extracting the HuBERT features. By the way, which layer is it ideal to extract those features from: 6, 11, or something else?

StephennFernandes commented 1 year ago

@Ajyy after obtaining the wav2vec2 manifest, MFCC features, and HuBERT features, I am only left with obtaining x-vectors on the pretraining data to move forward with SpeechT5 training. But how can I obtain x-vectors for my training data? I tried #16 but that's not working.

Ajyy commented 1 year ago

@Ajyy thanks a ton, I am using mHuBERT now for extracting the HuBERT features. By the way, which layer is it ideal to extract those features from: 6, 11, or something else?

It should be 6 for HuBERT, and 11 for mHuBERT. Please refer to the original papers.

Ajyy commented 1 year ago

@Ajyy after obtaining the wav2vec2 manifest, MFCC features, and HuBERT features, I am only left with obtaining x-vectors on the pretraining data to move forward with SpeechT5 training. But how can I obtain x-vectors for my training data? I tried #16 but that's not working.

Please try to read and understand the scripts provided by @mechanicalsea. You will need to change them a little bit for your own dataset.
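For reference, a minimal sketch of extracting one x-vector per utterance with SpeechBrain's pretrained VoxCeleb model; this is one workable option, not necessarily the exact model used in that script, and the file names are placeholders:

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# pretrained VoxCeleb x-vector model (one possible choice of speaker encoder)
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

wav, sr = torchaudio.load("speaker_utt.wav")                  # expects 16 kHz mono audio
with torch.no_grad():
    emb = classifier.encode_batch(wav)                        # shape (1, 1, 512)
emb = torch.nn.functional.normalize(emb.squeeze(), dim=-1)    # optional L2 normalisation
torch.save(emb, "speaker_utt.xvector.pt")
```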

StephennFernandes commented 1 year ago

@Ajyy okay got it!

thanks a ton

StephennFernandes commented 1 year ago

@Ajyy when extracting HuBERT labels, what is the ideal n_cluster value to set?
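For context, this is roughly the k-means step I mean (along the lines of fairseq's simple_kmeans recipe; the cluster count and file names below are only placeholders):

```python
import numpy as np
import joblib
from sklearn.cluster import MiniBatchKMeans

N_CLUSTERS = 500   # example value only; the HuBERT paper uses 100 clusters for MFCC features
                   # and 500 for 6th-layer HuBERT features, so this depends on the setup

feats = np.load("all_train_feats.npy")           # (num_frames, feat_dim), stacked over utterances
km = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=10000, n_init=20)
km.fit(feats)
joblib.dump(km, f"km_{N_CLUSTERS}.bin")          # reuse the fitted model to label train/valid sets

# label each utterance and write one line of space-separated labels per utterance
with open("train.km", "w") as out:
    for utt_file in ["utt1_feats.npy", "utt2_feats.npy"]:   # hypothetical per-utterance files
        labels = km.predict(np.load(utt_file))
        out.write(" ".join(map(str, labels)) + "\n")
```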

StephennFernandes commented 1 year ago

@Ajyy I'm also slightly confused about the fairseq-preprocess script. Could you show me how to work with it? I have large train.txt and valid.txt files, which I have encoded with spm.model into train_encoded.txt and valid_encoded.txt.

But I am confused about what dictionary creation and the .bin files mean, and how to feed them into fairseq-preprocess.

Ajyy commented 1 year ago

@Ajyy I'm also slightly confused about the fairseq-preprocess script. Could you show me how to work with it? I have large train.txt and valid.txt files, which I have encoded with spm.model into train_encoded.txt and valid_encoded.txt.

But I am confused about what dictionary creation and the .bin files mean, and how to feed them into fairseq-preprocess.

You can check fairseq's preprocessing for language models.

After preprocessing with fairseq-preprocess, you will get a .bin and an index (.idx) file for your dataset, which can be read much faster than a plain txt file.
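Roughly, the call could look like this (only a sketch; the destination directory, worker count, and use of subprocess are just for illustration):

```python
import subprocess

# language-model style binarisation: --only-source, no target side
subprocess.run([
    "fairseq-preprocess",
    "--only-source",
    "--trainpref", "train_encoded.txt",       # the SentencePiece-encoded files from above
    "--validpref", "valid_encoded.txt",
    "--destdir", "data-bin/text_pretrain",    # hypothetical output directory
    "--workers", "8",
], check=True)

# Without --srcdict, fairseq-preprocess builds dict.txt from the training data;
# add e.g. "--srcdict", "dict.txt" to reuse an existing dictionary instead.
# The destdir ends up with train.bin/train.idx and valid.bin/valid.idx, which the
# training code memory-maps instead of re-reading the raw text.
```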