microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License
1.09k stars, 113 forks

British English TTS model #69

Closed: omega3 closed this issue 2 months ago

omega3 commented 5 months ago

I installed it on Linux from here: https://huggingface.co/microsoft/speecht5_tts and ran it with the script provided in point 3, "Run inference via the Transformers modelling code".

Is there a ready-to-use MS TTS model for British English?

If yes, please tell me how to change the script to get it to work.

All these model names and descriptions are very technical. Non-technical users like me don't know what they are about. I just want to use offline TTS. I am happy that you share it, but please give more description of how to use it.

Additional question: does this speecht5_tts English voice have anything in common with the free SAPI 5 voices?

nmstoker commented 2 months ago

This seems to me slightly more of an HF code question than a SpeechT5 issue, but I think I can help, so I'm dropping a few pointers below 🙂

Not sure if you noticed, but the page you linked refers to speaker embeddings (xvectors), and the example code uses one particular entry from the Matthijs/cmu-arctic-xvectors dataset. That embedding is what sets the general quality and character of the output speech.

The example code uses embeddings_dataset[7306], but if you switch to another index you'll get other speakers from the dataset. There is a Scottish speaker (i.e. British) in there; I don't recall the ID range you need offhand. Note that the IDs are not per speaker: I think they're per xvector (per recording), and there are several from each speaker, so 7305 is the same speaker as 7306, although the quality and style can vary a little. Exploring the dataset on HF (via the link above) will help you find suitable IDs more quickly, as each record's filename identifies the speaker. For the Scottish speaker, look for "awb" in the filename.
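As a rough sketch of that lookup, the snippet below searches the dataset's filenames for "awb" and feeds the first match into the same inference steps as the model card. It assumes the dataset exposes `filename` and `xvector` columns in a `validation` split, as the model card example suggests; the helper names `find_speaker_indices` and `synthesize` are made up for illustration.

```python
def find_speaker_indices(filenames, speaker_tag="awb"):
    """Return the positions of all filenames that contain speaker_tag."""
    return [i for i, name in enumerate(filenames) if speaker_tag in name]


def synthesize(text, out_path="speech.wav", speaker_tag="awb"):
    """Generate speech for `text` using the first xvector matching speaker_tag."""
    # Heavy imports stay local so the helper above can be used without them.
    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    # Assumption: split and column names follow the model card example.
    embeddings_dataset = load_dataset(
        "Matthijs/cmu-arctic-xvectors", split="validation"
    )
    indices = find_speaker_indices(embeddings_dataset["filename"], speaker_tag)
    if not indices:
        raise ValueError(f"No xvector filename contains {speaker_tag!r}")

    # Use the first matching recording; other matches are the same speaker
    # but may differ slightly in quality or style.
    xvector = embeddings_dataset[indices[0]]["xvector"]
    speaker_embeddings = torch.tensor(xvector).unsqueeze(0)

    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_ids"], speaker_embeddings, vocoder=vocoder
    )
    sf.write(out_path, speech.numpy(), samplerate=16000)
    return indices[0]
```

Calling `synthesize("Hello from Scotland.")` should write `speech.wav` and return the index it used. If the result sounds off, trying a later entry from the list of matching indices may give a cleaner take of the same speaker.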

If you want other accents that are not in the embeddings dataset, you can search for other xvectors online and use those, as per the comment in the code:

# You can replace this embedding with your own as well.

YMMV, but a bit of Googling should work for this, and there's most likely software for extracting xvectors from audio samples.
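One option that may work for extracting an xvector from your own recording is SpeechBrain's pretrained speaker encoder. The sketch below assumes the `speechbrain/spkrec-xvect-voxceleb` model, which produces 512-dimensional xvectors, plus `torchaudio` for loading audio; the function name `create_speaker_embedding` is illustrative, and whether the result suits SpeechT5 for a given accent is something to check by ear.

```python
SPEAKER_MODEL = "speechbrain/spkrec-xvect-voxceleb"  # assumed encoder choice


def create_speaker_embedding(wav_path):
    """Return a (1, 512) xvector tensor for the recording at wav_path."""
    import torch
    import torchaudio

    try:
        # Import path in SpeechBrain 1.x
        from speechbrain.inference.classifiers import EncoderClassifier
    except ImportError:
        # Import path in older SpeechBrain releases
        from speechbrain.pretrained import EncoderClassifier

    classifier = EncoderClassifier.from_hparams(source=SPEAKER_MODEL)

    signal, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    if sample_rate != 16000:
        # The encoder was trained on 16 kHz audio.
        signal = torchaudio.functional.resample(signal, sample_rate, 16000)
    signal = signal.mean(dim=0, keepdim=True)  # downmix to mono: (1, samples)

    with torch.no_grad():
        embedding = classifier.encode_batch(signal)  # (1, 1, 512)
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze(1)  # (1, 512)
```

The returned tensor can be passed as the speaker embedding in place of the dataset entry, i.e. as the second argument to `model.generate_speech(...)` in the model card script.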