lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch

Using WhisperSpeech Pre-trained Weights for TextToSemantic #49

Closed EomSooHwan closed 5 months ago

EomSooHwan commented 6 months ago

I believe that WhisperSpeech uses SPEAR-TTS.

I want to use the pre-trained weights from the Hugging Face link above, but I'm not sure exactly how to load them.

The config keys for the t2s models are as follows: ["depth", "n_head", "head_width", "ffn_mult", "stoks_width", "ttoks_width", "ttoks_len", "stoks_len", "ttoks_codes", "stoks_codes"]

However, the constructor arguments for TextToSemantic are slightly different, so I'm unsure whether the pre-trained weights can be used with it.
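A quick way to see how far the two sets of names diverge is to compare the checkpoint's config keys against the keyword arguments of the target class. The sketch below is only an illustration: `diff_config_keys` is a helper made up for this example, and `StandInModel` with its arguments is a hypothetical placeholder, not the actual TextToSemantic signature. Swap in the real class to get a meaningful report.

```python
import inspect

# Config keys listed above for the t2s checkpoints.
T2S_CONFIG_KEYS = [
    "depth", "n_head", "head_width", "ffn_mult",
    "stoks_width", "ttoks_width", "ttoks_len", "stoks_len",
    "ttoks_codes", "stoks_codes",
]


def diff_config_keys(config_keys, target_cls):
    """Compare checkpoint config keys with the named arguments of target_cls.__init__."""
    params = inspect.signature(target_cls.__init__).parameters
    skip_kinds = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    accepted = {
        name for name, p in params.items()
        if name != "self" and p.kind not in skip_kinds
    }
    keys = set(config_keys)
    return {
        "shared": sorted(keys & accepted),
        "only_in_config": sorted(keys - accepted),
        "only_in_target": sorted(accepted - keys),
    }


# Hypothetical stand-in (assumption): replace with the real class you import.
class StandInModel:
    def __init__(self, dim, depth, heads=8, dim_head=64):
        pass


report = diff_config_keys(T2S_CONFIG_KEYS, StandInModel)
for group, names in report.items():
    print(f"{group}: {names}")
```

Matching names alone do not guarantee compatible weights. Even where a key is shared, the layer layout and tensor shapes in the checkpoint would still need to line up with the target model's state dict.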

Can anybody help me with this issue?

I first wanted to raise this on the Discussions page, but it seems inactive, so I apologize in advance for posting it here as an issue.