lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
MIT License
562 stars 45 forks

Training Example #25

Open YKoustubhRao opened 10 months ago

YKoustubhRao commented 10 months ago

The training example given seems to be missing the mask vector? In the paper the input to the model was the audio, mask and the phoneme sequence (which was aligned to the audio in the previous implementation of this repo).

So where are the mask vectors and the phoneme sequence used in the training?

Thank You and great appreciation for all you have done.

lucidrains commented 10 months ago

can you screenshot or paste the relevant section of the paper for said mask?

lucidrains commented 10 months ago

I'm introducing spear tts conditioning, proven out in the soundstorm repository, and bypassing duration, phoneme, alignment stuff.

YKoustubhRao commented 10 months ago

[image]

Alright, I will read up on Spear TTS. Could you tell me what the 'cond' variable actually means with respect to an audio clip and its transcript?

And we might have to use a different TTS for other languages for the alignment.

Thank You

YKoustubhRao commented 10 months ago

[image]

lucidrains commented 10 months ago

@YKoustubhRao thanks for the screenshot

i've decided to automatically manage the condition if you pass in the binary temporal mask, described in section 3.2 of the paper, as cond_mask. it will also be auto-generated during training. during inference, you would construct the condition so as to zero out the section you would like to infill
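
For readers following along, here is a minimal, framework-free sketch of the masking idea described above. It is not the repository's code: plain Python lists stand in for tensors, the helper names `random_span_mask` and `apply_cond_mask` are hypothetical, and the convention that `True` marks a frame to be infilled is an assumption made for illustration.

```python
import random

def random_span_mask(num_frames, frac, rng):
    """Hypothetical helper: one contiguous True span covering roughly `frac` of the frames."""
    span = max(1, min(num_frames, round(num_frames * frac)))
    start = rng.randint(0, num_frames - span)  # inclusive on both ends
    return [start <= t < start + span for t in range(num_frames)]

def apply_cond_mask(frames, cond_mask):
    """Hypothetical helper: zero every frame where the mask is True, keep the rest as context."""
    return [[0.0] * len(f) if m else list(f) for f, m in zip(frames, cond_mask)]

# Toy "audio features": 6 frames with 2 features each (all non-zero so masking is visible).
frames = [[float(t), float(t) + 0.5] for t in range(1, 7)]

# Training: the mask is generated automatically (assumed here to be a random span).
rng = random.Random(0)
train_mask = random_span_mask(len(frames), 0.5, rng)
train_cond = apply_cond_mask(frames, train_mask)

# Inference: the caller picks the section to infill (frames 2 and 3 here).
infer_mask = [False, False, True, True, False, False]
infer_cond = apply_cond_mask(frames, infer_mask)
```

In this sketch the unmasked frames act as the context the model conditions on, and the zeroed frames are the region it would be asked to generate.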

lucidrains commented 10 months ago

@YKoustubhRao i will get the phoneme / duration / aligner stuff finished by end of week along with some training code

YKoustubhRao commented 9 months ago

Is there a pipeline for denoising and zero shot tts? @lucidrains

blldd commented 9 months ago

Hello lucidrains, can you share your training script and data preparation code to make it easier to try? Thanks in advance.

kdcyberdude commented 8 months ago

Any updates on this?

nrailg commented 7 months ago

Same question.

lucidrains commented 7 months ago

ah, the code is all in there and @lucasnewman has already trained models successfully. i'll update the readme by end of week

Subarasheese commented 7 months ago

Hello. Will the weights be released?

Thank you

lucasnewman commented 7 months ago

Hey all, there's a small pretrained model available in this discussion thread: https://github.com/lucidrains/voicebox-pytorch/discussions/29#discussioncomment-7732769

All the training code is in the repo and I put the details for the training hyperparams in the thread, so training your own model should be as straightforward as instantiating the models, dataset, and trainer and calling train() -- if you're having issues, report back and I can try to help.

clcarwin commented 7 months ago

@lucasnewman Thanks for your hyperparams and pretrained model. It can achieve acceptable results with a batch size of 32 and 100k steps on a 4090 GPU.

shigabeev commented 6 months ago

Hey, can you send us sound samples?

wassimseif commented 6 months ago

@shigabeev, @lucasnewman has some voice samples in the repo. You should be able to reproduce the same results. If you still need samples, let me know; I might be able to send you some.

shigabeev commented 6 months ago

Yeah, I found his trained model on HF, and it sounds pretty good. However, I wasn't able to figure out how to run it in text-conditioned mode (TTS). Can you show me how to do it? Or can you just send some of your TTS audio samples?