p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

Make it multi-language? #2

Open zidsi opened 8 months ago

zidsi commented 8 months ago

I was wondering if "injecting" language info would be possible, something similar to what xtts does by injecting a special language token, e.g. [en], into the GPT input.

Features from the 3-sec speech prompt might not be enough (nor desired) to capture the language of the sample text (in order to do cross-language speaker cloning). However, concatenating the "speech prompt" with some kind of language id (a precomputed language-features vector?) might enable multi-language (ML) support in addition to multi-speaker (MS).

At inference, changing this prompt part might enable inline language switching.

There might be a better way, of course, e.g. passing the info directly to the encoder PreNet? Anyway, it would be great to see this feature. The VITS-based YourTTS does a similar thing.
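For illustration, here is a minimal PyTorch sketch of the prompt-concatenation idea: a learned per-language embedding is prepended to the 3-sec speech-prompt features before the prompt encoder. All names (`LanguageConditionedPrompt`, `n_langs`, `prompt_feats`, `lang_id`) are hypothetical and not from this repo:

```python
import torch
import torch.nn as nn

class LanguageConditionedPrompt(nn.Module):
    """Hypothetical sketch: prepend a learned language embedding to the
    speech-prompt mel features, so the prompt encoder sees the language
    id as one extra "frame" ahead of the acoustic prompt."""

    def __init__(self, n_langs: int, prompt_dim: int):
        super().__init__()
        # one learned vector per language, sized to match the prompt features
        self.lang_emb = nn.Embedding(n_langs, prompt_dim)

    def forward(self, prompt_feats: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (batch, prompt_dim, n_frames) 3-sec mel prompt
        # lang_id:      (batch,) integer language index, e.g. 0 for [en]
        lang = self.lang_emb(lang_id).unsqueeze(-1)  # (batch, prompt_dim, 1)
        return torch.cat([lang, prompt_feats], dim=-1)
```

At inference, keeping the same speech prompt while swapping `lang_id` would be one way to realize the inline language switching mentioned above.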

p0p4k commented 8 months ago

I think it is possible to do it. I'll do it after I am sure this version of the model works for at least one language.

zidsi commented 8 months ago

The LJSpeech sample sounds promising. Will you be able to reuse the weights for multi-speaker (VCTK?) training? If "yes", I'll start training on a single-speaker (non-English) dataset.

p0p4k commented 8 months ago

Yes, can reuse.

zidsi commented 8 months ago

According to RADMMM, the title of this issue/wish should be "Make it multi-accented". The authors say: "We refer to our conditioning as accent instead of language, because we consider language to be implicit in the phoneme sequence." But let's first see how well the 3-sec conditioning works for multi-speaker.

p0p4k commented 8 months ago

True. I am doing multi-speaker training on my end as well; let's see if the generations are good enough without extra conditioning first. Good luck!

vuong-ts commented 7 months ago

Does the multi-speaker (VCTK) training look good, @p0p4k?

rafaelvalle commented 7 months ago

VCTK should work, but it should be easier to fit LibriTTS. The main issue with VCTK is that there's a lot of silence at the beginning and end of some samples, and automatic trimming methods are normally not accurate and end up trimming phonemes. Accent and language control should be possible with one-hot embeddings. VCTK and CML-Dataset are great candidates.
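As a rough sketch of the one-hot embedding idea (an assumption about how it might be wired in, not code from this repo), a learned accent embedding could simply be broadcast-added to the text-encoder hidden states:

```python
import torch
import torch.nn as nn

class AccentConditioning(nn.Module):
    """Illustrative sketch: one learned embedding per accent (the dense
    equivalent of a one-hot code) is added to every text-encoder hidden
    state. Names here are hypothetical."""

    def __init__(self, n_accents: int, hidden_dim: int):
        super().__init__()
        self.accent_emb = nn.Embedding(n_accents, hidden_dim)

    def forward(self, enc_hidden: torch.Tensor, accent_id: torch.Tensor) -> torch.Tensor:
        # enc_hidden: (batch, seq_len, hidden_dim) text-encoder outputs
        # accent_id:  (batch,) integer accent index
        return enc_hidden + self.accent_emb(accent_id).unsqueeze(1)
```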

p0p4k commented 7 months ago

LibriTTS sounds like this @ 200k steps with guided sampling - https://voca.ro/1e0tSbWgbyuu

rishikksh20 commented 7 months ago

@p0p4k the sample sounds good; I think with more training it will get a lot better. Multi-linguality should be easy to implement in this repo. I think the problem occurs when you use a prompt from a native speaker of one language and generate speech in another language.

p0p4k commented 7 months ago

On another note, could adding some noise to the prompt help the model extract the "voice" better? I tried a zero-shot voice clone and it didn't perform that well.
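For concreteness, the noise idea could be as simple as a Gaussian perturbation of the prompt during training (a sketch under the assumption that the prompt is a mel spectrogram; `noise_std` is a made-up hyperparameter):

```python
import torch

def add_prompt_noise(prompt_mel: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    # prompt_mel: (batch, n_mels, n_frames) speech prompt
    # Perturbing the prompt may keep the prompt encoder from relying on
    # exact spectral detail, pushing it toward a more robust "voice" code.
    return prompt_mel + noise_std * torch.randn_like(prompt_mel)
```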