r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Ask about Phoneme Segmentation and Phoneme Duration #12

Closed toannhu closed 6 years ago

toannhu commented 6 years ago

Hi, @r9y9. First of all, thank you for such a brilliant implementation of WaveNet. I'm now studying how to detect phoneme durations (the start and end time of each phoneme) from audio and align them with linguistic features, but I don't know how to do this. Can you share the idea you used to solve this problem, and point me to the code in this repo that does this job? Thanks!


P/s: Btw, is it possible to train this repo on another language? Currently I'm working with Vietnamese, using my own dataset (7 hours of audio and ARPABET linguistic features extracted from text).

r9y9 commented 6 years ago

The repository focuses on the WaveNet vocoder, as the name says. It doesn't provide phoneme duration estimation or linguistic feature extraction, which are needed to replicate the original WaveNet-based TTS. The vocoder can take an arbitrary type of input, though, as long as the time resolution is adjusted; a sketch of what that adjustment means follows below.
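
For illustration only, here is a minimal sketch of that time-resolution adjustment using nearest-neighbor upsampling; the repository itself uses learned upsampling layers, and the shapes and `hop_length` value below are made up for the example:

```python
import numpy as np

# Hypothetical illustration: repeat each conditional feature frame so that
# one feature vector aligns with each audio sample. (The repository uses
# learned upsampling for this; values here are placeholders.)
hop_length = 256                          # assumed frame shift in audio samples
c = np.random.randn(100, 80)              # (n_frames, feature_dim), e.g. 80-dim mel
c_up = np.repeat(c, hop_length, axis=0)   # (n_frames * hop_length, feature_dim)
# c_up[t] now conditions audio sample t.
```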

Linguistic feature extraction (a.k.a. the text processing frontend) is the hard part of TTS, and it often requires deep knowledge of the target language. The WaveNet vocoder itself is language independent, but you will have to implement a text processing frontend if you want to condition the model on linguistic features.

imdatceleste commented 6 years ago

@toannhu, you might be interested in Aeneas if you are only looking for phoneme detection. It is not designed for phonemes, but by adjusting various parameters it might help you understand how to do what you want.

toannhu commented 6 years ago

@r9y9 @imdatsolak Thanks for the support. I have found the Montreal Forced Aligner tool, which helps me with this problem. As far as I can see in this repo, @r9y9 uses another library, nnmnkwii, for the frontend things. Please excuse my ignorance, but can you explain what input (after the frontend processing) is fed to the WaveNet vocoder for local conditioning? It would be very helpful to know the basic idea of how the WaveNet vocoder works; I'm really confused reading this repo's code. Once again, thank you!

r9y9 commented 6 years ago

There's no text processing frontend used in the repository. nnmnkwii does have functionality to extract linguistic features from HTS-style context labels, though. In this repository, nnmnkwii is mostly used for preprocessing, e.g. mulaw or inv_mulaw. https://r9y9.github.io/nnmnkwii/latest/references/preprocessing.html
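
For reference, a minimal sketch of both nnmnkwii uses mentioned above; the file names are placeholders, and the mu/label values are illustrative rather than this repository's exact settings:

```python
import numpy as np
from nnmnkwii import preprocessing as P
from nnmnkwii.io import hts
from nnmnkwii.frontend import merlin as fe

# mu-law quantization of a waveform, as used for preprocessing here.
x = np.sin(np.linspace(0, 100, 16000))  # float waveform in [-1, 1]
y = P.mulaw_quantize(x, 255)            # ints in [0, 255], i.e. 256 classes
x_hat = P.inv_mulaw_quantize(y, 255)    # back to approximate floats

# Linguistic feature extraction from HTS-style full-context labels
# (available in nnmnkwii, but not used by this repository).
# "questions.hed" and "utt0001.lab" are placeholder paths.
binary_dict, continuous_dict = hts.load_question_set("questions.hed")
labels = hts.load("utt0001.lab")
features = fe.linguistic_features(labels, binary_dict, continuous_dict,
                                  add_frame_features=True,
                                  subphone_features="full")
```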

The WaveNet class in the repository doesn't assume any particular domain for the conditional features, but the training / preprocessing scripts are written assuming a mel-spectrogram is used as the conditional feature.
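
For instance, a conditional feature stream in that spirit could be computed with librosa; the parameter values below are illustrative defaults, not necessarily the hyperparameters this repository uses:

```python
import numpy as np
import librosa

# Compute a mel-spectrogram to serve as the local conditioning feature.
# "sample.wav" is a placeholder; n_fft / hop_length / n_mels are assumptions.
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel).astype(np.float32)
# mel_db.shape == (80, n_frames); each frame covers hop_length samples,
# so the features must be upsampled 256x to align with the waveform.
```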

toannhu commented 6 years ago

@r9y9 Thanks for enlightening me. I finally got the key idea. One more question: is it possible to use this WaveNet vocoder repo with Tacotron? Do you plan to do that in the future? Any ideas or suggestions?

r9y9 commented 6 years ago

It's definitely possible. A Tacotron2-like WaveNet vocoder is WIP at https://github.com/r9y9/deepvoice3_pytorch/pull/21.

See also https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-359182424.

r9y9 commented 6 years ago

Tacotron + WaveNet is done.

toannhu commented 6 years ago

@r9y9 Thanks. I succeeded in training Rayhane-mamah's Tacotron 2 repo with my own corpus. I'm going to try generating with your WaveNet vocoder; I'm eager to hear the result. This is very exciting!

toannhu commented 6 years ago

@r9y9 I tried to integrate Rayhane-mamah's Tacotron 2 with the WaveNet vocoder using your Google Colab code, but it failed: the pitch is lost. Btw, I used batch_size = 16 and r = 2 in Tacotron 2, and batch_size = 1 in your repo; everything else is the default. The WaveNet repo was trained on the original audio, not on GTA (ground-truth-aligned) features. Here are the results: sound.zip