neosapience / mlp-singer

Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis (IEEE MLSP 2021)

How to modify the encoder if the dataset is in English #5

Closed zhangsanfeng86 closed 2 years ago

zhangsanfeng86 commented 2 years ago

Hi @jaketae, if the training dataset is English, how should I modify the "encoder"? https://github.com/neosapience/mlp-singer/blob/b6a546fc6fbeb17a6220320a596ea1542ee3e509/data/g2p.py#L162

zhangsanfeng86 commented 2 years ago

What is the meaning of "phoneme.onset, phoneme.nucleus, phoneme.coda"? https://github.com/neosapience/mlp-singer/blob/b6a546fc6fbeb17a6220320a596ea1542ee3e509/data/preprocess.py#L91

jaketae commented 2 years ago

Hello @zhangsanfeng86, thanks for opening this issue.

The code you've referenced only works for Korean. The preprocessing code basically divides each syllable into three components: onset, nucleus, and coda. There are plenty of resources, such as this Wiki entry, that explain them much better than I ever would, but to give you a high-level overview, it's a method of decomposing the word "cat" into "c" (onset), "a" (nucleus), and "t" (coda).
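For illustration only, here is a rough sketch (not the repository's exact code in `data/preprocess.py`) of how a precomposed Hangul syllable can be split into onset, nucleus, and coda using Unicode block arithmetic:

```python
# Illustrative sketch: decompose a Hangul syllable into (onset, nucleus, coda)
# romanizations via the layout of the Unicode Hangul Syllables block.
ONSETS = [
    "g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s", "ss", "", "j",
    "jj", "ch", "k", "t", "p", "h",
]
NUCLEI = [
    "a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae", "oe",
    "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i",
]
CODAS = [
    "", "g", "kk", "gs", "n", "nj", "nh", "d", "l", "lg", "lm", "lb", "ls",
    "lt", "lp", "lh", "m", "b", "bs", "s", "ss", "ng", "j", "ch", "k", "t",
    "p", "h",
]

def decompose(syllable: str):
    """Split one Hangul syllable into (onset, nucleus, coda)."""
    offset = ord(syllable) - 0xAC00           # precomposed syllables start at U+AC00
    onset = ONSETS[offset // (21 * 28)]       # 21 nuclei * 28 codas per onset
    nucleus = NUCLEI[(offset % (21 * 28)) // 28]
    coda = CODAS[offset % 28]
    return onset, nucleus, coda

print(decompose("강"))  # -> ('g', 'a', 'ng')
```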

Back to your original question about English: you would need a different grapheme-to-phoneme (g2p) algorithm to normalize and preprocess the text. This isn't strictly required, but it helps a lot with accurate pronunciation. There are many open-source TTS implementations with normalization and g2p pipelines, such as Tacotron 2.
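For example, one readily available English g2p front end is the open-source `g2p_en` package, which maps raw text to ARPAbet phonemes. This is only an illustration of what an English g2p step could look like; it is not wired into this repository, and the rest of the preprocessing would still need to be adapted to consume these phonemes instead of Korean jamo:

```python
# Illustration: English grapheme-to-phoneme conversion with the g2p_en package.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("singing voice synthesis")
print(phonemes)
# roughly: ['S', 'IH1', 'NG', 'IH0', 'NG', ' ', 'V', 'OY1', 'S', ' ',
#           'S', 'IH1', 'N', 'TH', 'AH0', 'S', 'AH0', 'S']
```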

Please let me know if you have any more questions or comments!

zhangsanfeng86 commented 2 years ago

Thank you for your reply. Another question: why is train_batch_size = 384? Usually batch_size is set to 16 or 32.

jaketae commented 2 years ago

384 happened to be the largest batch size I could fit in my work environment (24 GB of VRAM). Feel free to modify it as you see fit.
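For reference, a rough way to find the largest batch size that fits on a given GPU is to start high and halve on out-of-memory errors. This is only a sketch with placeholder `build_loader` / `train_one_step` functions, not part of this repository:

```python
# Hypothetical sketch: probe for the largest batch size that fits in GPU memory.
# `build_loader` and `train_one_step` stand in for the project's own loader and
# training step.
import torch

def find_max_batch_size(build_loader, train_one_step, start=384):
    batch_size = start
    while batch_size >= 1:
        try:
            loader = build_loader(batch_size)
            train_one_step(next(iter(loader)))  # run one trial step
            return batch_size                   # this batch size fits
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise                           # unrelated error, re-raise
            torch.cuda.empty_cache()
            batch_size //= 2                    # retry with a smaller batch
    raise RuntimeError("Even batch_size=1 does not fit on this GPU.")
```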