Closed · zhangsanfeng86 closed this issue 2 years ago
What is the meaning of `phoneme.onset`, `phoneme.nucleus`, and `phoneme.coda`? https://github.com/neosapience/mlp-singer/blob/b6a546fc6fbeb17a6220320a596ea1542ee3e509/data/preprocess.py#L91
Hello @zhangsanfeng86, thanks for opening this issue.
The code you've referenced can only be used for Korean. The preprocessing code divides each syllable into three components: onset, nucleus, and coda. There are plenty of resources, such as this Wiki entry, that explain them far better than I could, but to give you a high-level overview: it's a way of decomposing a syllable like "cat" into "c" (onset), "a" (nucleus), and "t" (coda).
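For Korean specifically, this decomposition is deterministic because precomposed Hangul syllables are laid out arithmetically in Unicode. This is not the repo's exact code, just a minimal sketch of the standard Unicode arithmetic behind the onset/nucleus/coda split:

```python
# Standard Unicode arithmetic for Hangul syllable decomposition.
# Precomposed syllables occupy U+AC00..U+D7A3, ordered by onset, nucleus, coda.
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                      # 19 choseong
NUCLEI = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                  # 21 jungseong
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 jongseong, index 0 = no coda

def decompose(syllable: str) -> tuple:
    """Split a single precomposed Hangul syllable into (onset, nucleus, coda)."""
    index = ord(syllable) - 0xAC00
    onset, rest = divmod(index, 21 * 28)
    nucleus, coda = divmod(rest, 28)
    return ONSETS[onset], NUCLEI[nucleus], CODAS[coda]

print(decompose("한"))  # ('ㅎ', 'ㅏ', 'ㄴ')
print(decompose("사"))  # ('ㅅ', 'ㅏ', '') — open syllable, empty coda
```

Because every Korean syllable factors cleanly this way, no pronunciation dictionary is needed for the split itself, which is why the preprocessing step is Korean-only.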
Back to your original question about English: you would need a different grapheme-to-phoneme (g2p) algorithm to normalize and preprocess the text. This isn't technically required, but it helps a lot with accurate pronunciation. Many open-source TTS implementations, such as Tacotron 2, include normalization and g2p pipelines.
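To illustrate what an English g2p step does, here is a toy dictionary-backed lookup. The lexicon entries and the fallback are hypothetical and not part of this repo; real pipelines use a full pronunciation dictionary such as CMUdict plus a trained model for out-of-vocabulary words:

```python
# Toy dictionary-backed g2p: normalize text, then look up each word's
# phoneme sequence (ARPAbet-style symbols). Hypothetical mini-lexicon.
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "sing": ["S", "IH1", "NG"],
}

def g2p(text: str) -> list:
    """Map text to a flat phoneme list; spell out unknown words as a fallback."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes

print(g2p("cat sing"))  # ['K', 'AE1', 'T', 'S', 'IH1', 'NG']
```

Unlike the Korean case, English spelling doesn't map to sounds arithmetically, which is why a lexicon or learned model is needed.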
Please let me know if you have any more questions or comments!
Thank you for your reply. Another question: why is train_batch_size set to 384? Usually batch_size is set to 16 or 32.
384 happened to be the largest batch size that I could fit on my work environment (24GB of VRAM). Feel free to modify it as you see fit.
Hi @jaketae, if the training dataset is English, how should the "encoder" be modified? https://github.com/neosapience/mlp-singer/blob/b6a546fc6fbeb17a6220320a596ea1542ee3e509/data/g2p.py#L162