Closed Liujingxiu23 closed 3 years ago
Hey @Liujingxiu23, thanks for filing this issue, and apologies for the belated reply.
data/dsp/core.py
for mel extraction, but I think either should be fine. In retrospect, it might even make more sense to use HiFi-GAN's codebase if you are going to use HiFi-GAN as the vocoder. But they should be doing the same thing, as NVIDIA Tacotron's mel-spectrograms are perfectly compatible with HiFi-GAN checkpoints. As a sanity check, I'd just make sure that mel-spectrogram values lie within some plausible range, i.e. around [-11, 2].
Let me know if there are any lingering questions!
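For the range sanity check mentioned above, a minimal sketch might look like the following. The function name `mel_range_report` and the bounds `low=-12.0` and `high=3.0` are assumptions made for illustration (slightly wider than the approximate [-11, 2] range) and are not part of either codebase. The input here is synthetic, so swap in a mel-spectrogram produced by whichever extraction script you use.

```python
import numpy as np


def mel_range_report(mel, low=-12.0, high=3.0):
    """Summarize a log-mel spectrogram and flag values outside [low, high].

    `low` and `high` are illustrative bounds (assumptions), chosen to be
    slightly wider than the approximate [-11, 2] range quoted in the thread.
    """
    mel = np.asarray(mel, dtype=np.float64)
    all_finite = bool(np.isfinite(mel).all())
    return {
        "min": float(mel.min()),
        "max": float(mel.max()),
        "all_finite": all_finite,
        # Only report "in range" when every value is finite and within bounds.
        "in_range": bool(all_finite and mel.min() >= low and mel.max() <= high),
    }


# Synthetic stand-in for an extracted mel-spectrogram (80 bins x 200 frames).
rng = np.random.default_rng(0)
fake_mel = rng.uniform(-11.0, 2.0, size=(80, 200))
report = mel_range_report(fake_mel)
print(report)

# A linear-magnitude (non-log) spectrogram would typically fail the check.
bad = mel_range_report(np.exp(fake_mel) * 100.0)
print(bad)
```

If `in_range` comes back `False`, a likely cause is a mismatch in log compression or normalization between the extraction script and what the vocoder checkpoint expects.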
Thank you for your reply.
I have a question about the SVS dataset. In the CSD dataset, one syllable corresponds to one note, right? Is this because the songs have relatively simple melodies? For common songs, for example pop music, one syllable may correspond to several notes, right?
Hey @Liujingxiu23,
I tried using "hifi-gan/meldataset.py" to extract mels, and waveforms could then be synthesized successfully, so I am using this code.
Great!
I tried training on the English data in CSD. I set length_c=5 because, when I analyzed an English TTS dataset, the average consonant duration was 5. The synthesized songs are of similar quality to the Korean ones. I have not tried Chinese since I have not found any available dataset.
Very interesting, thanks for the confirmation. Glad to hear that English somewhat worked.
In the CSD dataset, one syllable corresponds to one note, right? Is this because the songs have relatively simple melodies? For common songs, for example pop music, one syllable may correspond to several notes, right?
Yes, you're right. Normally, it's not the case that one syllable equals one note. CSD was annotated very specifically to satisfy this constraint. So if you want to apply this to pop songs, there would be some hurdles.
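To see how far a new dataset is from this one-syllable-one-note constraint, a rough check could look like the sketch below. The annotation layout, a time-ordered list of `(syllable_index, midi_pitch)` pairs, and the function name `find_multi_note_syllables` are hypothetical and do not reflect CSD's actual file format.

```python
from collections import Counter


def find_multi_note_syllables(notes):
    """Return the indices of syllables that are sung over more than one note.

    `notes` is a time-ordered list of (syllable_index, midi_pitch) pairs.
    This layout is a hypothetical one chosen for illustration.
    """
    counts = Counter(syllable_index for syllable_index, _ in notes)
    return sorted(idx for idx, n in counts.items() if n > 1)


# Every syllable has exactly one note, as in a CSD-style annotation.
one_to_one = [(0, 60), (1, 62), (2, 64)]

# Syllable 1 is held across two pitches, as often happens in pop songs.
melismatic = [(0, 60), (1, 62), (1, 64), (2, 65)]

print(find_multi_note_syllables(one_to_one))
print(find_multi_note_syllables(melismatic))
```

Any syllable flagged this way would need to be re-annotated or split before the data fits the one-to-one assumption.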
Closing this for now. Please feel free to open another issue if you have any further questions!
Thank you for your great work and for sharing it. I am a beginner in SVS. I have two questions: