neosapience / mlp-singer

Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis (IEEE MLSP 2021)
MIT License

questions about mels feature and english durations #2

Closed Liujingxiu23 closed 3 years ago

Liujingxiu23 commented 3 years ago

Thank you for your great work and for sharing it. I am a beginner in SVS. I have two questions:

  1. Mel feature extraction: for training the MLP-based acoustic model, is "data/dsp/core.py" used to extract mels? For training the HiFi-GAN vocoder, is "hifi-gan/meldataset.py" used to extract mels? The two pieces of code are quite different; which one did you use?
  2. Duration: for Korean, you used 3 frames for the onset and coda and gave the remaining frames to the vowel. Do you have any experience with, or suggestions for, other languages, for example English or Chinese?
jaketae commented 3 years ago

Hey @Liujingxiu23, thanks for filing this issue, and apologies for the belated reply.

  1. We used data/dsp/core.py for mel extraction, but I think either should be fine. In retrospect, it might even make more sense to use HiFi-GAN's codebase if you are going to use HiFi-GAN as the vocoder. But they should be doing the same thing, as NVIDIA Tacotron's mel-spectrograms are perfectly compatible with HiFi-GAN checkpoints. As a sanity check, I'd just make sure that mel-spectrogram values lie within some plausible range, e.g. around [-11, 2].
  2. We didn't use this model to run experiments on other languages. However, my intuition is that a similar setup should still work. Namely, if you can decompose a word into syllables, then further break down each syllable into an onset, nucleus, and coda, the 3-rest-3 setup seems fine. For example, given 汉, you could give it 3 h's, 3 n's, and all a's in between.
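A minimal sketch of the range check suggested in point 1, assuming a log-mel spectrogram passed as a 2-D list; the function name and return shape are illustrative, not code from this repo:

```python
def check_mel_range(mel, low=-11.0, high=2.0):
    """Return (in_range, min, max) for a 2-D log-mel spectrogram.

    Tacotron-style log-mels typically fall roughly within [-11, 2];
    values far outside that range suggest a mismatched extraction
    pipeline (e.g. missing log compression or different normalization).
    """
    values = [v for frame in mel for v in frame]  # flatten frames
    lo, hi = min(values), max(values)
    return (lo >= low and hi <= high), lo, hi
```

For example, `check_mel_range([[-5.0, 0.0], [-11.0, 1.5]])` reports the spectrogram as in range, while a stray value of 3.0 would flag it.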
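The 3-rest-3 split from point 2 could be sketched as below; the function name and signature are hypothetical (the actual implementation lives in the repo's preprocessing code), but the logic matches the described rule: a fixed number of frames for onset and coda, and the nucleus absorbing the rest:

```python
def assign_frames(syllable, total_frames, boundary=3):
    """Expand an (onset, nucleus, coda) triple into a frame-level
    phoneme sequence: `boundary` frames each for onset and coda,
    with the nucleus (vowel) filling the remaining frames."""
    onset, nucleus, coda = syllable
    if total_frames < 2 * boundary:
        raise ValueError("note too short for the fixed boundary length")
    return (
        [onset] * boundary
        + [nucleus] * (total_frames - 2 * boundary)
        + [coda] * boundary
    )
```

With the 汉 example above, `assign_frames(("h", "a", "n"), 10)` yields three h's, four a's, and three n's; for English, a different `boundary` (such as the 5 frames mentioned later in this thread) drops in directly.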

Let me know if there are any lingering questions!

Liujingxiu23 commented 3 years ago

Thank you for your reply.

  1. I tried using "hifi-gan/meldataset.py" to extract mels, and waves could then be synthesized successfully, so I used that code.
  2. I tried training on the English data in CSD. I set length_c=5 because, when I analyzed an English TTS dataset, the average consonant duration was 5 frames. The synthesized songs are of similar quality to the Korean ones. I have not tried Chinese, since I have not found any available dataset.

I have a question about the SVS dataset. In the CSD dataset, one syllable corresponds to one note, right? Is this because the songs have relatively simple melodies? In common songs, for example pop music, one syllable may correspond to several notes, right?

jaketae commented 3 years ago

Hey @Liujingxiu23,

I tried using "hifi-gan/meldataset.py" to extract mels, and waves could then be synthesized successfully, so I used that code.

Great!

I tried training on the English data in CSD. I set length_c=5 because, when I analyzed an English TTS dataset, the average consonant duration was 5 frames. The synthesized songs are of similar quality to the Korean ones. I have not tried Chinese, since I have not found any available dataset.

Very interesting, thanks for the confirmation. Glad to hear that English worked reasonably well.

In the CSD dataset, one syllable corresponds to one note, right? Is this because the songs have relatively simple melodies? In common songs, for example pop music, one syllable may correspond to several notes, right?

Yes, you're right. Normally, it's not the case that one syllable equals one note; CSD was annotated specifically to satisfy this constraint. So if you want to apply this to pop songs, there would be some hurdles.

jaketae commented 3 years ago

Closing this for now. Please feel free to open another issue if you have any further questions!