openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0
2.63k stars, 275 forks

question about MIDI and F0 #96

Closed lvZic closed 8 months ago

lvZic commented 1 year ago
  1. How do you get the ground-truth F0 in practice? As I understand it, MIDI can be labeled manually, while ground-truth F0 is usually extracted from the human voice. Also, why do you plan to focus on the MIDI-less version in the future? Does the MIDI-less version perform better than the MIDI one, or are MIDI labels harder to obtain than F0?
  2. In my tests, the ph_dur estimation is poor with your MIDI-A version. When the ground-truth ph_dur is given as input, the results improve a lot. However, the MIDI-B results are always better than the MIDI-A results. I am new to SVS, so sorry for the basic questions.
yqzhishen commented 1 year ago
  1. Advantages of doing so are discussed in #67. MIDI is harder to label, and decoupling MIDI from acoustic models can improve the controllability and performance. In the future, pitch (f0) can be predicted from lyrics and MIDI via variance models.
  2. Old MIDI-A and MIDI-B both have duration predictors; there is no difference in their duration prediction method and logic.
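
To make the gap between note-level MIDI and frame-level F0 concrete, here is a minimal sketch. It is not code from this repository: the helper names (`midi_to_hz`, `hz_to_midi`, `notes_to_frame_f0`) and the hop size are hypothetical, and it only assumes the standard equal-temperament mapping with A4 = 440 Hz = MIDI note 69.

```python
import math

A4_HZ = 440.0   # reference tuning (assumption: standard concert pitch)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Convert a (possibly fractional) MIDI note number to frequency in Hz."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(f0_hz: float) -> float:
    """Convert a positive frequency in Hz to a fractional MIDI note number."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive; unvoiced frames have no pitch")
    return A4_MIDI + 12.0 * math.log2(f0_hz / A4_HZ)


def notes_to_frame_f0(notes, durations_sec, hop_sec):
    """Expand note-level pitches into a flat, frame-level F0 curve.

    notes: MIDI note numbers, with None marking a rest (emitted as 0.0 Hz).
    durations_sec: duration of each note in seconds.
    hop_sec: frame hop in seconds (hypothetical value chosen by the caller).
    """
    f0 = []
    t_end = 0.0
    for note, dur in zip(notes, durations_sec):
        t_end += dur
        # Fill frames up to the cumulative end time to avoid rounding drift.
        target_len = round(t_end / hop_sec)
        hz = 0.0 if note is None else midi_to_hz(note)
        f0.extend([hz] * (target_len - len(f0)))
    return f0


if __name__ == "__main__":
    curve = notes_to_frame_f0([69, None, 81], [1.0, 0.5, 0.5], hop_sec=0.5)
    print(curve)
```

The resulting curve is piecewise constant: it has no vibrato, glides, or note transitions. A curve extracted from a real recording does contain those details, which is one way to see why frame-level F0 and note-level MIDI are treated as separate inputs.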
lvZic commented 1 year ago

Got it. I just ran a ds_e2e.py test comparing the original version and the openvpi version. Both versions used the same checkpoints, as follows:

load 'model' from 'checkpoints/0831_opencpop_ds1000/model_ckpt_steps_320000.ckpt'.
load 'model' from 'checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt'.
load HifiGAN: checkpoints/0109_hifigan_bigpopcs_hop128/model_ckpt_steps_1512000.ckpt

However, the original version's result sounds better than the openvpi one. Have you made a comparison with the original DiffSinger, or could I have made a mistake?

yqzhishen commented 1 year ago

0831 is a really old checkpoint. At that time we hadn't diverged far from the original repo.