openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability, and flexibility, based on "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism"
Apache License 2.0

Notes below C3 are breathy #73

Closed · hilmiyafia closed this issue 1 year ago

hilmiyafia commented 1 year ago

Hello, I have trained a model. It sounds amazing on notes above C3, but notes below C3 are very breathy. When I checked on TensorBoard, the ground-truth samples have become breathy as well, even though the original wav files are not. I think there's something wrong in the binarization code or the vocoder.

yqzhishen commented 1 year ago

Since the TensorBoard ground-truth samples are produced by copy-synthesis through the vocoder, it is most likely the vocoder that breaks the samples. You may perform vocoder-only inference using inference/vocoder/val_nsf_hifigan.py.

As for the root cause, I suspect the f0 extractor, and you may verify this by switching between parselmouth and torchcrepe. The torchcrepe method is not supported in the binarizer yet, but it is usable in the script mentioned above.
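(For concreteness, here is a minimal sketch of comparing the two f0 extractors on one suspect clip. The file name and parameter values are illustrative assumptions, not the binarizer's defaults.)

```python
# Compare parselmouth and torchcrepe f0 estimates on a single clip.
# Illustrative parameters only; not the binarizer's actual defaults.
import numpy as np
import parselmouth
import torch
import torchcrepe

WAV_PATH = "LOW.wav"   # hypothetical sample file
HOP_SECONDS = 0.01     # 10 ms analysis frames

# parselmouth (Praat autocorrelation pitch tracker)
snd = parselmouth.Sound(WAV_PATH)
pitch = snd.to_pitch_ac(time_step=HOP_SECONDS,
                        pitch_floor=40.0, pitch_ceiling=800.0)
f0_praat = pitch.selected_array["frequency"]   # 0.0 on unvoiced frames

# torchcrepe (neural pitch tracker)
audio, sr = torchcrepe.load.audio(WAV_PATH)
f0_crepe = torchcrepe.predict(
    audio, sr, hop_length=int(sr * HOP_SECONDS),
    fmin=40.0, fmax=800.0, model="full",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# A large gap between the two medians on a low-pitched clip (C3 is about
# 130.8 Hz) suggests one extractor is making octave or voicing errors.
voiced = f0_praat > 0
print(f"parselmouth median voiced f0: {np.median(f0_praat[voiced]):.1f} Hz")
print(f"torchcrepe median f0:         {np.median(f0_crepe.cpu().numpy()):.1f} Hz")
```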

Performance of the model also depends on your pitch distribution. Noise, reverb, and some extreme singing styles may also break the f0 extraction. If you don't have much data covering that low pitch range, these errors will mislead the model, and it is not likely to sing well there either.

If you are willing to share, you may post some useful material here.

hilmiyafia commented 1 year ago

@yqzhishen Thank you for the reply. I've attached the samples here: Samples.zip

The data is normalized to around -24 dB, so you may need to increase the volume to hear it.
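(A quick way to check that level and make the clips louder for listening is sketched below, assuming the -24 dB figure refers to integrated loudness and using pyloudnorm; adjust if it was peak- or RMS-based instead.)

```python
# Measure a clip's integrated loudness and normalize it to a target level.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("LOW.wav")          # hypothetical sample file
meter = pyln.Meter(rate)                 # BS.1770 loudness meter
loudness = meter.integrated_loudness(data)
print(f"integrated loudness: {loudness:.1f} LUFS")

# Bring the clip up to -16 LUFS for easier listening.
louder = pyln.normalize.loudness(data, loudness, -16.0)
sf.write("LOW_louder.wav", louder, rate)
```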

I included 3 samples, named by prefix:

- LOW: the notes are low, and you can hear a lot of breathiness.
- MEDIUM: breathiness appears on some of the low notes.
- HIGH: there is no breathiness.

The binarization process reports 4769.98 seconds of training data (about 1 hour 20 minutes) and 40 seconds of validation data. Here is the pitch distribution plot:

[figure: pitch_distribution plot of the dataset]
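(For reference, a plot like this can be reproduced with a short sketch along these lines. This is not the repository's own plotting code, and the input file is a hypothetical dump of voiced-frame f0 values, e.g. collected with one of the extractors sketched above.)

```python
# Histogram a dataset's pitch distribution by converting f0 (Hz) to MIDI notes.
import numpy as np
import matplotlib.pyplot as plt

def hz_to_midi(f0: np.ndarray) -> np.ndarray:
    """Convert frequencies in Hz to (fractional) MIDI note numbers (A4 = 69 = 440 Hz)."""
    return 69.0 + 12.0 * np.log2(f0 / 440.0)

f0_frames = np.load("all_voiced_f0.npy")  # hypothetical dump of voiced-frame f0

midi = hz_to_midi(f0_frames)
plt.hist(midi, bins=np.arange(24, 84))        # C1 (MIDI 24) .. B5 (MIDI 83)
plt.axvline(48, color="r", ls="--", label="C3 (MIDI 48)")
plt.xlabel("MIDI note")
plt.ylabel("frame count")
plt.legend()
plt.savefig("pitch_distribution.png")
```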

yqzhishen commented 1 year ago

The audio quality of your dataset does not seem ideal, though. The vocoder is trained on a large amount of high-quality singing data, so it can indeed fail on your data. How many steps have you trained your model for, and what is the batch size?

I will look into this issue further once I have time.

hilmiyafia commented 1 year ago

@yqzhishen Do you have any example where the vocoder successfully copy-synthesizes high-quality singing data with low notes, around C2-G2 (roughly 65-98 Hz)? I think such an example would help users know what to aim for.

My DiffSinger model is trained for 77,000 steps. I don't remember the batch size because it changes; I think it is calculated automatically. But I doubt the DiffSinger training has any effect here, since the ground truth itself is broken. As you said, it is the vocoder that is not suitable for my data. The only solution I can think of is fine-tuning the vocoder. Do you know if there is any guide on how to re-train NSF-HiFiGAN on my data?

yqzhishen commented 1 year ago

NSF-HiFiGAN is a heavyweight vocoder that is not practical to train personally. In my experience, the current pretrained vocoder performs well in most cases, including on low-pitch data, but it sometimes fails. If you really want a custom vocoder, you may train a DDSP vocoder following the README and instructions in the yxlllc/pc-ddsp repository.

hilmiyafia commented 1 year ago

@yqzhishen Oh! It's okay, I found an NSF-HiFiGAN model trained by Fish Audio. It performs better on the low notes. Thank you 😊