openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Fix D4C waveform decompositioning threshold (improves sound quality of variance models) #187

Closed lottev1991 closed 5 months ago

lottev1991 commented 7 months ago

Hello all,

Recently, users from the DiffSinger community have been experimenting with lowering the threshold of the D4C step in waveform decomposition, as found in binarizer_utils.py. The default setting for this threshold is quite high, which can cause issues in models using variance parameters (tension and voicing in particular).

I've set the threshold to 0.25. Some community members have suggested an even lower value, though I have not tested that myself. The 0.25 value has already significantly improved the quality of my latest model, which supports the tension parameter. The improvement seems to be consistent across the board, with multiple positive reports from users so far. This is why I think a lower threshold should become the new default for waveform decomposition.

The initial findings come from @UtaUtaUtau, who had this to say about it:

The D4C step in the waveform decomposition class could be prone to devoicing vowels because the default threshold is pretty high. I would know from experience with developing a WORLD-based UTAU resampler, and a few voicebanks get this issue because of that high threshold. I'd recommend passing threshold=0.25 in it as I found that value pretty decent at avoiding accidental vowel devoicing, although I didn't do any rigorous testing for that threshold. I'm just pointing it out because WORLD might react differently from actual singing samples versus UTAU recording samples...
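To make the effect of the threshold concrete, here is a toy sketch of a D4C-style voicing gate. It is a simplified model written for illustration, not WORLD's actual code: the function name, the per-frame scores and the F0 values are all hypothetical. The only assumption taken from the library is that a frame is treated as unvoiced when its periodicity score does not exceed the threshold, and that pyworld's `d4c` exposes this as a `threshold` argument with a documented default of 0.85 (please verify against your installed version).

```python
from typing import List


def voiced_after_gate(f0: List[float], scores: List[float], threshold: float) -> List[bool]:
    """Toy model of a D4C-style voicing gate (not WORLD's actual code).

    A frame stays voiced only if it has a pitch (f0 > 0) and its
    periodicity score is strictly above the threshold.
    """
    return [pitch > 0 and score > threshold for pitch, score in zip(f0, scores)]


# Hypothetical per-frame data: a sustained vowel with a breathy middle
# section, followed by one genuinely unvoiced frame (f0 == 0).
f0 = [220.0, 220.0, 221.0, 222.0, 0.0]
scores = [0.95, 0.60, 0.40, 0.90, 0.10]

high = voiced_after_gate(f0, scores, threshold=0.85)  # assumed library default
low = voiced_after_gate(f0, scores, threshold=0.25)   # value proposed in this PR

print(sum(high), "voiced frames at threshold 0.85")
print(sum(low), "voiced frames at threshold 0.25")
```

With these made-up scores, the high threshold devoices the two breathy vowel frames, while the lower one keeps all four pitched frames voiced and still rejects the frame with no pitch.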

Regards,

Lotte V

yqzhishen commented 7 months ago

We have done some experiments on the parameter, but no observable difference was found between the default threshold and your proposed value.

Perhaps we should collect more information on this issue. For example, which PE (pitch extractor) are you using, or which PEs have you tried? Do different PEs make a difference here? Currently, most people in our Chinese community, including us, use RMVPE, and there is no evidence yet that the threshold (or tension itself) affects quality. I hope you (and other people as well) can provide more experimental results before we decide whether to modify the settings, and how.

Changing a parameter is not an easy thing. For example, if there are not many cases to support the change, we would rather make it a user-defined configuration than hard-code it; if the influence is wide and significant, then we can consider changing it directly in the code; otherwise, the default value tuned by the library author should still be preferred.
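As a rough sketch of the "user-defined configuration" option, the threshold could be read from the config with a fallback to the library default. This is only an illustration under stated assumptions: the key name `d4c_threshold` and the helper `resolve_d4c_threshold` are hypothetical and do not exist in the repository, and 0.85 is assumed to be the upstream default.

```python
from typing import Any, Mapping

# Assumed upstream default; verify against your installed pyworld version.
DEFAULT_D4C_THRESHOLD = 0.85


def resolve_d4c_threshold(config: Mapping[str, Any]) -> float:
    """Return the D4C threshold from config, falling back to the default.

    'd4c_threshold' is a hypothetical key used only for this sketch.
    """
    value = float(config.get("d4c_threshold", DEFAULT_D4C_THRESHOLD))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"d4c_threshold must be in [0, 1], got {value}")
    return value


print(resolve_d4c_threshold({}))                       # no override: default is kept
print(resolve_d4c_threshold({"d4c_threshold": 0.25}))  # user opts in to a lower value
```

Existing configs keep the author-tuned default this way, and only users who set the key explicitly get different behavior.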

yqzhishen commented 5 months ago

A better harmonic-noise separation algorithm is introduced in #196, together with an important bug fix for array padding when the recording is longer than the label.

The issue described in this PR can be bypassed simply by using the new algorithm. Feel free to raise a new issue if new problems occur.