Fix D4C waveform decompositioning threshold (improves sound quality of variance models)

lottev1991 commented 7 months ago

Hello all,

Recently, users from the DiffSinger community have been experimenting with lowering the threshold of the D4C waveform decomposition step found in binarizer_utils.py. The default threshold is quite high, which can cause the following issues in models that use variance parameters (tension and voicing in particular):

Decreased audio quality. I trained one model with the tension parameter enabled and one without. The tension model showed a noticeable reduction in audio quality that was absent from the model without it. I trained both the acoustic and variance models to their maximum steps, but the sound quality never improved.
Devoicing of vowel sounds, especially when tension and voicing are trained together. With the current default settings, vowels are often incorrectly recognized as unvoiced, which causes very strange gaps on long notes and reduces the quality of the model even further.

I've set the threshold to 0.25; the community has suggested going even lower, though I have not tested that myself. This value has already significantly improved the quality of my latest model, which does support the tension parameter. The improvement seems to be consistent across the board, with multiple positive reports from users so far. This is why I think a lower threshold should become the new default for waveform decomposition.
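As a rough illustration of why a lower threshold leads to fewer devoiced frames, the toy sketch below applies a threshold-based voiced/unvoiced decision to a list of per-frame scores. It is not WORLD's actual implementation: the `voiced_mask` helper and the score values are made up for illustration, and 0.85 is used as the "high" value because it is the documented default of `pyworld.d4c`.

```python
def voiced_mask(scores, threshold):
    """Return True for frames kept as voiced (score strictly above threshold)."""
    return [score > threshold for score in scores]


# Made-up per-frame voicing scores for a sustained vowel (illustrative only).
scores = [0.95, 0.90, 0.60, 0.40, 0.88, 0.30, 0.92]

default_mask = voiced_mask(scores, 0.85)  # high threshold
lowered_mask = voiced_mask(scores, 0.25)  # value proposed in this PR

print(default_mask.count(False))  # 3 frames treated as unvoiced
print(lowered_mask.count(False))  # 0 frames treated as unvoiced
```

With the high threshold, the weaker frames in the middle of the vowel are dropped to unvoiced, which corresponds to the gaps on long notes described above.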

The initial findings came from @UtaUtaUtau, who had this to say about it:

The D4C step in the waveform decomposition class could be prone to devoicing vowels because the default threshold is pretty high. I would know from experience with developing a WORLD-based UTAU resampler, and a few voicebanks get this issue because of that high threshold. I'd recommend passing threshold=0.25 in it as I found that value pretty decent at avoiding accidental vowel devoicing, although I didn't do any rigorous testing for that threshold. I'm just pointing it out because WORLD might react differently from actual singing samples versus UTAU recording samples...
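A minimal sketch of what passing the threshold could look like, assuming the decomposition step calls `pyworld.d4c` (which accepts a `threshold` keyword). The `extract_aperiodicity` wrapper and its `d4c` argument are hypothetical names used here for illustration; they are not taken from binarizer_utils.py.

```python
def extract_aperiodicity(x, f0, t, fs, threshold=0.25, d4c=None):
    """Run D4C with an explicit voiced/unvoiced threshold.

    x         : waveform (float64 array for pyworld)
    f0, t     : F0 contour and frame times from a pitch extractor
    fs        : sample rate in Hz
    threshold : value in [0, 1]; lower values keep more frames voiced
    d4c       : optional callable with pyworld.d4c's signature (hypothetical
                hook, mainly so the wrapper can be exercised without pyworld)
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    if d4c is None:
        # Imported lazily so the helper can be defined without pyworld installed.
        import pyworld
        d4c = pyworld.d4c
    return d4c(x, f0, t, fs, threshold=threshold)
```

Calling `extract_aperiodicity(x, f0, t, fs)` would then use 0.25 instead of the library default, and any other value can be passed explicitly for comparison.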

Regards,

Lotte V

yqzhishen commented 7 months ago

We have done some experiments on the parameter, but no observable difference was found between the default threshold and your proposed value.

Perhaps we should collect more information on this issue. For example, which PE are you using, or which PEs have you tried? Does the choice of PE matter here? In our Chinese community, most people, including us, currently use RMVPE, and there is as yet no evidence that the threshold (or tension itself) affects quality. I hope you (and others as well) can provide more experimental results before we decide whether to modify the settings, and how.

Changing a parameter is not an easy thing. For example, if there are not many cases to support the change, we would rather make it a user-defined configuration option than hard-code it; if the influence is wide and significant, we can consider changing it directly in the code; otherwise, the default value tuned by the library author should still be preferred.
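The user-defined configuration route could look roughly like the sketch below. The `d4c_threshold` key and the `resolve_d4c_threshold` helper are hypothetical names, not existing DiffSinger options; the fallback of 0.85 is the documented default of `pyworld.d4c`, so behaviour stays unchanged unless a user opts in.

```python
DEFAULT_D4C_THRESHOLD = 0.85  # library default, kept when the user sets nothing


def resolve_d4c_threshold(config):
    """Read an optional, user-defined D4C threshold from a config mapping."""
    # "d4c_threshold" is a hypothetical key chosen for this sketch.
    value = float(config.get("d4c_threshold", DEFAULT_D4C_THRESHOLD))
    if not 0.0 <= value <= 1.0:
        raise ValueError("d4c_threshold must be between 0 and 1")
    return value


print(resolve_d4c_threshold({}))                       # 0.85
print(resolve_d4c_threshold({"d4c_threshold": 0.25}))  # 0.25
```

Keeping the library default as the fallback means existing setups are unaffected, while users who see devoicing can lower the value and report results.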

yqzhishen commented 5 months ago

A better harmonic-noise separation algorithm is introduced in #196, together with an important bug fix for array padding when the recording is longer than the label.

The issue described in this PR can be bypassed simply by using the new algorithm. Feel free to raise a new issue if new problems occur.

openvpi / DiffSinger

Fix D4C waveform decompositioning threshold (improves sound quality of variance models) #187