openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Question regarding pitch models (Reflow vs DDPM) #193

ariikamusic commented 4 months ago

Hello,

I have a question regarding the current pitch models, specifically the differences between Reflow and DDPM. With the latest update, it seems that Reflow has become the new default and recommended setting for training acoustic and variance models. While Reflow is much faster than DDPM, the speed appears to come at the cost of quality.

I've conducted multiple experiments with my dataset of three speakers (a soprano, a mezzo-soprano, and a tenor), each with approximately three hours of Japanese singing data, using the multispeaker method. Unfortunately, the experiments using Reflow for the pitch models have been inconsistent in my experience. The speakers are all very expressive and stylized in their singing, which is rarely reflected in the results. I've tried different batch sizes, maximum steps, step sizes, and switched between L1 and L2 loss functions, but none of these adjustments have produced the desired results. Specifically, I find that Reflow does not accurately replicate the singers' styles. The resulting F0 is relatively flat, with little variation or randomness, and the singing style feels "safe" with minimal vibrato, even when the singer uses vibrato frequently.

On the other hand, experiments using DDPM have yielded much clearer and more accurate results that better replicate the singers' styles. It seems to me that DDPM trains more carefully than Reflow.

My question is: what could be the reason for this difference in results between the two diffusion types? Might DDPM be more suited to highly stylized and random singing, especially when using L2 loss, which penalizes larger outliers more heavily? Is Reflow more suited to singing that is less random?
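
For reference, this is my understanding of the two training objectives as a minimal PyTorch sketch (simplified and not DiffSinger's actual code; the noise schedule, interpolation direction, and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(net, x0, alphas_cumprod, use_l1=False):
    # DDPM objective: corrupt x0 with Gaussian noise at a random step t
    # and train the network to predict that noise (epsilon prediction).
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    pred = net(x_t, t)
    return F.l1_loss(pred, noise) if use_l1 else F.mse_loss(pred, noise)

def reflow_loss(net, x0, use_l1=False):
    # Rectified flow objective: interpolate on a straight line between data
    # and noise and train the network to predict the constant velocity.
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    x1 = torch.randn_like(x0)        # noise endpoint
    x_t = (1.0 - t) * x0 + t * x1    # straight-line interpolation
    v_target = x1 - x0               # velocity of that line
    pred = net(x_t, t.flatten())
    return F.l1_loss(pred, v_target) if use_l1 else F.mse_loss(pred, v_target)
```

The L1/L2 switch I mentioned above only changes the final regression loss in each case; the corruption process is what differs between the two methods.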

Thank you in advance.

yqzhishen commented 4 months ago

In our experiments, Reflow outperforms DDPM by a large margin on all types of datasets, especially expressive ones. Furthermore, Reflow tolerates worse (automatic) labels and scales to more data and speakers. Your case therefore seems unexpected, and there may be another cause to rule out before blaming Reflow itself.

There are many factors that influence pitch performance, like your training steps, your labels, your combination of variance modules, your choice of speedup/steps, or even your method of testing. For research purposes, I recommend reading the accuracy metrics and validation plots on TensorBoard, or using the CLI inference script in this repository. (There have been cases where someone put a multi-speaker pitch model into OpenUTAU with a misconfigured YAML, and the software produced wrong results without reporting any error.)
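
If it helps, here is a minimal sketch of pulling those scalars out of the TensorBoard event files with the `tensorboard` Python package (the log directory and tag name below are placeholders; list the real tags with `Tags()` for your experiment):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("checkpoints/my_pitch_exp")  # hypothetical log directory
ea.Reload()
print(ea.Tags()["scalars"])                        # see which scalar tags exist
for e in ea.Scalars("validation/pitch_acc"):       # placeholder tag name
    print(e.step, e.value)
```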

Therefore, if you still cannot figure out the reason, please provide more details, for example:

ariikamusic commented 3 months ago

Hello.

Thank you for your response, it is much appreciated.

After doing more experiments and also comparing the results with inference via the command line, Reflow outperforms DDPM by a large margin. For some reason, the result is very different when it is generated in OpenUTAU. I did make sure the config for OpenUTAU was set up correctly, though, so I wonder why that is. My apologies for blaming Reflow at first, when the issue is most likely OpenUTAU, or the ONNX model being exported wrongly.

Thank you again.

yqzhishen commented 3 months ago

A possible debugging method is to freeze a single speaker into the model and test it in OpenUTAU. OpenUTAU has run into problems with multi-speaker cases many times before; there have been bugs where the result seemed okay but the model was actually not running correctly at all.

Also, it is unlikely to be an ONNX bug if you exported the model successfully with PyTorch 1.13, because other people are using multi-speaker pitch models in OpenUTAU and getting reasonable results.
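
Still, if you want to rule out the export itself, a minimal sanity check is to load the exported model directly with onnxruntime and compare its output against a dump from the CLI run (a sketch only; the model path, input names, and dump file below are placeholders, since they depend on how your model was exported):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("my_pitch_model.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)   # what the exported graph actually expects

# Then build a feed dict with the same inputs you gave the CLI inference script,
# run the session, and compare the two F0 curves, e.g.:
# feeds = {...}                                # dumped inputs (placeholder)
# onnx_f0 = sess.run(None, feeds)[0]
# torch_f0 = np.load("cli_reference_f0.npy")   # hypothetical dump from the CLI run
# print(np.abs(onnx_f0 - torch_f0).max())
```

If the ONNX output matches the CLI output, the remaining difference has to come from how OpenUTAU prepares the inputs.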

Maybe you still need to check the configuration carefully. OpenUTAU has too many undefined behaviors that can break the results without reporting any error, and only when you do everything exactly as it expects can you get the right outputs.