openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Implement pitch expressiveness controlling mechanism #97

Closed yqzhishen closed 11 months ago

yqzhishen commented 1 year ago

Expressiveness controls how freely the variance model generates pitch curves. By default, the variance model predicts pitch at 100% expressiveness, which means it fully follows the style of the voice provider. Conversely, 0% expressiveness produces pitch that stays close to the smoothed music score. Expressiveness can be adjusted anywhere from 0% to 100%, either statically or dynamically at the frame level.
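As a rough illustration of the static versus frame-level options, the sketch below turns either a single percentage or a per-frame list of percentages into one value in [0, 1] per frame. The function name to_frame_curve and its arguments are hypothetical and are not taken from the DiffSinger codebase.

```python
from typing import List, Sequence, Union


def to_frame_curve(expr: Union[float, Sequence[float]], n_frames: int) -> List[float]:
    """Return one expressiveness value in [0, 1] per frame.

    `expr` is given in percent: either a single static value or a
    per-frame sequence with exactly `n_frames` entries.
    (Hypothetical helper, for illustration only.)
    """
    if isinstance(expr, (int, float)):
        # Static case: repeat the same value for every frame.
        values = [float(expr)] * n_frames
    else:
        # Dynamic case: one value per frame is required.
        values = [float(v) for v in expr]
        if len(values) != n_frames:
            raise ValueError("dynamic curve must have one value per frame")
    # Convert percent to a fraction and clamp to the valid range.
    return [min(1.0, max(0.0, v / 100.0)) for v in values]
```

For example, to_frame_curve(100, 3) gives full expressiveness on all three frames, while a list such as [0, 50, 100] ramps from the score-like pitch to the fully expressive one.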

The mechanism behind expressiveness is a trick on retake_embed. Regions where retake == 1 (100% expressiveness) generate pitch as normal, while regions where retake == 0 (0% expressiveness) return the given base_pitch, which represents the music score. Applying a linear fusion (interpolation) of the two embeddings yields the effect of an expressiveness curve with continuous values between 0 and 1.
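The linear fusion described above can be sketched in a few lines. This is a minimal, framework-free illustration that assumes retake_embed behaves like a two-row lookup table (row 0 for retake == 0, row 1 for retake == 1); the helper name fuse_retake_embed and the plain-list representation are assumptions for this example, not the project's actual implementation.

```python
from typing import List, Sequence


def fuse_retake_embed(
    embed_keep: Sequence[float],    # assumed embedding row for retake == 0
    embed_retake: Sequence[float],  # assumed embedding row for retake == 1
    expr: Sequence[float],          # per-frame expressiveness in [0, 1]
) -> List[List[float]]:
    """Linearly interpolate between the two embeddings for each frame.

    expr == 0 reproduces `embed_keep`, expr == 1 reproduces
    `embed_retake`, and values in between blend the two.
    """
    if len(embed_keep) != len(embed_retake):
        raise ValueError("embeddings must have the same dimension")
    return [
        [(1.0 - e) * k + e * r for k, r in zip(embed_keep, embed_retake)]
        for e in expr
    ]


# Toy 2-dimensional embeddings and a 3-frame expressiveness curve.
frames = fuse_retake_embed([0.0, 2.0], [1.0, 4.0], [0.0, 0.5, 1.0])
```

With a hard 0/1 curve this reduces to the ordinary embedding lookup, so the continuous case is a strict generalization of the existing retake mask.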