An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Expressiveness controls how freely the variance model generates pitch curves. By default, the variance model predicts pitch at 100% expressiveness, which means it fully follows the style of the voice provider. Conversely, 0% expressiveness produces pitch that stays very close to the smoothed music score. Expressiveness can be set anywhere from 0% to 100%, either as a single static value or dynamically at the frame level.
Expressiveness is implemented as a trick on retake_embed. Regions where retake == 1 (100% expressiveness) generate pitch as normal, while regions where retake == 0 (0% expressiveness) return the given base_pitch, which represents the music score. Linearly blending the two embeddings yields an expressiveness curve that takes continuous values between 0 and 1.
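The linear fusion described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name fuse_retake_embeddings, the two toy embedding vectors, and the per-frame list layout are all assumptions made for illustration, and a real model would use learned embedding parameters inside its own framework.

```python
from typing import List, Sequence


def fuse_retake_embeddings(
    expressiveness: Sequence[float],
    embed_retake_0: Sequence[float],
    embed_retake_1: Sequence[float],
) -> List[List[float]]:
    """Blend two embedding vectors per frame (hypothetical helper).

    expressiveness: one value in [0, 1] per frame.
    embed_retake_0: assumed embedding for retake == 0 (follow base_pitch).
    embed_retake_1: assumed embedding for retake == 1 (generate freely).
    """
    if len(embed_retake_0) != len(embed_retake_1):
        raise ValueError("embedding vectors must have the same size")

    fused = []
    for e in expressiveness:
        if not 0.0 <= e <= 1.0:
            raise ValueError(f"expressiveness must be in [0, 1], got {e}")
        # Linear interpolation: e == 0 gives embed_retake_0,
        # e == 1 gives embed_retake_1, values in between mix the two.
        fused.append([
            e * v1 + (1.0 - e) * v0
            for v0, v1 in zip(embed_retake_0, embed_retake_1)
        ])
    return fused


# Toy 2-dimensional embeddings; real ones would be learned parameters.
embed_0 = [0.0, 2.0]
embed_1 = [4.0, -2.0]

# A three-frame expressiveness curve: 0%, 50%, 100%.
curve = [0.0, 0.5, 1.0]
frames = fuse_retake_embeddings(curve, embed_0, embed_1)
print(frames)
```

A static expressiveness setting corresponds to passing the same value for every frame, while a dynamic curve passes a different value per frame.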