Implement pitch expressiveness controlling mechanism

Expressiveness controls how freely the variance model generates pitch curves. By default, the variance model predicts pitch at 100% expressiveness, which means it fully follows the style of the voice provider. Conversely, 0% expressiveness produces pitch that stays close to the smoothed music score. Expressiveness can be set to any value from 0% to 100%, either statically for a whole segment or dynamically at the frame level.

The mechanism behind expressiveness is a trick on retake_embed. Regions where retake == 1 (100% expressiveness) generate pitch as normal, while regions where retake == 0 (0% expressiveness) return the given base_pitch, which represents the music score. Linearly interpolating between the two embeddings, frame by frame, gives the effect of an expressiveness curve with continuous values between 0 and 1.
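The linear fusion described above can be sketched as follows. This is a minimal illustration, not the code from this PR: it assumes retake_embed behaves like a two-row embedding table (row 0 for retake == 0, row 1 for retake == 1), and the names fuse_retake_embed, embed_table, and expr_curve are hypothetical. NumPy is used only to keep the example self-contained.

```python
import numpy as np

def fuse_retake_embed(embed_table, expr):
    """Blend the two retake embeddings per frame.

    embed_table: array of shape [2, H]; row 0 is the embedding for
                 retake == 0, row 1 for retake == 1 (assumed layout).
    expr:        array of shape [T] with expressiveness values in [0, 1].
    Returns an array of shape [T, H].
    """
    # Clamp to the valid range and add a trailing axis for broadcasting.
    expr = np.clip(np.asarray(expr, dtype=np.float32), 0.0, 1.0)[:, None]
    # Linear interpolation: expr == 1 gives the retake == 1 embedding,
    # expr == 0 gives the retake == 0 embedding.
    return expr * embed_table[1] + (1.0 - expr) * embed_table[0]

# Hypothetical embedding table with hidden size 4.
rng = np.random.default_rng(0)
embed_table = rng.standard_normal((2, 4)).astype(np.float32)

# Dynamic control: one expressiveness value per frame.
expr_curve = np.array([0.0, 0.25, 1.0], dtype=np.float32)
fused = fuse_retake_embed(embed_table, expr_curve)

# Static control: the same value repeated for every frame.
fused_static = fuse_retake_embed(embed_table, np.full(3, 0.5, dtype=np.float32))
```

In this sketch, static control is simply a constant curve, so both modes go through the same interpolation path.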

