Closed: JieLuChen closed this issue 7 months ago
We only use RhythmRegulator here because we have no extra information about the phoneme sequence; we do not know which phonemes are vowels and which are consonants. The only thing we know is the word-level phoneme division (the ph_num). If your suggested implementation requires extra information, e.g. phoneme categories, then it may not be suitable for this repository. Introducing such things would make this repository too complicated, and I want it simple and focused on algorithms.
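To illustrate what rhythm regulation can do with only ph_num available, here is a minimal sketch: it rescales predicted phoneme durations so that each word's total matches the target word duration. The function name, arguments, and units are illustrative assumptions, not the repository's actual API.

```python
def regulate_rhythm(ph_dur, ph_num, word_dur):
    """Rescale phoneme durations word by word.

    ph_dur   : predicted duration of each phoneme (e.g. in seconds)
    ph_num   : number of phonemes in each word (the only grouping we know)
    word_dur : target duration of each word, taken from the notes
    """
    out = []
    start = 0
    for n, target in zip(ph_num, word_dur):
        seg = ph_dur[start:start + n]
        total = sum(seg)
        # Scale every phoneme in this word by the same factor, so the
        # word's total duration matches the target exactly.
        out.extend(d * target / total for d in seg)
        start += n
    return out

# Example: two words; the first has 2 phonemes, the second has 3.
print(regulate_rhythm([0.1, 0.3, 0.2, 0.2, 0.1], [2, 3], [0.5, 1.0]))
```

Note that this only constrains word boundaries; it cannot move the consonant/vowel border inside a word, because nothing here knows which phoneme is the vowel.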
By the way, the higher rhythm_corr you get on TensorBoard, the less impact the forced alignment will have on the durations. In short, though, the CLI inference is only for basic tests of the model; handling further deployment is OpenUTAU's task.
So OpenUTAU does the following alignment: aligning the borders between consonants and vowels to the notes. Does ds_variance.py also do this implicitly with LengthRegulator and RhythmRegulator?
Or are these two completely different approaches, and which one yields better results? Or is OpenUTAU's alignment an extra process applied on top of the inferred timings?
If it is an extra alignment process (in OpenUTAU), can it also be applied in ds_variance.py?
UPDATE: It looks like ds_variance.py uses RhythmRegulator() to correct the timings generated by the duration model. But this does not seem to perform as well as the timing alignment done here (aligning borders between consonants and vowels to notes): https://github.com/xunmengshe/OpenUtau-phonemizers/blob/master/EnunuOnnxPhonemizer/EnunuOnnxPhonemizer.cs
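For comparison, the kind of alignment the linked phonemizer performs can be sketched roughly as follows: snap the vowel's onset to the note start and shift any leading consonants earlier so they end exactly at the note boundary. This is a simplified illustration under the assumption that phoneme categories are known (which, per the reply above, this repository deliberately does not track); the function and its arguments are hypothetical.

```python
def align_to_note(note_start, phonemes, durations, vowels):
    """Place the first vowel's onset exactly at note_start; leading
    consonants are shifted before the note so they end at note_start.

    Returns the absolute onset time of each phoneme.
    """
    # Total duration of consonants preceding the first vowel.
    lead = 0.0
    for p, d in zip(phonemes, durations):
        if p in vowels:
            break
        lead += d
    # Start the consonant cluster early enough that the vowel
    # begins exactly on the note boundary.
    onsets = []
    t = note_start - lead
    for d in durations:
        onsets.append(t)
        t += d
    return onsets

# Example: a note starting at 1.0 s, preceded by consonant "k" (0.08 s)
# before the vowel "a"; the vowel onset lands exactly at 1.0 s.
print(align_to_note(1.0, ["k", "a"], [0.08, 0.42], vowels={"a"}))
```

The key difference from rhythm regulation: this needs to know which phoneme is the vowel, which is exactly the extra information the maintainer says the repository does not have.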
@yqzhishen, would you be able to implement this timing alignment in ds_variance.py or update RhythmRegulator()?