Closed: JieLuChen closed this issue 7 months ago
We only use RhythmRegulator here because we have no extra information about the phoneme sequence; we do not know which phonemes are vowels and which are consonants. The only thing we know is the word-level phoneme division (the ph_num). If your suggested implementation requires extra information, e.g. phoneme categories, then it may not be suitable for this repository. Introducing such things would make this repository too complicated, and I want it simple and focused on algorithms.
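To illustrate what rhythm regulation can do with only ph_num available, here is a minimal sketch: it rescales predicted phoneme durations so that each word's total matches the target word duration. The function name, arguments, and units are illustrative assumptions, not the repository's actual API.

```python
def regulate_rhythm(ph_dur, ph_num, word_dur):
    """Rescale phoneme durations word by word.

    ph_dur   : predicted duration of each phoneme (e.g. in seconds)
    ph_num   : number of phonemes in each word (the only grouping we know)
    word_dur : target duration of each word, taken from the notes
    """
    out = []
    start = 0
    for n, target in zip(ph_num, word_dur):
        seg = ph_dur[start:start + n]
        total = sum(seg)
        # Scale every phoneme in this word by the same factor, so the
        # word's total duration matches the target exactly.
        out.extend(d * target / total for d in seg)
        start += n
    return out

# Example: two words; the first has 2 phonemes, the second has 3.
print(regulate_rhythm([0.1, 0.3, 0.2, 0.2, 0.1], [2, 3], [0.5, 1.0]))
```

Note that this only constrains word boundaries; it cannot move the consonant/vowel border inside a word, because nothing here knows which phoneme is the vowel.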
By the way, the higher rhythm_corr you get on TensorBoard, the less impact the forced alignment will have on the durations. In short, though, the CLI inference is only for basic tests of the model; handling further deployment is OpenUTAU's task.
So OpenUTAU does the following alignment: aligning the borders between consonants and vowels to the notes. Does ds_variance.py also do this implicitly with LengthRegulator and RhythmRegulator?
Or are these two completely different approaches, and which one yields better results? Or is OpenUTAU's alignment an extra process applied on top of the inferred timings?
If it is an extra alignment process (in OpenUTAU), can it also be applied in ds_variance.py?
UPDATE: It looks like ds_variance.py uses RhythmRegulator() to correct the timings generated by the duration model. But this does not seem to perform as well as the timing alignment done here (aligning borders between consonants and vowels to notes): https://github.com/xunmengshe/OpenUtau-phonemizers/blob/master/EnunuOnnxPhonemizer/EnunuOnnxPhonemizer.cs
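For comparison, the kind of alignment the linked phonemizer performs can be sketched roughly as follows: snap the vowel's onset to the note start and shift any leading consonants earlier so they end exactly at the note boundary. This is a simplified illustration under the assumption that phoneme categories are known (which, per the reply above, this repository deliberately does not track); the function and its arguments are hypothetical.

```python
def align_to_note(note_start, phonemes, durations, vowels):
    """Place the first vowel's onset exactly at note_start; leading
    consonants are shifted before the note so they end at note_start.

    Returns the absolute onset time of each phoneme.
    """
    # Total duration of consonants preceding the first vowel.
    lead = 0.0
    for p, d in zip(phonemes, durations):
        if p in vowels:
            break
        lead += d
    # Start the consonant cluster early enough that the vowel
    # begins exactly on the note boundary.
    onsets = []
    t = note_start - lead
    for d in durations:
        onsets.append(t)
        t += d
    return onsets

# Example: a note starting at 1.0 s, preceded by consonant "k" (0.08 s)
# before the vowel "a"; the vowel onset lands exactly at 1.0 s.
print(align_to_note(1.0, ["k", "a"], [0.08, 0.42], vowels={"a"}))
```

The key difference from rhythm regulation: this needs to know which phoneme is the vowel, which is exactly the extra information the maintainer says the repository does not have.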
@yqzhishen, would you be able to implement this timing alignment in ds_variance.py or update RhythmRegulator()?