openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Rhythmizers for other languages #62

Closed haru0l closed 1 year ago

haru0l commented 1 year ago

Hi again, I was wondering if there is any documentation regarding the rhythmizers, as I would like to train one for Japanese...

Side question: would a rhythmizer work for CVVC languages such as English or Polish?

yqzhishen commented 1 year ago

Rhythmizers are actually a temporary solution for phoneme duration prediction in MIDI-less models. A rhythmizer contains the FastSpeech2 encoder module and the DurationPredictor module from MIDI-A mode.
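
For reference, the composition can be pictured as below. This is a minimal sketch, not the repository's actual code: the `Rhythmizer` class and its constructor arguments are hypothetical, standing in for the exported encoder and duration predictor of a trained MIDI-A model.

```python
import torch
import torch.nn as nn

class Rhythmizer(nn.Module):
    """Sketch of a rhythmizer: a FastSpeech2 encoder plus a duration
    predictor taken from a trained MIDI-A model. Names are illustrative."""

    def __init__(self, encoder: nn.Module, duration_predictor: nn.Module):
        super().__init__()
        self.encoder = encoder                        # FastSpeech2 encoder from MIDI-A
        self.duration_predictor = duration_predictor  # DurationPredictor from MIDI-A

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # Encode the phoneme sequence, then predict a duration for each
        # phoneme; a MIDI-less model consumes these predicted durations.
        hidden = self.encoder(phoneme_ids)       # [B, T, H]
        return self.duration_predictor(hidden)   # [B, T]
```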

Models of MIDI-A mode can predict phoneme durations well and generate nice spectrograms, but their datasets are hard to label (you need MIDI sequences and slurs), and they are poor at predicting pitch, even though they do have a PitchPredictor. That is why we are deprecating this mode in this forked repository.

To get a rhythmizer:

1. Choose or design a phoneme dictionary.
2. Label your dataset in the opencpop segments format. Please note that the MIDI duration transcriptions of opencpop are in consonant-vowel format, while you need to label your dataset in vowel-consonant format; that is, the beginning of each note should be aligned with the beginning of the vowel instead of the consonant (see issue), as sketched after this list. Here is an example of labels that we converted from the original opencpop transcriptions: transcriptions-strict-revised2.txt.
3. Preprocess your dataset and train a MIDI-A model with this config.
4. Export the part for duration prediction with this script.
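
To illustrate the realignment in step 2: a minimal sketch, assuming a simplified label structure where each syllable carries the absolute onsets of its optional leading consonant and its vowel. This is not the script that produced transcriptions-strict-revised2.txt, only an illustration of the boundary shift.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    consonant_onset: float | None  # onset of the optional leading consonant
    vowel_onset: float             # onset of the vowel

def vowel_aligned_notes(syllables: list[Syllable],
                        utterance_end: float) -> list[tuple[float, float]]:
    """Each note spans from its syllable's vowel onset to the next
    syllable's vowel onset (or the utterance end), so the leading
    consonant of the *next* syllable falls inside the current note."""
    onsets = [s.vowel_onset for s in syllables]
    ends = onsets[1:] + [utterance_end]
    return list(zip(onsets, ends))

# "ka" with /k/ at 0.00 and /a/ at 0.08, "ta" with /t/ at 0.50 and /a/ at 0.55:
# the first note runs 0.08-0.55, absorbing the /t/ of the next syllable.
sylls = [Syllable(0.00, 0.08), Syllable(0.50, 0.55)]
print(vowel_aligned_notes(sylls, utterance_end=1.20))  # [(0.08, 0.55), (0.55, 1.2)]
```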

For CVVC languages like English and Polish, the answer is no, because we can currently only deal with two-phase (CV) phoneme systems like those of Chinese and Japanese. MIDI-A, MIDI-B, duration predictors, data labels and all other word-phoneme related machinery will be re-designed in the future, and at that point you can expect full support for all languages. No rhythmizers will be needed then: everyone will be able to train their own variance adaptors (containing duration and pitch models and much more) through standard pipelines, as easily as preparing and training MIDI-less acoustic models today.
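
As a concrete illustration of the two-phase constraint: in a CV system every syllable decomposes into at most one consonant followed by one vowel. The sketch below uses made-up dictionary entries and phoneme symbols; it only demonstrates why a coda consonant, as in English "cat", breaks the scheme.

```python
# Hypothetical dictionary entries, for illustration only.
CV_SYLLABLES = {
    "ka": ["k", "a"],   # consonant + vowel: fits the two-phase scheme
    "a":  ["a"],        # vowel only: also fits
}
ENGLISH_WORD = {"cat": ["k", "ae", "t"]}  # coda /t/ breaks the scheme

def is_two_phase(phonemes: list[str], vowels: set[str]) -> bool:
    """True iff the syllable is V or C+V: at most one leading consonant
    followed by exactly one vowel, with no coda or consonant cluster."""
    if len(phonemes) == 1:
        return phonemes[0] in vowels
    return (len(phonemes) == 2
            and phonemes[0] not in vowels
            and phonemes[1] in vowels)

VOWELS = {"a", "ae"}
print(all(is_two_phase(p, VOWELS) for p in CV_SYLLABLES.values()))  # True
print(all(is_two_phase(p, VOWELS) for p in ENGLISH_WORD.values()))  # False
```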

By the way, members of our team are already preparing a Japanese rhythmizer. When they finish the dictionary, the rhythmizer and the MFA model, we will formally support Japanese MIDI-less mode preparation in our pipeline. If you find it difficult to prepare these on your own, it is fine to just wait for our progress.

yqzhishen commented 1 year ago

@haru0l Hi there, we started a discussion on the design of the Japanese dictionary here: https://github.com/openvpi/DiffSinger/discussions/68

We would be happy if you are interested in taking part.

haru0l commented 1 year ago

Will do! Also I should close this...