openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility, based on "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism"
Apache License 2.0

Question about training in other languages like English #29

Closed: francqz31 closed this issue 1 year ago

francqz31 commented 1 year ago

Hello, really nice work you are doing here. I was wondering if you could guide me on how to create the music score for my custom English dataset. It contains about 5-6 hours of English singing, and yes, it is already transcribed as plain text. I'm also wondering whether creating the score is manual work, or whether there is a tool or Python script to automate it. I know, academically, that the music score consists of the input text, input notes and input durations; my question is how to create the input notes and durations. I want to train different SVS algorithms, from HiFiSinger to DiffSinger, and they all require a music score, so I think I need to format my English dataset like opencpop. Here is an example of a single line of opencpop's train.txt:

"2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|0.609200 0.609200 2.137240 2.137240 3.058880|0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0"

To be honest, I don't understand all the D#/Eb stuff, the different decimal numbers, or the bunch of 0s.

I'm also unsure whether I will need a phoneme transcription like "f eng k uang SP" between the input text "疯狂" and the input notes "D#4/Eb4 D#4/Eb4 D4 D4 rest", since my dataset is already in English. I believe "f eng k uang SP" is the romanized phoneme transcription of the Chinese sung word "疯狂".

If you can guide me in detail, I would be happy to give back and donate the English singing dataset to you, since, as you know, there isn't a single one available.

By the way, I saw the "SingingVoice-MFA-Training" repository, but I don't understand it, and I'm not sure whether it can help in my situation or whether it works with English or only Chinese.

If you can answer this, that would be amazing, because I'm lost. Thanks in advance. :)

yqzhishen commented 1 year ago

The opencpop label format is "filename | lyrics | phoneme sequence | MIDI sequence | MIDI duration sequence | phoneme duration sequence | is-slur sequence" (you can refer to their paper for more details). Entries like "D#4/Eb4" are musical note names, and the trailing number indicates the octave. These are fundamental musical concepts.
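To make the seven fields concrete, here is a minimal sketch that splits one train.txt line into its parts. The field names follow the description above; `parse_opencpop_line` is a hypothetical helper for illustration, not part of the DiffSinger or opencpop code.

```python
def parse_opencpop_line(line: str) -> dict:
    # opencpop uses "|" to separate the seven label fields.
    (name, lyrics, phonemes, notes,
     note_durs, ph_durs, is_slur) = line.strip().split("|")
    return {
        "name": name,
        "lyrics": lyrics,
        "phonemes": phonemes.split(),                 # e.g. ["f", "eng", "k", "uang", "SP"]
        "notes": notes.split(),                        # note name + octave, e.g. "D#4/Eb4"
        "note_durations": [float(x) for x in note_durs.split()],
        "phoneme_durations": [float(x) for x in ph_durs.split()],
        "is_slur": [int(x) for x in is_slur.split()],  # 1 marks a slurred phoneme
    }

line = ("2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|"
        "0.609200 0.609200 2.137240 2.137240 3.058880|"
        "0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0")
fields = parse_opencpop_line(line)
```

Note that every sequence field has one entry per phoneme, which is why the example line has five notes, five durations and five slur flags for the five phonemes.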

Due to the difficulty of labeling the MIDI sequence and slurs, we developed a MIDI-less mode of DiffSinger, in which the MIDI sequence, MIDI durations and slurs are all removed from the labels. In this mode the transcription looks like

"2002000043|啊|f eng k uang SP|rest|0|0.16178 0.44742 0.11219 2.02505 3.05888|0"

where "啊", "rest" and "0" are only placeholders. In this mode, only the phoneme sequence and the duration of each phoneme are labeled and used. At inference time, you input the phonemes, the durations and the whole f0 curve, and the model synthesizes the singing voice (however, no duration predictor or f0 predictor is available). We do have a complete pipeline for building your own dataset at pipelines/no_midi_preparation.ipynb, where you only need to label the lyrics of your recordings; however, I'm sorry that this pipeline can only handle Mandarin Chinese data.
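The relationship between the two formats can be sketched as a small conversion: keep the filename, phoneme sequence and phoneme durations, and replace the other fields with the placeholders shown above. `to_midiless` is a hypothetical helper for illustration, not part of the DiffSinger codebase.

```python
def to_midiless(label: str) -> str:
    # Split the seven "|"-separated opencpop fields, keep only the
    # filename, phoneme sequence and phoneme durations, and substitute
    # the placeholders "啊", "rest" and "0" for the dropped fields.
    name, _lyrics, phonemes, _notes, _note_durs, ph_durs, _slur = label.strip().split("|")
    return "|".join([name, "啊", phonemes, "rest", "0", ph_durs, "0"])

full_line = ("2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|"
             "0.609200 0.609200 2.137240 2.137240 3.058880|"
             "0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0")
print(to_midiless(full_line))
```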

The reason we only provide easy preparation pipelines for Chinese is that training an English model right now involves too many things, and we cannot provide help with all of them. You would need knowledge of words, syllables, consonants and vowels, musical scores, slurs and rhythms, and you would need an MFA model to align your transcriptions (you can refer directly to the official MFA documentation). You would also need coding and debugging skills to adapt the current MIDI-less dataset preparation pipelines to English, and even if you finished all of this, we still cannot guarantee the network will be suitable for other languages. This issue and this doc may help you understand phoneme systems in singing and the differences between English and Chinese. We currently use this dictionary for Chinese, and you will need something similar to train MFA models and label your data.
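For reference, an MFA pronunciation dictionary is a plain-text file with one word per line, followed by its phoneme sequence, whitespace-separated. A minimal English sketch (these entries and the ARPAbet-style phonemes are illustrative examples, not taken from any dictionary shipped with this repo):

```
singing  S IH NG IH NG
voice    V OY S
score    S K AO R
```

You would need such a dictionary covering your dataset's vocabulary, in a phoneme set consistent with your labels, both to train or use an MFA acoustic model and to label your data.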

Anyway, I will be very glad if you successfully adapt the data preparation pipelines to English. Otherwise, please stay tuned on our development; support for universal phoneme systems will come in the future.

DuQingChen commented 1 year ago

Regarding "pipelines/no_midi_preparation.ipynb where you only need to label lyrics of your voice": what about punctuation symbols in the lyrics, such as ","? For example: "左手握大地右手握着天,掌纹裂出了十方的闪电."

yqzhishen commented 1 year ago

Support for universal dictionaries was implemented in #90, so this issue is to be closed. Please wait for the v2.0.0 release.