openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility, based on "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism"
Apache License 2.0

Question about training in other languages like English #29

Closed: francqz31 closed this issue 1 year ago

francqz31 commented 1 year ago

Hello, really nice work you are doing here. I was wondering if you could guide me on how to create the music score for my custom English dataset. It contains about 5-6 hours of English singing, and yes, it is already transcribed as plain text. I'm also wondering whether creating the score is manual work, or whether there is a tool or Python script to automate it. I know, academically, that the music score consists of the input text, input notes and input durations; my question is how to create the input notes and durations. I want to train different SVS algorithms, from HiFiSinger to DiffSinger, and they all require a music score, so I think I need to format my English dataset like opencpop. Here is an example of a single line of opencpop's train.txt:

"2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|0.609200 0.609200 2.137240 2.137240 3.058880|0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0"

To be honest, I don't understand all the D#/Eb stuff, the different decimal numbers, or the bunch of 0s.

I'm also unsure whether I will need a phoneme transcription like "f eng k uang SP" between the input text "疯狂" and the input notes "D#4/Eb4 D#4/Eb4 D4 D4 rest", since my dataset is already in English. I believe "f eng k uang SP" is the romanized phoneme transcription of the Chinese sung word "疯狂".

If you can guide me in detail, I would be happy to give back and donate the English singing dataset to you, since, as you know, there isn't a single one available.

By the way, I saw the "SingingVoice-MFA-Training" repository, but I don't understand it, and I'm not sure whether it can help in my situation or whether it works with English or only Chinese.

If you can answer this, that would be amazing, because I'm lost. Thanks in advance. :)

yqzhishen commented 1 year ago

The opencpop label format is "filename | lyrics | phoneme sequence | MIDI sequence | MIDI duration sequence | phoneme duration sequence | is-slur sequence" (you can refer to their paper for more details). Entries like "D#4/Eb4" are musical note names, and the trailing number indicates the octave. These are fundamental musical concepts.
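To make the seven fields concrete, here is a minimal sketch that splits one train.txt line into its parts. The field names follow the description above; `parse_opencpop_line` is a hypothetical helper for illustration, not part of the DiffSinger or opencpop code.

```python
def parse_opencpop_line(line: str) -> dict:
    # opencpop uses "|" to separate the seven label fields.
    (name, lyrics, phonemes, notes,
     note_durs, ph_durs, is_slur) = line.strip().split("|")
    return {
        "name": name,
        "lyrics": lyrics,
        "phonemes": phonemes.split(),                 # e.g. ["f", "eng", "k", "uang", "SP"]
        "notes": notes.split(),                        # note name + octave, e.g. "D#4/Eb4"
        "note_durations": [float(x) for x in note_durs.split()],
        "phoneme_durations": [float(x) for x in ph_durs.split()],
        "is_slur": [int(x) for x in is_slur.split()],  # 1 marks a slurred phoneme
    }

line = ("2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|"
        "0.609200 0.609200 2.137240 2.137240 3.058880|"
        "0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0")
fields = parse_opencpop_line(line)
```

Note that every sequence field has one entry per phoneme, which is why the example line has five notes, five durations and five slur flags for the five phonemes.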

Due to the difficulty of labeling the MIDI sequence and slurs, we developed a MIDI-less mode of DiffSinger, in which the MIDI sequence, MIDI durations and slurs are all removed from the labels. In this mode the transcription looks like

"2002000043|啊|f eng k uang SP|rest|0|0.16178 0.44742 0.11219 2.02505 3.05888|0"

where "啊", "rest" and "0" are only placeholders. In this mode, only the phoneme sequence and the duration of each phoneme are labeled and used. At inference time, you input the phonemes, the durations and the whole f0 curve, and the model synthesizes the singing voice (however, no duration predictor or f0 predictor is available). We do have a complete pipeline for building your own dataset at pipelines/no_midi_preparation.ipynb, where you only need to label the lyrics of your recordings; however, I'm sorry that this pipeline can only handle Mandarin Chinese data.
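The relationship between the two formats can be sketched as a small conversion: keep the filename, phoneme sequence and phoneme durations, and replace the other fields with the placeholders shown above. `to_midiless` is a hypothetical helper for illustration, not part of the DiffSinger codebase.

```python
def to_midiless(label: str) -> str:
    # Split the seven "|"-separated opencpop fields, keep only the
    # filename, phoneme sequence and phoneme durations, and substitute
    # the placeholders "啊", "rest" and "0" for the dropped fields.
    name, _lyrics, phonemes, _notes, _note_durs, ph_durs, _slur = label.strip().split("|")
    return "|".join([name, "啊", phonemes, "rest", "0", ph_durs, "0"])

full_line = ("2002000043|疯狂|f eng k uang SP|D#4/Eb4 D#4/Eb4 D4 D4 rest|"
             "0.609200 0.609200 2.137240 2.137240 3.058880|"
             "0.16178 0.44742 0.11219 2.02505 3.05888|0 0 0 0 0")
print(to_midiless(full_line))
```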

The reason we only provide easy preparation pipelines for Chinese is that training an English model right now involves too many things, and we cannot provide help with all of them. You would need knowledge of words, syllables, consonants and vowels, musical scores, slurs and rhythms, and you would need an MFA model to align your transcriptions (you can refer directly to the official MFA documentation). You would also need coding and debugging skills to adapt the current MIDI-less dataset preparation pipelines to English, and even if you finished all of this, we still cannot guarantee the network will be suitable for other languages. This issue and this doc may help you understand phoneme systems in singing and the differences between English and Chinese. We currently use this dictionary for Chinese, and you will need something similar to train MFA models and label your data.
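For reference, an MFA pronunciation dictionary is a plain-text file with one word per line, followed by its phoneme sequence, whitespace-separated. A minimal English sketch (these entries and the ARPAbet-style phonemes are illustrative examples, not taken from any dictionary shipped with this repo):

```
singing  S IH NG IH NG
voice    V OY S
score    S K AO R
```

You would need such a dictionary covering your dataset's vocabulary, in a phoneme set consistent with your labels, both to train or use an MFA acoustic model and to label your data.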

Anyway, I will be very glad if you successfully adapt the data preparation pipelines to English. Otherwise, please stay tuned on our development; support for universal phoneme systems will come in the future.

DuQingChen commented 1 year ago

Regarding "pipelines/no_midi_preparation.ipynb where you only need to label lyrics of your voice": what about punctuation symbols in the lyrics, such as ","? For example: "左手握大地右手握着天,掌纹裂出了十方的闪电."

yqzhishen commented 1 year ago

Support for universal dictionaries was implemented in #90, so this issue is to be closed. Please wait for the v2.0.0 release.