Open yiwei0730 opened 4 months ago
It can be replaced with anything you like :) I just implemented the simplest plug-and-play phonemizer as a baseline. If there is something better that you use, please send a PR and add a flag to enable/disable the newer phonemizer. Thanks!
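The "flag to use/not use the newer phonemizer" idea can be sketched as a small dispatch function. This is a hypothetical illustration only: the function and flag names (`basic_phonemizer`, `newer_phonemizer`, `use_new_phonemizer`) are not from the repo, and the bodies are placeholders standing in for real backends.

```python
# Hypothetical sketch of a plug-and-play phonemizer selected by a flag.
# Names and behavior are illustrative, not taken from pflowtts_pytorch.

def basic_phonemizer(text: str) -> list:
    # Placeholder for the existing simple phonemizer: one symbol per character.
    return list(text.lower())

def newer_phonemizer(text: str) -> list:
    # Placeholder for a swapped-in backend (e.g. an IPA-based phonemizer).
    return [t + "_ipa" for t in text.lower().split()]

def phonemize(text: str, use_new_phonemizer: bool = False) -> list:
    # The flag picks the backend; the rest of the pipeline is untouched.
    backend = newer_phonemizer if use_new_phonemizer else basic_phonemizer
    return backend(text)

print(phonemize("hi"))                                   # ['h', 'i']
print(phonemize("hi there", use_new_phonemizer=True))    # ['hi_ipa', 'there_ipa']
```

A real PR would replace the placeholder bodies with calls into an actual phonemizer library, keeping the same flag-based dispatch.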
OK, let me think about it. I'm trying HierSpeech at the same time, but the preprocessing with YAAPT to extract_F0 is annoying; it takes so much time.
I would like to ask whether mixed-language input will cause any problems in this project. My primary language is Chinese and my secondary language is English.
I don't think so; if the model gets a signal indicating the language, it should adapt.
I'm thinking about whether I can use Bert-VITS2's phonemes, or else the original Chinese initial+final+tone_number scheme. However, there are some differences in the data_utils used that would need to be adapted: Bert-VITS2 uses tone_emb and language_embedding in its text encoder. In addition, I also think it may be more appropriate to move the data processing out of the training loop and run preprocess_text first before training.
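The "initial+final+tone_number" scheme mentioned above can be illustrated as splitting tone-numbered pinyin syllables into a phoneme sequence plus a parallel tone sequence (Bert-VITS2 feeds tones to a separate tone_emb). This is a toy sketch: the initial inventory is a small subset and the function names are illustrative.

```python
# Illustrative sketch only: split pinyin syllables like "ni3" into
# initial + final phonemes plus a parallel tone sequence.
# The initials listed here are a subset for demonstration.
INITIALS = {"zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "z", "c", "s", "r", "y", "w"}

def split_pinyin(syllable):
    # Trailing digit (if any) is the tone; 0 stands for "no tone marked".
    tone = int(syllable[-1]) if syllable[-1].isdigit() else 0
    base = syllable.rstrip("0123456789")
    # Try longer initials first so "zh" is matched before "z".
    for init in sorted(INITIALS, key=len, reverse=True):
        if base.startswith(init) and len(base) > len(init):
            phones = [init, base[len(init):]]
            break
    else:
        phones = [base]
    return phones, [tone] * len(phones)

def pinyin_to_sequences(syllables):
    phones, tones = [], []
    for s in syllables:
        p, t = split_pinyin(s)
        phones += p
        tones += t
    return phones, tones

print(pinyin_to_sequences(["ni3", "hao3"]))
# (['n', 'i', 'h', 'ao'], [3, 3, 3, 3])
```

The phoneme IDs and tone IDs would then go through separate embedding tables in the text encoder, as Bert-VITS2 does.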
@p0p4k I'm a little confused about the data processing for multilingual speakers. The way this repo works is special and different from other repos I have dealt with in the past: there seems to be no text pre-processing (done in Text using cleaners) and no speaker-map setup. Are there any suggestions or scripts for dealing with multiple speakers and multiple languages?
My idea: currently my multilingual data folder is dropped into pflowtts_pytorch. The plan is to use preprocess_text.py (the one Bert-VITS2 uses) to handle text processing in this project. When entering the text module, it then only needs to read the processed text (you need to create a spk_map yourself to convert speaker names into numbers). Finally, adapt the train, speech-prompt, and pflowtts Python files.
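Building the spk_map can be sketched as follows, assuming a filelist whose lines look like `wav_path|speaker_name|text` (the actual column layout in your filelists may differ, and `build_spk_map` is a hypothetical helper name):

```python
# Hypothetical sketch: derive a speaker-name -> integer-ID map from a
# "wav_path|speaker|text" filelist. Column order is an assumption.
import json

def build_spk_map(lines):
    speakers = sorted({line.split("|")[1] for line in lines if line.strip()})
    return {name: idx for idx, name in enumerate(speakers)}

filelist = [
    "data/a.wav|spk_cn_01|你好",
    "data/b.wav|spk_en_01|hello",
    "data/c.wav|spk_cn_01|再见",
]
spk_map = build_spk_map(filelist)
print(spk_map)  # {'spk_cn_01': 0, 'spk_en_01': 1}
# json.dump(spk_map, open("spk_map.json", "w"))  # save it so the dataloader can reuse it
```

Sorting before enumerating keeps the IDs deterministic across runs, which matters when you resume training from a checkpoint.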
I don't know whether there is anything I have overlooked, or anything else I should ask for advice on.
@yiwei0730 you can use the Bert-VITS2 dataloader and modify some parts of pflow to use it directly; throw away the raw audio, since we do not train end-to-end and just need spectrograms.
I would like to know why phonemizer was chosen in the first place. Also, can the mixed-language (Chinese+English) processing be replaced by bopomofo_to_ipa?
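A bopomofo-to-IPA conversion is essentially a symbol-mapping table. The sketch below shows the idea with only a handful of symbols; the IPA values are my own reading and should be checked against a proper reference before use, and the table is far from complete.

```python
# Toy illustration of a bopomofo -> IPA lookup; only a few symbols are
# mapped, and the IPA values should be verified before real use.
BOPOMOFO_TO_IPA = {
    "ㄅ": "p",    # unaspirated bilabial stop
    "ㄆ": "pʰ",   # aspirated bilabial stop
    "ㄇ": "m",
    "ㄚ": "a",
    "ㄠ": "au",
}

def bopomofo_to_ipa(symbols):
    # Unknown symbols pass through unchanged rather than being dropped,
    # so tone marks and punctuation survive the conversion.
    return "".join(BOPOMOFO_TO_IPA.get(s, s) for s in symbols)

print(bopomofo_to_ipa("ㄇㄚ"))  # ma
```

Mapping everything into IPA would give Chinese and English a shared symbol space, which is one common way to handle code-switched input in a single text encoder.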