p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License
198 stars 28 forks

phoneme question #38

Open yiwei0730 opened 4 months ago

yiwei0730 commented 4 months ago

I would like to know why phonemizer was chosen as the text frontend in the first place. Could mixed-language (Chinese + English) processing be handled by replacing it with bopomofo_to_ipa?
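At its simplest, a bopomofo-to-IPA substitution is a per-symbol table lookup. The sketch below is hypothetical and covers only a handful of symbols for illustration; it is not the interface of any existing `bopomofo_to_ipa` package, and a real table would need the full bopomofo inventory plus tone marks.

```python
# Hypothetical sketch of a bopomofo-to-IPA lookup. The mapping covers only a
# few symbols for illustration; a real conversion would need the complete
# inventory and tone handling.
BOPOMOFO_TO_IPA = {
    "ㄅ": "p",
    "ㄆ": "pʰ",
    "ㄇ": "m",
    "ㄋ": "n",
    "ㄧ": "i",
    "ㄚ": "a",
}

def bopomofo_to_ipa(text: str) -> str:
    """Map each bopomofo symbol to IPA, passing unknown characters through."""
    return "".join(BOPOMOFO_TO_IPA.get(ch, ch) for ch in text)
```

Unknown characters (e.g. English letters in mixed-language input) pass through unchanged, so the same string can carry both scripts.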

p0p4k commented 4 months ago

It can be replaced with anything you like :) I just implemented the easiest plug-and-play phonemizer. If there is something better that you use, please send a PR and add a flag to enable/disable the newer phonemizer. Thanks!
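One common way to add such a flag is a small registry that maps a cleaner name to a function, so a new phonemizer can be switched on from config without touching the call sites. This is a generic sketch; the function and registry names are illustrative and not the repo's actual module layout.

```python
# Hypothetical sketch of a pluggable text-cleaner registry, selected by name.
# Names here are illustrative, not the repo's actual cleaners.
CLEANERS = {}

def register_cleaner(name):
    """Decorator that registers a cleaner function under a string key."""
    def wrap(fn):
        CLEANERS[name] = fn
        return fn
    return wrap

@register_cleaner("lowercase")
def lowercase_cleaner(text):
    # Stand-in for a real phonemizer-based cleaner.
    return text.lower()

def clean_text(text, cleaner="lowercase"):
    """Dispatch to the cleaner selected by the config flag."""
    return CLEANERS[cleaner](text)
```

A new backend (e.g. a bopomofo-based one) would then just be another registered function, selected by a config string.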

yiwei0730 commented 3 months ago

OK, let me think about it. I'm trying HierSpeech at the same time, but the preprocessing with YAAPT to extract F0 is annoying; it takes a lot of time.

yiwei0730 commented 3 months ago

I would like to ask whether mixed-language input will cause any problems in this project. My primary language is Chinese and my secondary language is English.

p0p4k commented 3 months ago

I don't think so; if the model gets a signal indicating the language, it should adapt.

yiwei0730 commented 3 months ago

> I don't think so, if model gets signal to use the language, it should adapt

I'm thinking about whether I can use Bert-VITS2's phoneme set, or else the original Chinese initials + finals + tone numbers. However, there are some differences in the data_utils used, which would need changes: Bert-VITS2 uses tone and language embeddings (tone_emb, language_embedding) in the text encoder. In addition, I think it may be more appropriate to move the data-processing step outside and run preprocess_text first, before training.
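A minimal sketch of what Bert-VITS2-style inputs to a text encoder could look like, assuming phoneme, tone, and language embeddings of equal dimension summed before the encoder (vocabulary sizes and dimension below are illustrative, not taken from either repo):

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sketch: sum phoneme + tone + language embeddings, Bert-VITS2 style.
    Sizes are illustrative placeholders."""

    def __init__(self, n_phonemes=100, n_tones=6, n_langs=2, dim=192):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, dim)
        self.tone_emb = nn.Embedding(n_tones, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)

    def forward(self, phones, tones, langs):
        # All three inputs are (batch, seq_len) integer ID tensors.
        return self.phone_emb(phones) + self.tone_emb(tones) + self.lang_emb(langs)
```

The language embedding is one way to give the model the "signal to use the language" mentioned above; the rest of the encoder can stay unchanged since the summed output has the same shape as a plain phoneme embedding.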

yiwei0730 commented 3 months ago

@p0p4k I'm a little confused now about data processing for multilingual, multi-speaker models. The way this repo handles it is special and different from other repos I have dealt with in the past: there seems to be no text pre-processing (done at load time via cleaners) and no speaker-map setup. Are there any suggestions or scripts for dealing with multiple speakers and multiple languages?

My idea: currently, my multilingual data folder is placed inside pflowtts_pytorch. The plan is to use preprocess_text.py, as Bert-VITS2 does, to handle the text processing in this program. Then, when entering the text module, you only need to read the processed files (you need to create a spk_map yourself to convert speaker names into numbers). Finally, update the Python files for training, the speech prompt, and pflowtts.

I don't know whether I've overlooked anything, or whether there's something else I should ask advice about.
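The spk_map step described above can be sketched as a simple first-seen-index mapping built from a filelist. The "path|speaker|text" line layout assumed here is hypothetical (Bert-VITS2-style), not necessarily this repo's format.

```python
# Hypothetical sketch: build a speaker-name -> integer-ID map from filelist
# lines of the form "path|speaker|text" (layout assumed, not the repo's).
def build_spk_map(lines):
    spk_map = {}
    for line in lines:
        spk = line.split("|")[1]
        if spk not in spk_map:
            spk_map[spk] = len(spk_map)  # assign IDs in first-seen order
    return spk_map
```

The resulting dict would be saved alongside the filelists so training and inference agree on speaker IDs.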

p0p4k commented 3 months ago

@yiwei0730 You can use the Bert-VITS2 dataloader and modify part of pflow to consume it directly. Throw away the audio, since we do not do end-to-end training; we just need the spectrograms.
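A hypothetical sketch of that adaptation, dropping the waveform from each dataset item (the tuple layout and the pflow-side field names are assumptions for illustration, not taken from either repo):

```python
# Hypothetical adapter: take a Bert-VITS2-style dataset item and keep only
# what pflow's training step needs. Tuple layout and field names are assumed.
def adapt_item(item):
    text, spec, wav, sid = item  # (text IDs, spectrogram, waveform, speaker ID)
    del wav  # no end-to-end vocoding here, so the raw audio is not needed
    return {"x": text, "y": spec, "spks": sid}
```

Wrapping the existing dataset's `__getitem__` with such an adapter avoids rewriting the loader itself; only the collate function and the training step need to agree on the new field names.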