p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License
198 stars, 28 forks

about zero-shot inference #37

Open 0913ktg opened 4 months ago

0913ktg commented 4 months ago

Hello p0p4k,

I'm reaching out to you again with a question.

Thanks to your great help, I've successfully trained and inferred the Korean pflow model. During the inference process, I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.

I used data from about 3,000 male and female speakers, only utilizing voice files longer than 4.1 seconds. I conducted distributed training with a batch size of 64 on 4 NVIDIA A100 40G GPUs, completing 160 epochs (500k steps).

However, when synthesizing voices using unseen speakers' voices as prompts, I found that while the voice content is well synthesized, the speakers' voices are not applied to the synthesized sound.

This phenomenon was observed for both male and female speakers, and the inference code was written by referring to synthesis.ipynb (it is almost identical).

I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.

If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.

Thank you.
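
For reference, a minimal sketch of the kind of duration filter described above (the filelist format and the soundfile-based check are assumptions, not the exact preprocessing script used):

```python
# Minimal sketch: keep only clips longer than a minimum duration (e.g. 4.1 s).
# Assumes a pipe-separated filelist of "wav_path|transcript" lines; adjust as needed.
import soundfile as sf

MIN_SECONDS = 4.1

def filter_filelist(in_path: str, out_path: str) -> None:
    kept = []
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            wav_path = line.strip().split("|")[0]
            info = sf.info(wav_path)  # reads the header only, no full decode
            if info.frames / info.samplerate > MIN_SECONDS:
                kept.append(line)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(kept)

filter_filelist("filelists/train.txt", "filelists/train_over_4.1s.txt")
```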

p0p4k commented 4 months ago

So that means the speech prompt encoder is not extracting exact speaker style info, but rather memorizing the seen speaker information. We might have to change the architecture a little bit in that case.

0913ktg commented 4 months ago

Thank you for your response. I will try to modify it to extract speaker characteristics in comparison with the content of the paper.

If I achieve good results, I will make sure to share them with you. Thank you once again for your invaluable help. Have a great day!

p0p4k commented 4 months ago

Maybe take a look at audiobox/voicebox architecture as well.

yiwei0730 commented 4 months ago

> Maybe take a look at audiobox/voicebox architecture as well.

Maybe the NaturalSpeech 2 speech prompt encoder can help? I'm not sure it would really be useful, though. Do you have any suggestions?

NS2 Speech Prompt Encoder:

| Hyperparameter | Value |
| --- | --- |
| Transformer Layers | 6 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Conv1D Filter Size | 2048 |
| Conv1D Kernel Size | 9 |
| Dropout | 0.2 |
| Parameters | 69M |

P-Flow Speech-prompted Text Encoder:

| Hyperparameter | Value |
| --- | --- |
| Phoneme Embedding Dim | 192 |
| PreNet Conv Layers | 3 |
| PreNet Hidden Dim | 192 |
| PreNet Kernel Size | 5 |
| PreNet Dropout | 0.5 |
| Transformer Layers | 6 |
| Transformer Hidden Dim | 192 |
| Transformer Feed-forward Hidden Dim | 768 |
| Transformer Attention Heads | 2 |
| Transformer Dropout | 0.1 |
| Prompt Embedding Dim | 192 |
| Number of Parameters | 3.37M |
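
For easier comparison, the same two configurations written out as plain dataclasses (the field names are mine; only the values come from the tables above):

```python
# Rough side-by-side of the two prompt encoder configurations above.
from dataclasses import dataclass

@dataclass
class NS2SpeechPromptEncoderCfg:
    transformer_layers: int = 6
    attention_heads: int = 8
    hidden_size: int = 512
    conv1d_filter_size: int = 2048
    conv1d_kernel_size: int = 9
    dropout: float = 0.2
    # ~69M parameters

@dataclass
class PFlowSpeechPromptedTextEncoderCfg:
    phoneme_embedding_dim: int = 192
    prenet_conv_layers: int = 3
    prenet_hidden_dim: int = 192
    prenet_kernel_size: int = 5
    prenet_dropout: float = 0.5
    transformer_layers: int = 6
    transformer_hidden_dim: int = 192
    transformer_ffn_hidden_dim: int = 768
    transformer_attention_heads: int = 2
    transformer_dropout: float = 0.1
    prompt_embedding_dim: int = 192
    # ~3.37M parameters
```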

p0p4k commented 4 months ago

Yes, it can help. We can yank lucidrains's code. Can you do a PR?

yiwei0730 commented 4 months ago

I'm glad you think this solution is useful, but I work for a company and can't upload the code to GitHub. I will report back when I have any new progress.

0913ktg commented 4 months ago

Hi yiwei0730. Thank you for your advice. I'll do some testing and share with you. Thank you.

0913ktg commented 4 months ago

Hello p0p4k, yiwei0730,

I have incorporated the prompt encoder part from the 'https://github.com/adelacvg/NS2VC' repository to extract prompt features for the text encoder.

The reason I chose this model is that it uses mel spectrograms for training, as opposed to ns2, which utilizes codec representations.

I plan to conduct two experiments: one adding the prompt encoder to the model structure mentioned in the paper, and another incorporating it into p0p4k's structure.

I will share the results once they are available.
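
Roughly, the integration looks like the sketch below; all module names and dimensions are placeholders and this is not the NS2VC code itself, just the shape of the idea (a mel-spectrogram prompt encoder whose output is prepended to the phoneme embeddings before the shared transformer):

```python
# Illustrative sketch only: mel prompt encoder feeding a speech-prompted text encoder.
import torch
import torch.nn as nn

class MelPromptEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden_dim=192, n_layers=6, n_heads=2, ff_dim=768, dropout=0.1):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, hidden_dim, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel_prompt):                    # (B, n_mels, T_prompt)
        x = self.proj(mel_prompt).transpose(1, 2)     # (B, T_prompt, hidden_dim)
        return self.encoder(x)                        # (B, T_prompt, hidden_dim)

class SpeechPromptedTextEncoder(nn.Module):
    def __init__(self, n_phonemes=178, hidden_dim=192):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, hidden_dim)
        self.prompt_encoder = MelPromptEncoder(hidden_dim=hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=2, dim_feedforward=768,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, phonemes, mel_prompt):
        prompt = self.prompt_encoder(mel_prompt)      # (B, T_prompt, H)
        text = self.phoneme_emb(phonemes)             # (B, T_text, H)
        x = torch.cat([prompt, text], dim=1)          # prompt tokens prepended to text
        h = self.encoder(x)
        return h[:, prompt.size(1):]                  # keep only the text positions
```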

yiwei0730 commented 4 months ago

If you need training assistance, I can provide some support. I also think a zero-shot model trained with less data is a good development strategy. In addition, I have trained on 1,500 hours of data (zh+en) with the NS2 GitHub repo you are using. The similarity is not good enough when prompting with unseen speakers (MOS = 4, but SMOS = 3). I am not sure if that is because the codec was not trained as the first step, as in the original training.

0913ktg commented 4 months ago

Hello,

I have conducted an experiment by adding the ns2 prompt encoder to the P-Flow text encoder. This was applied to both the structure provided by p0p4k and the one presented in the paper, with some noticeable differences observed.

When adding the ns2 prompt encoder to the paper's structure, there was a significant improvement in the clarity of the mel-spectrogram. There was a notable reduction in noise, and the frequency values were more distinct. However, there is an issue where the output is always in a female voice, regardless of the gender of the prompt voice (even when male voice prompts are used).

On the other hand, adding the ns2 prompt encoder to p0p4k's structure resulted in relatively more noise than adding it to the structure from the paper. Additionally, the same phenomenon as before was observed, with the output being a male voice.

In conclusion, it seems necessary to continue experimenting with the addition of the prompt encoder to the text encoder structure described in the paper and adjusting the parameters accordingly.

If there are any other models you think would be worth trying, please feel free to share.

Thank you.

p0p4k commented 4 months ago

Interesting 🤔

0913ktg commented 4 months ago

Is zero-shot TTS possible with this model?

p0p4k commented 4 months ago

In the pflow blog, the authors say it is possible if we use more data and a bigger model.

yiwei0730 commented 4 months ago

How much data is needed to train the model well enough for ZS-TTS? For example, would 2k hours of Chinese+English data be enough?

p0p4k commented 4 months ago

Not sure, because they didn't give the exact data they used. The Audiobox paper uses around 60k hours?

0913ktg commented 4 months ago

The Korean data I used for training is 1186 hours.

yiwei0730 commented 4 months ago

> Data: We train P-Flow on LibriTTS [41]. LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256 hours subset. For evaluation, we follow the experiments in [37, 19] and use LibriSpeech test-clean, assuring no overlap exists with our training data. We resample all datasets to 22kHz.

I saw this in the paper. It is just 580 hours.

yiwei0730 commented 4 months ago

> In the pflow blog, the authors say it is possible if we use more data and a bigger model.

Where can I find that blog/reply?

0913ktg commented 4 months ago

The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data.

yiwei0730 commented 4 months ago

> The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data.

Right! That's why I'm following this paper. I found it is the only model that claims it does not need a large dataset for zero-shot, but I don't know whether that is because it uses an English dataset, which may overstate its capability. I find that languages similar to English often achieve good results, while East Asian languages such as Chinese, Japanese, and Korean do not seem to do as well (for example, with SeamlessM4T).

p0p4k commented 4 months ago

https://pflow-demo.github.io/projects/pflow/
https://openreview.net/forum?id=zNA7u7wtIN

0913ktg commented 4 months ago

I can't play the demo audio. Can you play it, p0p4k?

yiwei0730 commented 4 months ago

I have seen this website, but the audio files cannot be played. His reply on OpenReview seems to mean that more data gives better results, but a basic amount of data should still give baseline-level performance.

p0p4k commented 4 months ago

Audio files could be played when they released the paper. If this repo doesn't give great results right now, all we can do is change the speech prompt encoder and train for longer.

p0p4k commented 4 months ago

@0913ktg can you add positional embeddings to the speech_prompt_text_encoder before the transformer layers? I think I missed that part. Please send a PR and I will approve it. Thanks!
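
Roughly what I have in mind (sinusoidal positions added to the prompt+text sequence before the transformer stack); the attribute names below are just placeholders, not the repo's exact code:

```python
# Illustrative sketch only: sinusoidal positional embeddings before the transformer.
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEmbedding(nn.Module):
    def __init__(self, dim: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                              # x: (B, T, dim)
        return x + self.pe[: x.size(1)].unsqueeze(0)   # add positions up to length T

# usage inside a speech-prompted text encoder (hypothetical attribute names):
# x = torch.cat([prompt_hidden, text_hidden], dim=1)   # (B, T_prompt + T_text, H)
# x = self.pos_emb(x)                                  # positions before the transformer
# x = self.transformer(x)
```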

yiwei0730 commented 4 months ago

> I have conducted an experiment by adding the ns2 prompt encoder to the P-Flow text encoder. […]

I would like to ask: what do you think the SMOS and MOS are after adding PE to the training? Can you get the zero-shot (ZS) method working? And what about adding a fine-tuning step with less data?

p0p4k commented 4 months ago

I added PE in a recent push, and it seems to give better ZS results. I would like to see you guys train and report as well.

yiwei0730 commented 3 months ago

> Hello p0p4k, I'm reaching out to you again with a question. […]

@0913ktg sorry to bother you, I would like to ask if you can upload some synthesized audio files so that I can listen to the quality.

rishikksh20 commented 3 months ago

@0913ktg Have you tried the NS2VC prompt encoder with P-Flow?