ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License
1.69k stars · 515 forks

Model cannot fit the data, and the test voice is too bad when I use the paper configuration #193

Open hhm853610070 opened 1 year ago

hhm853610070 commented 1 year ago

I removed the PostNet (deleted the PostNet-related code from the model and the loss), set pitch_quantization: "log", set the pitch and energy features to "frame_level" with normalization: False, and kept the rest of the configuration the same as the project. I used the model (trained for 900k steps) to synthesize speech and got poor results: many of the synthesized audios contain wrong words and noise. I don't know whether any other code needs to be modified.
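For readers trying to reproduce this, here is a minimal sketch of a FastSpeech 2 style loss with the PostNet term dropped, which is roughly what the change described above amounts to. The class and argument names are illustrative assumptions, not the repo's exact code:

```python
import torch.nn as nn

# Illustrative sketch only: a FastSpeech 2 style loss without the PostNet mel
# term, matching the modification described above. All names are hypothetical.
class FastSpeech2LossNoPostnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.mae_loss = nn.L1Loss()
        self.mse_loss = nn.MSELoss()

    def forward(self, mel_pred, mel_target,
                pitch_pred, pitch_target,
                energy_pred, energy_target,
                log_duration_pred, log_duration_target):
        mel_loss = self.mae_loss(mel_pred, mel_target)  # decoder mel only
        pitch_loss = self.mse_loss(pitch_pred, pitch_target)
        energy_loss = self.mse_loss(energy_pred, energy_target)
        duration_loss = self.mse_loss(log_duration_pred, log_duration_target)
        # No postnet_mel_loss term here, since the PostNet branch was removed.
        total_loss = mel_loss + pitch_loss + energy_loss + duration_loss
        return total_loss, mel_loss, pitch_loss, energy_loss, duration_loss
```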

The configuration is as follows:

(1) preprocess.yaml:

```yaml
preprocessing:
  val_size: 512
  text:
    text_cleaners: []
    language: "zh"
  audio:
    sampling_rate: 22050
    max_wav_value: 32768.0
  stft:
    filter_length: 1024
    hop_length: 256
    win_length: 1024
  mel:
    n_mel_channels: 80
    mel_fmin: 0
    mel_fmax: 8000 # please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder
  pitch:
    feature: "frame_level" # support 'phoneme_level' or 'frame_level'
    normalization: False
  energy:
    feature: "frame_level" # support 'phoneme_level' or 'frame_level'
    normalization: False
```
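For context, "frame_level" with normalization: False means one raw F0/energy value per STFT frame (aligned with the mel frames via hop_length), rather than per-phoneme averaged and z-normalized values. A rough sketch of frame-level pitch extraction, assuming a pyworld-based extractor like the one the preprocessor uses (an illustration, not the exact preprocessing code):

```python
import numpy as np
import pyworld as pw

def frame_level_pitch(wav, sampling_rate=22050, hop_length=256):
    """Extract one raw (unnormalized) F0 value per STFT frame."""
    # Frame period in milliseconds so the pitch frames line up with mel frames.
    frame_period = hop_length / sampling_rate * 1000
    wav64 = wav.astype(np.float64)
    f0, t = pw.dio(wav64, sampling_rate, frame_period=frame_period)
    f0 = pw.stonemask(wav64, f0, t, sampling_rate)
    # With normalization: False, these raw Hz values are what the
    # log-quantized pitch bins in model.yaml are built from.
    return f0
```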

(2) model.yaml:

```yaml
transformer:
  encoder_layer: 4
  encoder_head: 2
  encoder_hidden: 256
  decoder_layer: 4
  decoder_head: 2
  decoder_hidden: 256
  conv_filter_size: 1024
  conv_kernel_size: [9, 1]
  encoder_dropout: 0.2
  decoder_dropout: 0.2

variance_predictor:
  filter_size: 256
  kernel_size: 3
  dropout: 0.5

variance_embedding:
  pitch_quantization: "log" # support 'linear' or 'log', 'log' is allowed only if the pitch values are not normalized during preprocessing
  energy_quantization: "linear" # support 'linear' or 'log', 'log' is allowed only if the energy values are not normalized during preprocessing
  n_bins: 256

multi_speaker: False

max_seq_len: 1000

vocoder:
  model: "HiFi-GAN" # support 'HiFi-GAN', 'MelGAN'
  speaker: "universal" # support 'LJSpeech', 'universal'
```
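One consequence of these settings: with pitch_quantization: "log", the pitch embedding bins are spaced logarithmically, which only makes sense for raw (non-negative) pitch values, hence normalization: False in preprocess.yaml, since z-normalized values go negative. A minimal sketch of that bin construction (variable names are assumptions, following the usual variance-adaptor approach):

```python
import numpy as np
import torch

def build_pitch_bins(pitch_min, pitch_max, n_bins=256, quantization="log"):
    """Boundaries used to bucketize pitch values into embedding indices."""
    if quantization == "log":
        # Needs pitch_min > 0, i.e. unnormalized pitch values.
        return torch.exp(torch.linspace(np.log(pitch_min), np.log(pitch_max), n_bins - 1))
    # "linear" also works with normalized (possibly negative) values.
    return torch.linspace(pitch_min, pitch_max, n_bins - 1)

# Usage: indices into a 256-entry pitch embedding table
# idx = torch.bucketize(pitch_frames, build_pitch_bins(pitch_min, pitch_max))
```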

(3) train.yaml:

```yaml
optimizer:
  batch_size: 16
  betas: [0.9, 0.98]
  eps: 0.000000001
  weight_decay: 0.0
  grad_clip_thresh: 1.0
  grad_acc_step: 1
  warm_up_step: 4000
  anneal_steps: [300000, 400000, 500000]
  anneal_rate: 0.3
step:
  total_step: 900000
  log_step: 100
  synth_step: 1000
  val_step: 1000
  save_step: 50000
```
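For reference, warm_up_step, anneal_steps, and anneal_rate describe the usual Transformer-style learning-rate schedule: ramp up for 4000 steps, decay roughly as step^-0.5, and multiply by 0.3 each time training passes 300k, 400k, and 500k steps. A small sketch of that schedule, assuming it follows the reference ScheduledOptim behavior (the exact d_model scaling is an assumption):

```python
import numpy as np

def learning_rate(step, d_model=256, warm_up_step=4000,
                  anneal_steps=(300000, 400000, 500000), anneal_rate=0.3):
    """Transformer-style LR with warm-up and step-wise annealing (step >= 1)."""
    lr = np.power(d_model, -0.5) * min(np.power(step, -0.5),
                                       step * np.power(warm_up_step, -1.5))
    for s in anneal_steps:
        if step > s:
            lr *= anneal_rate
    return lr

# e.g. learning_rate(4000) is the peak LR; learning_rate(600000) has been
# annealed three times (0.3 ** 3) on top of the step^-0.5 decay.
```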

The loss curves: [six training-curve screenshots attached in the original issue]

The loss curves have been oscillating.

hhm853610070 commented 1 year ago

I have tried that configuration but the result is terrible. There are many noises and wrong pronunciations in the audio at inference time, but the audio synthesized during validation in training is good. Does anyone know why there is such a big difference?

aidosRepoint commented 1 year ago

> I have tried that configuration but the result is terrible. There are many noises and wrong pronunciations in the audio at inference time, but the audio synthesized during validation in training is good. Does anyone know why there is such a big difference?

Have you found the answer? I have the same issue. I downloaded the audios from the TensorBoard audio tab; both the reconstructed and the synthesized audios sound very good. However, my inference sounds bad, especially on phrases with more than two words.
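If it helps others compare, here is one way to dump the logged audio programmatically instead of clicking through the TensorBoard UI. This is a rough sketch; the log path and tag names depend on your run, so adjust them to whatever ea.Tags() reports:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def dump_tensorboard_audio(log_dir, out_prefix="tb_audio"):
    """Save every logged audio sample from a TensorBoard log dir as .wav files."""
    ea = EventAccumulator(log_dir, size_guidance={"audio": 0})  # 0 = keep all events
    ea.Reload()
    for tag in ea.Tags().get("audio", []):
        for i, event in enumerate(ea.Audio(tag)):
            path = f"{out_prefix}_{tag.replace('/', '_')}_{i}.wav"
            with open(path, "wb") as f:
                f.write(event.encoded_audio_string)  # already WAV-encoded bytes

# dump_tensorboard_audio("./output/log/my_dataset/val")  # hypothetical log path
```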

asarsembayev commented 1 year ago

> I have tried that configuration but the result is terrible. There are many noises and wrong pronunciations in the audio at inference time, but the audio synthesized during validation in training is good. Does anyone know why there is such a big difference?

I carefully checked the validation stage during the training process. It seems to me that the "synthesized" audio there is not really fully synthesized; it is closer to a reconstruction.
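That would explain the gap: during training/validation the model is typically fed the ground-truth duration, pitch, and energy targets (teacher forcing), so the logged "synthesized" sample behaves almost like a reconstruction, while synthesize.py has to rely entirely on the predicted variances. A conceptual sketch of the two call patterns (the keyword names follow the reference forward signature but should be treated as assumptions):

```python
def validation_forward(model, batch):
    """Teacher-forced pass: ground-truth variances are passed in, so duration,
    pitch, and energy prediction errors never show up in the logged sample."""
    (speakers, texts, src_lens, max_src_len,
     mels, mel_lens, max_mel_len, pitches, energies, durations) = batch
    return model(speakers, texts, src_lens, max_src_len,
                 mels=mels, mel_lens=mel_lens, max_mel_len=max_mel_len,
                 p_targets=pitches, e_targets=energies, d_targets=durations)

def inference_forward(model, speakers, texts, src_lens, max_src_len):
    """Pure inference, as in synthesize.py: the model must rely entirely on
    its own predicted durations, pitch, and energy."""
    return model(speakers, texts, src_lens, max_src_len)
```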