jjery2243542 / adaptive_voice_conversion

About the number of mel-scale spectrogram bin #23

Open sbkim052 opened 4 years ago

sbkim052 commented 4 years ago

I found that most vocoders and other TTS models use 80 mel-spectrogram channels.

In this work, the model uses 512 channels.

Why does this model use 512 channels, far more than other TTS and vocoder models?
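For reference, the bin count is just the `n_mels` size of the mel filterbank applied after the STFT. A minimal librosa sketch of the difference (illustrative only; the file name, sample rate, and STFT settings below are placeholders, not this repo's actual preprocessing):

```python
import librosa

# Placeholder file and parameters, for illustration only.
wav, sr = librosa.load("sample.wav", sr=22050)

# 512 mel bins, as in this repo:
mel_512 = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                         hop_length=256, n_mels=512)

# 80 mel bins, as in most TTS/vocoder pipelines:
mel_80 = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                        hop_length=256, n_mels=80)

print(mel_512.shape, mel_80.shape)  # (512, T) vs (80, T)
```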

xuexidi commented 3 years ago

I also have this question. I want to swap out One-Shot's vocoder, but I found a channel mismatch, which means I either have to train my NN vocoder with 512 channels or change One-Shot's channel count to 80 to fit the vocoder.

sbkim052 commented 3 years ago

Hi @xuexidi, I actually changed this model's channel count to 80 to match my NN vocoder and trained on VCTK. It works fine with 80 channels.

xuexidi commented 3 years ago

@sbkim052 Thank you very much for your reply! I still have some questions. I want to use WaveRNN as the NN vocoder for One-Shot, but I am not sure how much the sound quality can improve. Specifically: if I want to use WaveRNN as the vocoder, do I need to adjust the format of the mel spectrogram that One-Shot outputs? Or can One-Shot's output mel spectrogram be used directly as the input feature for training the WaveRNN model?

PS: I am a newbie in the speech field, so I hope you can help me. Thank you very much!

sbkim052 commented 3 years ago

Hi @xuexidi

I tried two methods:

  1. training the NN vocoder on the output of the one-shot VC
  2. training the one-shot VC with the mel-spectrogram format that was used to train the NN vocoder

I also suggest trying the Universal Vocoder, which is based on WaveRNN, and MelGAN; both support multiple speakers.

xuexidi commented 3 years ago

Hi @sbkim052, thank you for replying so quickly! Based on the above, I have two questions:

  1. You mentioned "training the NN vocoder on the output of the one-shot VC" and "training the one-shot VC with the mel-spectrogram format that was used to train the NN vocoder." Which method achieves better results?
  2. Does WaveRNN not work well as a vocoder for One-Shot? If so, I will consider replacing it with MelGAN or something similar.

Thank you very much for your suggestions, and please take the time to answer my questions!

sbkim052 commented 3 years ago

@xuexidi No problem :)

There were tradeoffs with both methods, but for me, training the one-shot model with the mel format that was used to train the NN vocoder worked better.

WaveRNN worked fine, but MelGAN performed better for me (subjectively).

xuexidi commented 3 years ago

Hi @sbkim052, gratitude beyond words! This is my last question. You mentioned "training the one-shot model with the mel format that was used to train the NN vocoder":

Does this mean we should keep One-Shot's steps for converting a wav into a mel spectrogram consistent with the vocoder's?

For example, One-Shot's data preprocessing is: step 1: pre-emphasis; step 2: STFT; step 3: convert the magnitude spectrogram into a mel spectrogram; step 4: normalize the mel spectrogram.

WaveRNN's data preprocessing is: step 1: rescale the wav (wav = wav / max(wav)); step 2: STFT; step 3: normalize the mel spectrogram.

If I want to train the one-shot model with the mel format that was used to train the NN vocoder, should I change One-Shot's preprocessing steps to be the same as WaveRNN's? I am not sure whether this will affect One-Shot's performance...

Similarly, should parameters such as n_fft, hop_length, and win_length in One-Shot be set to match the vocoder's?

Thank you very much, you are so kind!

sbkim052 commented 3 years ago

Hi @xuexidi, exactly. Converting the wav into a mel spectrogram should be consistent; training One-Shot with the same mel format used by the vocoder is quite important. The whole preprocessing pipeline, and hyperparameters such as n_fft, hop_length, win_length, the number of mel bins, etc., should be consistent with the vocoder's.

I think there are two cases: if you want to use a pre-trained vocoder, you should train One-Shot with the mel format that was used for the vocoder; if you want to use a pre-trained One-Shot, you should train the vocoder with the mel format that was used for One-Shot.

I'm not sure the performance will improve dramatically, but after a lot of trial and error, it did improve in my case :)
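A minimal sketch of what "consistent" means in practice, in case it helps: one shared extraction function used to prepare training data for both One-Shot and the vocoder (the parameter values below are placeholders, not this repo's actual configuration):

```python
import librosa
import numpy as np

# Shared settings; the values are placeholders, not this repo's actual config.
# The point is that BOTH One-Shot and the vocoder are trained on mels
# produced by this one function.
MEL_PARAMS = dict(sr=22050, n_fft=1024, hop_length=256, win_length=1024,
                  n_mels=80, fmin=0, fmax=8000)

def wav_to_mel(path):
    wav, _ = librosa.load(path, sr=MEL_PARAMS["sr"])
    wav = wav / np.abs(wav).max()  # peak-normalize the waveform
    mel = librosa.feature.melspectrogram(
        y=wav, sr=MEL_PARAMS["sr"],
        n_fft=MEL_PARAMS["n_fft"],
        hop_length=MEL_PARAMS["hop_length"],
        win_length=MEL_PARAMS["win_length"],
        n_mels=MEL_PARAMS["n_mels"],
        fmin=MEL_PARAMS["fmin"],
        fmax=MEL_PARAMS["fmax"],
    )
    # Log-compress; apply the same normalization for both models.
    return np.log(np.clip(mel, 1e-5, None))
```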

xuexidi commented 3 years ago

Hi @sbkim052, that's very kind of you! Today I trained One-Shot on mel spectrograms in the WaveRNN vocoder's mel format, then used the pre-trained WaveRNN model as the vocoder to generate audio. The resulting audio quality is indeed better than Griffin-Lim, but the generated audio still has a little noise. So I will try to train a vocoder myself, using the converted mel spectrograms from One-Shot as input and the source audio as the target. I hope it turns out well!

In addition, I haven't tried the MelGAN vocoder yet. I wonder whether a pre-trained MelGAN model will perform better than WaveRNN? I want to use pre-trained models as much as possible, because vocoder training seems very time-consuming...

sbkim052 commented 3 years ago

Yes, it is very time-consuming. I only tried training WaveRNN, not MelGAN, but I prefer using the pre-trained MelGAN; it worked best for me (with still some room to improve). It would be a pleasure if you shared your results :)

xuexidi commented 3 years ago

@sbkim052 Okay, as soon as the experiment yields good results, I will share my results and technical details with you!

By the way, did you train on a Korean dataset or an English dataset?

I guess the pre-trained WaveRNN model was trained on an English dataset, which may be why there is noise in the speech I generate.

Also, which open-source MelGAN GitHub project are you using (one that includes pre-trained models)? Thank you very much, and God bless you!

sbkim052 commented 3 years ago

@xuexidi I trained only on the VCTK (English) dataset, since there isn't a large enough Korean dataset :(

Check out the link below; this is the official implementation I used: https://github.com/descriptinc/melgan-neurips

Via PyTorch Hub, you can use the pre-trained MelGAN in four or five lines.
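For reference, a rough sketch of that Hub usage (assuming the `load_melgan` entry point and `inverse` method described in that repo's README; the input must be the 80-bin log-mel format MelGAN was trained on):

```python
import torch

# Load the pre-trained MelGAN vocoder via PyTorch Hub.
vocoder = torch.hub.load("descriptinc/melgan-neurips", "load_melgan")

# Placeholder mel input: (batch, 80, frames), in MelGAN's expected format.
# In practice this would be One-Shot's converted output.
mel = torch.randn(1, 80, 200)

with torch.no_grad():
    audio = vocoder.inverse(mel)  # -> (batch, samples) waveform
```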

xuexidi commented 3 years ago

@sbkim052 Thank you very much, you are my savior! I will share the technical details with you as soon as I get better experimental results!

Merlin-721 commented 3 years ago

Is it possible to train with 512 bins at the model input and 80 bins at the output?