jxzhanggg / nonparaSeq2seqVC_code

Implementation code of non-parallel sequence-to-sequence VC
MIT License

Training the model for a different language #36

Open ivancarapinha opened 4 years ago

ivancarapinha commented 4 years ago

Hello @jxzhanggg, first of all, thank you for your helpful replies to the previous issues I posted. I would like to adapt this voice conversion model to European Portuguese. The problem is that I do not have a data set as large as VCTK in terms of number of utterances per speaker, but I do have enough training data for at least 5-6 speakers (more than 500 utterances per speaker), sampled at 16 kHz. I tried several configurations, with batch sizes of 8, 16, and 32 for pre-training, but never managed to generate intelligible speech (the decoder alignments did not converge).

I changed the phonemizer backend in extract_features.py from Festival to Espeak so that I could obtain phoneme transcriptions in Portuguese. I noticed that the total number of distinct phonemes increased substantially, from 41 (in English) to 66 (in Portuguese), which I assume makes the decoding task more difficult. I also experimented with the fine-tune model and the results improved slightly (sometimes one or two words are intelligible), but the utterances are still unintelligible overall.
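For reference, here is a minimal sketch of how the Espeak backend of the `phonemizer` package can produce Portuguese phoneme transcriptions and how to check the size of the resulting phoneme inventory. This is not the repo's actual extract_features.py; the sentences are placeholders and the exact language code depends on the installed espeak version.

```python
# Minimal sketch (not the repo's extract_features.py): get Portuguese phoneme
# transcriptions via phonemizer's espeak backend and count the symbol set.
from phonemizer import phonemize
from phonemizer.separator import Separator

# Placeholder sentences; in practice these come from your transcript files.
sentences = ["bom dia", "como está"]

# 'pt' selects European Portuguese in espeak-ng; adjust the language code
# (e.g. 'pt-pt') if your espeak installation names it differently.
phoneme_strings = phonemize(
    sentences,
    language="pt",
    backend="espeak",
    separator=Separator(phone=" ", word=" | "),
    strip=True,
)

# Collect the phoneme symbol set; with espeak this inventory is larger than
# the ~41 symbols Festival produces for English, so the model's text symbol
# embedding table has to be resized to match.
symbols = {p for line in phoneme_strings for p in line.split() if p != "|"}
print(len(symbols), sorted(symbols))
```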

My questions are the following:

1. Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?
2. What would you suggest in order to solve the decoder alignment problem?

Thank you very much

jxzhanggg commented 4 years ago

Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?

I think more data is always helpful when training this model, so it should be useful to pre-train on as much data as possible.

What would you suggest in order to solve the decoder alignment problem?

I have found that alignment convergence can be tricky; here are some tips from my experience:

  1. Try to use shorter utterances; if possible, cut long utterances into smaller pieces.
  2. You can also gradually increase the maximum utterance length, which is a kind of curriculum learning: at the beginning, train only on short utterances so that the alignment is easier to learn (see the sketch after this list).
  3. If the alignment collapses during training, try to decrease the learning rate or enlarge the batch size.
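To make point 2 concrete, here is a minimal sketch of a length-based curriculum. The helper names, frame limits, and metadata layout are assumptions for illustration, not code from this repo.

```python
# Hypothetical helpers illustrating a length curriculum: train on short
# utterances first and raise the length cap as training progresses.

def curriculum_max_frames(epoch, start=400, step=200, every=10, cap=1000):
    """Maximum allowed mel-spectrogram length (in frames) at a given epoch.

    The limits are placeholders; at 16 kHz with a 200-sample hop,
    400 frames is roughly 5 seconds of speech.
    """
    return min(start + (epoch // every) * step, cap)


def filter_metadata(metadata, epoch):
    """Keep only utterances short enough for the current curriculum stage.

    `metadata` is assumed to be a list of (audio_path, phoneme_text, n_frames)
    entries, i.e. whatever your dataset class reads its file list from.
    """
    limit = curriculum_max_frames(epoch)
    return [entry for entry in metadata if entry[2] <= limit]


# In the training loop, rebuild the DataLoader whenever the cap changes:
#
#   if epoch % 10 == 0:
#       train_set = TrainDataset(filter_metadata(metadata, epoch))
#       train_loader = DataLoader(train_set, batch_size=..., shuffle=True)
#
# For point 3, if the attention plots suddenly degrade, a common remedy is to
# reduce the learning rate on the existing optimizer, e.g.:
#
#   for group in optimizer.param_groups:
#       group["lr"] *= 0.5
```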