k2kobayashi / crank

A toolkit for non-parallel voice conversion based on vector-quantized variational autoencoder
MIT License
168 stars 31 forks

Dataset question #25

Closed pavelxx1 closed 3 years ago

pavelxx1 commented 3 years ago

Hi, first off, thanks for this great repo! I have a question.

k2kobayashi commented 3 years ago

If the feature vectors are extracted correctly, that's no problem. But I assumed each audio clip is around 5 s long.
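Since the training setup assumes clips of roughly 5 s, it can be worth checking a dataset before training. A minimal sketch using only the standard library (the `target` and `tol` values are illustrative, not from crank itself):

```python
import wave

def wav_duration(path):
    """Return the duration in seconds of a PCM wav file."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def flag_unusual_lengths(paths, target=5.0, tol=3.0):
    """Hypothetical helper: return files whose duration is far from the
    ~5 s clip length assumed above (thresholds are illustrative)."""
    return [p for p in paths if abs(wav_duration(p) - target) > tol]
```

This only reads wav headers, so it is cheap to run over a whole corpus before feature extraction.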

pavelxx1 commented 3 years ago

Thanks. 1) How many wav files does the dataset need for a good training result? 2) If I want to convert speakerA to speakerB, can I use a small dataset for speakerA, i.e. 5-10 wav files? Thanks

k2kobayashi commented 3 years ago
  1. The vcc2020v1 and vcc2018v1 recipes use about 1k utterances for training, and that produces acceptable quality.
  2. I didn't try such a condition. I predict 10~30 utterances per speaker is the minimum number of training utterances for 2-speaker modeling.
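If you want to sanity-check a corpus against that rough 10~30 utterances-per-speaker lower bound, a small sketch could look like the following. The `wav_root/<speaker>/*.wav` layout and the helper names are assumptions for illustration, not crank's actual directory convention:

```python
from collections import Counter
from pathlib import Path

def utterances_per_speaker(wav_root):
    """Count wav files under wav_root/<speaker>/*.wav (hypothetical layout)."""
    root = Path(wav_root)
    return Counter(p.parent.name for p in root.glob("*/*.wav"))

def speakers_below_minimum(counts, minimum=10):
    """Return speakers with fewer utterances than the suggested lower bound
    (10 here, per the rough 10~30 estimate above)."""
    return [spk for spk, n in counts.items() if n < minimum]
```

Running this before training makes it obvious which speakers are likely too sparse for 2-speaker modeling.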