My understanding was that multispeaker would yield much better results, since the range of waveforms and sounds is much, much broader and there are more than 200 hours of speech.
In my experiments, a model trained on long single-speaker data (say, around 24 hours) is better than one trained on multi-speaker data (e.g., several minutes each from various speakers). So I think the important points are the length of each speaker's speech and its quality. If you add a bad-quality speaker, it may cause quality degradation.
This is exactly what I thought at first, and I would wager that some LibriTTS recordings have a lot of noise. My single speaker, admittedly, had no background noise. However, I then thought that noise is introduced during GAN training anyway, no? So I figured that introducing noise within the set might make the model more robust. I got the exact same noise as in your pretrained LibriTTS model, like a "zzzzzz" kind of thing. But I checked your pretrained VCTK multi-band MelGAN model and it sounds much, much better.
I also think you have a point about the duration of speech for each speaker, although I think it then becomes a harder problem to tackle. Have you done any more experiments with multispeaker? I guess I would be willing to train one more model and carefully curate the data, but I am not really sure how many speakers are needed for the model to be able to generalize to other speakers.
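To give an idea of the kind of curation I mean, here is a rough sketch that keeps only speakers with enough total audio; the LibriTTS-style directory layout and the 30-minute cutoff are just assumptions on my part, not anything from the repo:

```python
# Hypothetical curation sketch: keep only speakers with enough audio.
# Assumes a LibriTTS-style layout: <root>/<speaker_id>/**/*.wav
import os
import glob
import soundfile as sf

def total_duration_per_speaker(root):
    """Return {speaker_id: total seconds of audio} for a LibriTTS-style tree."""
    durations = {}
    for speaker in os.listdir(root):
        spk_dir = os.path.join(root, speaker)
        if not os.path.isdir(spk_dir):
            continue
        total = 0.0
        for wav in glob.glob(os.path.join(spk_dir, "**", "*.wav"), recursive=True):
            info = sf.info(wav)  # read header only, no need to load samples
            total += info.frames / info.samplerate
        durations[speaker] = total
    return durations

# Keep speakers with at least 30 minutes of audio (the threshold is a guess).
MIN_SECONDS = 30 * 60
durations = total_duration_per_speaker("LibriTTS/train-clean-100")
kept = {spk for spk, sec in durations.items() if sec >= MIN_SECONDS}
print(f"Keeping {len(kept)} of {len(durations)} speakers")
```

On top of the duration filter, one could also drop speakers whose recordings are obviously noisy, which is the harder part to automate.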
However, I then thought that noise is introduced during GAN training anyway, no?
What is the noise?
But I checked your pretrained VCTK multi-band MelGAN model and it sounds much, much better.
I think VCTK has better recording quality than LibriTTS. Did you compare my LibriTTS and VCTK samples?
but I am not really sure how many speakers are needed for the model to be able to generalize to other speakers.
This is a very difficult question. I have no clear answer.
I do not have samples at hand right now, unfortunately, but it was something like a static noise that sounded like "zzzz". Kind of metallic. If you listen to your LibriTTS and VCTK generated samples, the VCTK ones sound so much clearer and very, very close to ground truth. So I think you are right in saying it depends on the dataset's sound quality.
Now, I read this paper here and they report generalizing to unseen speakers with 60 hours of data (10 hours each from 6 speakers). I will try it and report back here. If it does not work, I think I will also try training a PWGAN on VCTK.
Hi,
I have been trying to train different models on LibriTTS and some internal speakers, but I think the results leave much to be desired. While I get good results with a single speaker, that is not the case with multispeaker. My understanding was that multispeaker would yield much better results, since the range of waveforms and sounds is much, much broader and there are more than 200 hours of speech. What happens is that it either synthesizes with static or it sounds muffled. I am talking about TTS synthesis, by the way. Of course, I am not after WaveNet quality, but the single-speaker tests sound much better.
Taco2 has been trained for a long time, so the spectrograms should be clean on that front.
My config is the following (which I am guessing may have something wrong with it); it is also what I used for my single-speaker training, which yielded good results:
I have downsampled the dataset and changed the hop and win sizes to accommodate my TTS attributes.
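To illustrate what I mean by matching the vocoder settings to the TTS features, here is a rough sketch; the 22.05 kHz target rate, the 256/1024 hop/win sizes, and the file names are just the values I happen to use, so treat them as assumptions:

```python
# Rough sketch of the downsampling step and the matching STFT settings.
# The target rate and hop/win sizes below are my own values, not defaults.
import librosa
import soundfile as sf

TARGET_SR = 22050   # sampling rate the TTS model was trained at (assumed)
HOP_SIZE = 256      # must equal the TTS frame shift in samples
WIN_LENGTH = 1024   # must equal the TTS analysis window in samples

def downsample(in_path, out_path, target_sr=TARGET_SR):
    """Load a wav and rewrite it at the TTS sampling rate."""
    audio, _ = librosa.load(in_path, sr=target_sr)  # librosa resamples on load
    sf.write(out_path, audio, target_sr)

# The vocoder's mel extraction then has to use the same hop/win sizes,
# otherwise its frames will not line up with the TTS output. For example
# ("sample.wav" is just a placeholder file):
audio, _ = librosa.load("sample.wav", sr=TARGET_SR)
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=TARGET_SR,
    n_fft=WIN_LENGTH,
    hop_length=HOP_SIZE,
    win_length=WIN_LENGTH,
)
```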