r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

Improving speaker adaptation with few voice samples #89

Closed: yrahul3910 closed this issue 5 years ago

yrahul3910 commented 6 years ago

Hi, I tried adapting the pre-trained DeepVoice3 model to a dataset of only 23 voice samples (about 2 minutes) from a single speaker, using the LJSpeech preset. After training for 1100 steps (about 4 hours on my system), it produces practically empty audio (attached alignment plots: 0_checkpoint_step000001100_alignment, 1_checkpoint_step000001100_alignment, 2_checkpoint_step000001100_alignment). Does DeepVoice3 need more audio than this? Is there a rough figure for the minimum amount of data?
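
(For anyone wanting to check how much adaptation audio they actually have, here is a small stand-alone helper using only the standard library; the folder path is just an example, not from this thread.)

```python
# Hedged helper: total duration of the adaptation wav clips, to compare
# against the amounts people report below. The folder path is an example.
import glob
import wave

total_seconds = 0.0
for path in glob.glob("my_adaptation_data/*.wav"):
    with wave.open(path, "rb") as w:
        total_seconds += w.getnframes() / w.getframerate()

print(f"{total_seconds / 60:.1f} minutes of audio")
```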

G-Wang commented 6 years ago

I think that's too little data for speaker adaptation; the smallest dataset I've used is around 150 voice samples. Also, you shouldn't train on such a small dataset for that long; that's why your model has completely degenerated.

In my case it only took about 10-20 epochs through the 150 voice samples to get good adaptation. Training longer will only make the model overfit to the small set of samples.

Also, if you can, try to get samples with some variety in their wording, for example from the Harvard Sentences list: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/fgdata/Recorder.app/utterances/Type1/harvsents.txt

yrahul3910 commented 6 years ago

I see. I've doubled my data to about 60 samples, but I can definitely get more. The alignment graph is much better now, though it's not quite there yet. Are you sure you only ran 10-20 epochs? That would take just a few seconds, if I'm not mistaken. Seems awfully small. I've basically just been training until the loss looks small.
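
(Rough arithmetic on what 10-20 epochs means in optimizer steps for a set this small; the batch size below is an assumption, not necessarily the preset's value.)

```python
# Back-of-the-envelope: steps implied by 10-20 epochs over ~150 clips.
# Batch size 16 is an assumption, not taken from the LJSpeech preset.
num_samples = 150
batch_size = 16
steps_per_epoch = -(-num_samples // batch_size)  # ceiling division, ~10 steps

for epochs in (10, 20):
    print(f"{epochs} epochs ~= {epochs * steps_per_epoch} steps")
# So 10-20 epochs is only on the order of 100-200 steps: small, but
# likely minutes rather than seconds of training on a GPU.
```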

r9y9 commented 6 years ago

I'm just curious why you expect it to work with 23 samples... End-to-end models are hard to optimize and data-hungry! Anyway, I would try freezing the seq2seq model parameters and training only the postnet part. That could stabilize training.
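
(A minimal PyTorch sketch of that suggestion: freeze the seq2seq part so only the postnet is optimized. The `TTSModel` below is a stand-in, and the `seq2seq`/`postnet` attribute names are assumptions based on how deepvoice3_pytorch splits its model; adapt it to the model actually returned by the repo's builder.)

```python
# Hedged sketch of "freeze seq2seq, train only the postnet".
import torch
import torch.nn as nn

class TTSModel(nn.Module):
    """Stand-in with the same seq2seq/postnet split assumed for the real model."""
    def __init__(self):
        super().__init__()
        self.seq2seq = nn.GRU(80, 256, batch_first=True)  # placeholder encoder/decoder
        self.postnet = nn.Linear(256, 1025)               # placeholder converter

model = TTSModel()

# Freeze every seq2seq parameter so only the postnet is updated.
for p in model.seq2seq.parameters():
    p.requires_grad = False

# Give the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters (postnet only)")
```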

navinthenapster commented 6 years ago

@G-Wang what is the total duration of all the samples (150 samples) you trained on?

yrahul3910 commented 6 years ago

I'm curious too... I re-trained with about 4 minutes of data and I'm getting noticeably better results, but I want to see what the minimum threshold is.

navinthenapster commented 6 years ago

@yrahul3910 I have a problem with the synthesized voice for speaker adaptation. It synthesizes an empty .wav file, and the checkpoint is not being saved properly. Did you preprocess your data with gentle or json_meta?

Can you give me a rough idea of the speaker adaptation process, starting from preparing the dataset?

yrahul3910 commented 6 years ago

@navinthenapster Uh damn, it's been a while, so I'm not sure I remember everything. I definitely preprocessed using json_meta, and I remember initially getting empty wav files as output. For me, the issue was that my voice samples (for adaptation) were not loud enough, so I threw them out and recorded another ~4 minutes of audio across about 60 samples, making sure they were much louder. (If you're unsure about loudness, I used lyrebird.ai for a bunch of sentences, and it tells you if you're not loud enough.)

As for the process, I simply took the pretrained model and ran train.py on it. I don't remember exactly what options I gave it, but they're in the docs somewhere. Then, like I said, I used lyrebird.ai to record some sentences, downloaded my wav files, and just trained the model for a little while, up to about 1400 steps. I got pretty decent results, but I kind of lost interest; maybe I'll get back to it sometime.

navinthenapster commented 6 years ago

Thanks for the quick reply.

G-Wang commented 5 years ago

@navinthenapster My total duration was about 20 minutes.

aishweta commented 5 years ago

@yrahul3910 @navinthenapster Did you both get good results?

I think my checkpoints are not getting saved. Here is what I ran:

  1. Training: `python train.py --data-root=/home/ubuntu/shweta/voice-cloning/deepvoice3_pytorch/data/ljspeech --checkpoint-dir=pre --preset=presets/deepvoice3_ljspeech.json --log-event-path=log --restore-parts="/home/ubuntu/shweta/voice-cloning/deepvoice3_pytorch/pre/20180505_deepvoice3_checkpoint_step000640000.pth" --speaker-id=0`

  2. Synthesizing: `python synthesis.py /pre/20180505_deepvoice3_checkpoint_step000640000.pth output/text_list.txt output`

I trained on the MALIBAS data, about 180 samples, but I didn't get any checkpoints in the pre folder. Which checkpoint should I use for synthesizing? I thought the checkpoints would be restored into 20180505_deepvoice3_checkpoint_step000640000.pth, but after synthesizing I didn't notice any difference between the output for my trained data and the provided samples. I used the pre-trained model 20180505_deepvoice3_checkpoint_step000640000.pth for the first part, and for speaker adaptation I trained on my own data. My question is: where should I find the new checkpoints?
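
(For what it's worth: train.py writes periodic checkpoints into the folder passed via --checkpoint-dir, here `pre/`, and synthesis.py should be pointed at the newest of those rather than at the original pre-trained .pth. A small sketch, assuming the `checkpoint_step*.pth` naming pattern:)

```python
# Hedged sketch: pick the newest adapted checkpoint written by train.py.
# Assumes checkpoints land in the --checkpoint-dir folder ("pre" here) and
# follow the checkpoint_step*.pth naming pattern.
import glob
import os

checkpoints = glob.glob(os.path.join("pre", "checkpoint_step*.pth"))
if not checkpoints:
    print("No adapted checkpoints found; check --checkpoint-dir and the checkpoint interval.")
else:
    latest = max(checkpoints, key=os.path.getmtime)
    print("Use this for synthesis.py:", latest)
```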

aishweta commented 5 years ago

@r9y9 @G-Wang I'm not able to get checkpoints in the pre folder for adaptation. Could anyone help me with this?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.