Closed: yrahul3910 closed this issue 5 years ago.
I think that's too little data for speaker adaptation; the smallest dataset I've used was around 150 voice samples. Also, you shouldn't be training on such a small dataset for that long; that's why your model is completely degenerate.
In my case it only took about 10-20 epochs through the 150 voice samples to get good adaptation; training longer will only make the model overfit to the small set of samples.
Also, if you can, try to get samples with variation in words, for example from the Harvard Sentences list: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/fgdata/Recorder.app/utterances/Type1/harvsents.txt
I see. I've doubled my data to about 60 samples, but I can definitely get more. The graph looks much better now, though it's not quite there yet. Are you sure you only ran 10-20 epochs? That would take just a few seconds, if I'm not mistaken, right? That seems awfully small. I've basically just been training until the loss seems small.
I'm just curious why you expected it to work with 23 samples... end-to-end models are hard to optimize and data-hungry! Anyway, I would try freezing the seq2seq model parameters and training only the postnet part. That could stabilize training.
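The freezing suggestion above can be sketched in PyTorch. This is a minimal illustration, not the deepvoice3_pytorch training loop: the `seq2seq` and `postnet` attribute names and the toy `Linear` layers are stand-ins, so check the actual module names in the repo's model definition before applying the idea.

```python
# Minimal sketch of "freeze seq2seq, train only the postnet".
# The module names and layer shapes here are illustrative assumptions,
# not the real deepvoice3_pytorch architecture.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in model with the two submodules discussed above."""
    def __init__(self):
        super().__init__()
        self.seq2seq = nn.Linear(8, 8)  # placeholder for encoder/decoder
        self.postnet = nn.Linear(8, 8)  # placeholder for the postnet

model = TinyTTS()

# Freeze every seq2seq parameter so no gradients are computed for it.
for p in model.seq2seq.parameters():
    p.requires_grad = False

# Build the optimizer over the remaining trainable (postnet) parameters only.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

With far fewer trainable parameters, the small adaptation set is less likely to push the whole model into the degenerate state described above.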
@G-Wang what is the total duration of all the samples (150 samples) you trained on?
I'm curious too... I retrained with about 4 minutes of data and I'm getting noticeably better results, but I want to see what the minimum threshold is.
@yrahul3910 I have a problem with the synthesized voice for speaker adaptation. It synthesizes an empty .wav file, and the checkpoint is not saving properly. Did you preprocess your data with gentle or json meta?
Can you give me a rough idea of speaker adaptation, starting from preparing the dataset?
@navinthenapster Uh damn, it's been a while, so I'm not sure I remember everything. I definitely preprocessed using json meta, and I remember initially getting empty wav files as output too. For me the issue was that my voice samples (for adaptation) were not loud enough, so I threw them out and recorded another 4 minutes of audio across about 60 samples, making sure they were much louder. (If you're not sure about loudness, I used lyrebird.ai for a bunch of sentences, and it tells you if you're not loud enough.)
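Since quiet adaptation clips were the culprit above, one cheap fix before re-recording everything is to peak-normalize the existing wavs. A minimal NumPy sketch, where the 0.95 target level and int16 PCM format are illustrative assumptions (dedicated tools like sox handle edge cases such as clipping and DC offset more carefully):

```python
# Minimal peak normalization sketch for quiet adaptation clips.
# Assumes int16 PCM samples, as typically read from a .wav file;
# the 0.95 target peak is an arbitrary illustrative choice.
import numpy as np

def peak_normalize(samples: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its loudest sample sits at `peak` of int16 full scale."""
    x = samples.astype(np.float32)
    if samples.dtype == np.int16:
        x /= 32768.0  # map int16 range to roughly [-1, 1]
    m = np.max(np.abs(x))
    if m > 0:
        x *= peak / m  # scale the loudest sample up (or down) to `peak`
    return (x * 32767.0).astype(np.int16)
```

You would read each clip with something like `scipy.io.wavfile.read`, run it through this, and write it back before preprocessing for adaptation.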
As for the dataset, I simply took the pretrained model and ran train.py on it. I don't remember what options I gave it, but they're definitely in the docs somewhere. Then, like I said, I used lyrebird.ai to get some sentences, downloaded my wav files, and ran the model for a little while; I think I went to about 1400 steps. I got pretty decent results, but I kind of lost interest; maybe I'll get back to it sometime.
Thanks for the quick reply!
@navinthenapster my total duration was about 20-ish minutes.
@yrahul3910 @navinthenapster Have you both gotten good results?
I think my checkpoints are not getting saved. I ran:

Training:

```
python train.py --data-root=/home/ubuntu/shweta/voice-cloning/deepvoice3_pytorch/data/ljspeech --checkpoint-dir=pre --preset=presets/deepvoice3_ljspeech.json --log-event-path=log --restore-parts="/home/ubuntu/shweta/voice-cloning/deepvoice3_pytorch/pre/20180505_deepvoice3_checkpoint_step000640000.pth" --speaker-id=0
```

Synthesis:

```
python synthesis.py /pre/20180505_deepvoice3_checkpoint_step000640000.pth output/text_list.txt output
```
After training on the MALIBAS data (around 180 samples), I didn't get any checkpoints in the pre folder. Which checkpoint should I use for synthesis? I thought checkpoints would be saved over 20180505_deepvoice3_checkpoint_step000640000.pth, but after synthesizing I didn't hear any difference between the trained data and the provided samples. I used the 20180505_deepvoice3_checkpoint_step000640000.pth pretrained model for the first part, and for speaker adaptation I trained on my data. My question is: where do I find the new checkpoints?
@r9y9 @G-Wang I'm not able to get checkpoints in the pre folder during adaptation; could anyone help with this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I tried adapting the pretrained DeepVoice3 model to a dataset with only 23 voice samples (about 2 minutes) from a single speaker, using the LJSpeech preset. Does DeepVoice3 require more audio samples? After training for 1100 steps (about 4 hours on my system), it produced practically empty audio. Do I need more voice samples? Is there a rough figure for this?