r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

How to improve quality over time with more transcriptions? #105

Closed ryancwalsh closed 5 years ago

ryancwalsh commented 6 years ago

I have 70 minutes of transcribed audio clips for a new speaker. Each clip is at most 10 seconds long.

I started from the pretrained LJSpeech (Linda Johnson) checkpoint by running:

python train.py --data-root=./data/fresh --checkpoint-dir=checkpoints_fresh --preset=presets/deepvoice3_ljspeech.json --log-event-path=log/fresh --restore-parts="data\LJSpeech_1_1\20180505_deepvoice3_checkpoint_step000640000.pth" --speaker-id=0
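
My rough mental model of --restore-parts, in case I'm holding it wrong: it copies over only the pretrained parameters whose names and shapes match the new model, and leaves the rest (e.g. new speaker embeddings) freshly initialized. A sketch of that idea, assuming the checkpoint keeps its weights under a "state_dict" key; this is not the repo's exact code:

```python
import torch

def restore_parts_sketch(checkpoint_path, model):
    """Copy pretrained parameters into `model` wherever names and
    shapes match; everything else keeps its fresh initialization."""
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    own.update({k: v for k, v in state.items()
                if k in own and own[k].shape == v.shape})
    model.load_state_dict(own)
```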

Every day, I transcribe more audio clips of this new speaker. I assume that transcribing more and more clips will lead to better results. (And I remember that Linda Johnson recorded close to 24 hours of audio samples of sentences with considerable variety.)

@r9y9 I wonder if you know or could guess the answer to these questions:

  1. Is it OK that I lowered "batch_size" within deepvoice3_ljspeech.json to 10? (I did that because CUDA kept running out of memory and crashing; a sketch of the change follows this list.)
  2. How many minutes of transcribed audio clips would let me hear a result that sounds like the new speaker? (@G-Wang said 1.5 hours, and @Kyubyong (Kyubyong Park) says just 1 minute!)
  3. How many "steps" does the checkpoint need before the new speaker is trained enough to sound good? (I don't know exactly what a step means; my guess is in the training-loop sketch after this list.)
  4. After transcribing another X minutes of audio samples, I run python preprocess.py json_meta "C:\code\voice_cloning\audio\alignment.json" "./data/fresh" --preset=presets/deepvoice3_ljspeech.json. So then:

Is it okay for me to resume training by running python train.py --data-root=./data/fresh --checkpoint-dir=checkpoints_fresh --preset=presets/deepvoice3_ljspeech.json --log-event-path=log/fresh --checkpoint="checkpoints_fresh\checkpoint_step000017000.pth" --speaker-id=0, or must I start over from scratch each time I re-run preprocess.py after adding more transcriptions?

  5. Should I be monitoring anything and adjusting my approach based on the results? (I don't understand what graphs like step000025000_text4_single_alignment.png represent; my best guess at how to read them is after this list.)
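
For context on question 1, this is the kind of edit I mean, written to a copy of the preset so the stock file stays untouched (the copy's filename is just for illustration, and I'm assuming "batch_size" is the right hparam key):

```python
import json

# Load the stock LJSpeech preset, lower the batch size, and save a copy.
with open("presets/deepvoice3_ljspeech.json") as f:
    preset = json.load(f)

preset["batch_size"] = 10  # smaller batches so CUDA stops running out of memory

# Hypothetical filename; pass it to train.py via --preset.
with open("presets/deepvoice3_ljspeech_bs10.json", "w") as f:
    json.dump(preset, f, indent=2)
```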
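
On question 3, my current understanding (please correct me if wrong) is that a step is one optimizer update on one batch, so checkpoint_step000017000.pth means 17,000 batches have been processed. A generic, runnable sketch with toy stand-ins, not the repo's actual loop:

```python
import torch
from torch import nn

# Toy stand-ins just to make the loop runnable; the real model and data
# come from deepvoice3_pytorch.
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters())
data_loader = [(torch.randn(10, 4), torch.randn(10, 1)) for _ in range(42)]

global_step = 0
for epoch in range(2):
    for x, y in data_loader:            # one batch per iteration
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()                # one "step" = one parameter update
        global_step += 1                # the number in checkpoint_step000017000.pth
```

If that's right, then with batch_size=10 and roughly 420 clips (70 minutes of clips of at most 10 seconds each), one pass over my data is about 42 steps, so 17,000 steps is on the order of 400 passes.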
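
And on question 5, my best guess is that the step*_alignment.png images plot the decoder's attention over the input text: a sharp diagonal means the model is reading the text monotonically in time, while a blurry or broken diagonal usually goes with garbled speech. A sketch of how such a plot is drawn, with random data standing in for a real attention matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in for the real attention weights, shaped
# (decoder timesteps, encoder timesteps).
alignment = np.random.rand(200, 60)

plt.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep (audio frames)")
plt.ylabel("Encoder timestep (characters)")
plt.colorbar(label="Attention weight")
plt.savefig("alignment_check.png")
```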

I really appreciate your help.

And as a thank-you, I want to share a tool I just built and have been using for the past couple of days to make transcription super fast. It uses an API that returns surprisingly accurate speech-to-text (Google's wasn't good enough in my experience):

https://send.firefox.com/download/00119bffbe/#ehNtuTyv9KIumI_VTdj7Dg

I hope it helps.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.