I have 70 minutes of transcribed audio clips of a new speaker; each clip is at most 10 seconds long. I started from the pretrained Linda Johnson (LJSpeech) model by running:

python train.py --data-root=./data/fresh --checkpoint-dir=checkpoints_fresh --preset=presets/deepvoice3_ljspeech.json --log-event-path=log/fresh --restore-parts="data\LJSpeech_1_1\20180505_deepvoice3_checkpoint_step000640000.pth" --speaker-id=0

Every day, I transcribe more audio clips of this new speaker, on the assumption that more data will lead to better results. (I remember that Linda Johnson recorded close to 24 hours of audio covering sentences with considerable variety.)
@r9y9 I wonder if you know, or could guess, the answers to these questions:
Is it OK that I lowered the "batch size" in deepvoice3_ljspeech.json to 10? (I did that because CUDA kept running out of memory and crashing.)
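For reference, the only change I made to the preset was along these lines (an excerpt of my edited deepvoice3_ljspeech.json; I'm assuming batch_size is the right key and that everything else in the file should stay untouched):

```json
{
  "batch_size": 10
}
```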
How many minutes of transcribed audio clips would it take before the output sounds like the new speaker? (@G-Wang said 1.5 hours, and @Kyubyong (Kyubyong Park) says just 1 minute!)
How many "steps" does the checkpoint need before the new speaker is trained well enough to sound good? (I don't actually know what a "step" means here.)
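My working assumption, which may be wrong, is that one "step" means one optimizer update on a single batch. If that's right, the back-of-envelope numbers for my data would look like this:

```python
# Back-of-envelope math; my assumption is that one "step" = one optimizer
# update on a single batch of clips.
total_seconds = 70 * 60      # my 70 minutes of transcribed audio
clips = total_seconds // 10  # clips are at most 10 s each -> at least ~420 clips
batch_size = 10              # the value I set in deepvoice3_ljspeech.json

# One epoch (a full pass over the data) takes clips / batch_size steps.
steps_per_epoch = clips // batch_size
print(steps_per_epoch)  # 42

# So my checkpoint_step000017000.pth would correspond to roughly this many
# passes over my current data:
print(round(17000 / steps_per_epoch))  # 405
```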
After transcribing another X minutes of audio samples, I run:

python preprocess.py json_meta "C:\code\voice_cloning\audio\alignment.json" "./data/fresh" --preset=presets/deepvoice3_ljspeech.json

So then:
Is it okay for me to resume training by running:

python train.py --data-root=./data/fresh --checkpoint-dir=checkpoints_fresh --preset=presets/deepvoice3_ljspeech.json --log-event-path=log/fresh --checkpoint="checkpoints_fresh\checkpoint_step000017000.pth" --speaker-id=0

or must I start over from scratch every time I've run preprocess.py after adding more transcriptions?
Should I be monitoring anything and adjusting my approach based on the results? (I don't understand what graphs like step000025000_text4_single_alignment.png represent.)
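My guess, and it's only a guess, is that those PNGs visualize the attention alignment between text (encoder) positions and audio (decoder) timesteps, where a clean diagonal band would mean the model is reading through the text in order. A minimal sketch of how I picture such a matrix being plotted, using synthetic data rather than a real checkpoint:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic attention matrix: rows = decoder (audio) timesteps,
# columns = encoder (text) positions. A "good" alignment is a bright
# diagonal band from the first character to the last.
decoder_steps, encoder_steps = 200, 60
alignment = np.zeros((decoder_steps, encoder_steps))
for t in range(decoder_steps):
    j = int(t / decoder_steps * encoder_steps)  # position along the diagonal
    alignment[t, max(0, j - 2):j + 3] = 1.0     # narrow bright band

plt.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep")
plt.ylabel("Encoder timestep (text position)")
plt.title("Synthetic example of a diagonal alignment")
plt.savefig("alignment_example.png")
```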
I really appreciate your help.
And as a thank-you, I want to share a tool that I just built and have been using for the past couple of days to make transcription super fast. It uses an API to pull in surprisingly accurate speech-to-text (Google's wasn't good enough in my experience):
https://send.firefox.com/download/00119bffbe/#ehNtuTyv9KIumI_VTdj7Dg
I hope it helps.