rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

GPU Resource Estimation #210

Open roughhewer opened 1 year ago

roughhewer commented 1 year ago

Hello all, I have a question about GPU resource requirements for a training project I am doing with Piper.

I am following the training guide and the video by Thorsten Müller.

Data: single speaker, 18,000 files, average length 3 seconds, sample rate 22,050 Hz, LJSpeech format.

Training settings: batch size 32, 10,000 epochs, precision 32, quality high.

Training is resumed from the Lessac High Quality Voice Checkpoint on Hugging Face.
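
For reference, the preprocessing step follows the example in the training guide and looks roughly like this for my data (the paths and the en-us language tag are placeholders, so treat it as a sketch rather than my exact invocation):

    python3 -m piper_train.preprocess \
        --language en-us \
        --input-dir /path/to/my_ljspeech_dataset \
        --output-dir /path/to/training_dir \
        --dataset-format ljspeech \
        --single-speaker \
        --sample-rate 22050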

When running on a regular CPU, I keep hitting out-of-memory (OOM) exits. The free T4 GPU on Google Colab is not always available, and even when it is, it takes a long time to run through one epoch.

I am trying to estimate how many GPUs of what type I should rent on Lambda Labs, and how long an epoch would take there.

I have also read that getting a good-quality clone with models like Piper and Tacotron needs around 100K steps (steps = batch size × number of epochs, so batch size 32 × 10,000 epochs would be 320,000 steps); any advice there would also be appreciated, thanks.

synesthesiam commented 1 year ago

I trained all of Piper's voices on an A6000 (48GB) and on a few 3090s (24GB). For a high-quality model at batch size 32, you might be able to get away with a 3090, depending on the maximum file length.

Use --max-phoneme-ids in your training script to filter out very long files. I'd start with about 400 and work down. You will see a log message with how many files are excluded. If it's too many, you'll need to get more VRAM or lower the model quality.
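
For example, added to a training command along the lines of the one in the training guide (the directory and checkpoint paths here are placeholders):

    python3 -m piper_train \
        --dataset-dir /path/to/training_dir \
        --accelerator gpu \
        --devices 1 \
        --batch-size 32 \
        --validation-split 0.0 \
        --num-test-examples 0 \
        --max_epochs 10000 \
        --resume_from_checkpoint /path/to/lessac-high.ckpt \
        --checkpoint-epochs 1 \
        --precision 32 \
        --quality high \
        --max-phoneme-ids 400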

Feanix-Fyre commented 1 year ago

First, in my experience, Colab has always been unreliable and Lambda costs more than Runpod for the less beefy GPUs.

TL;DR: 24GB VRAM on Runpod for another 2k-3k epochs would be my starting point in your situation, depending on audio quality.

I trained by resuming from the Lessac high checkpoint on a Runpod cloud instance at 22,050 Hz with --high-quality (or --high; I might be mixing up CLI arguments from different projects), and I agree with the 24GB VRAM assessment for training with larger batch sizes. I fine-tuned it and am still evaluating which checkpoints to keep (or whether to just improve my dataset and run again). The audio quality of my dataset (and the model) is just below studio quality, so I don't think that's my concern.
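
(If you want to double-check the exact flag names on your install, the built-in help lists them; that's how I'd settle the --high vs. --high-quality question rather than trusting my memory:

    python3 -m piper_train --help
    python3 -m piper_train.preprocess --help
)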

ASIDE: As I'm typing this, I think recording and training at 48 kHz would solve that, but I'm going to investigate neural-net audio upsampling/restoration first.

I would like to finish the piper-recording-studio recordings, but at 200/1150 samples the dataset was good enough for what I needed at the time. I kept the checkpoints from epochs 3500-5000, saving every 100-250 epochs to catch detectable differences. If I remember correctly, my 3070 Laptop GPU (8GB VRAM, with only about 7GB usable during training) could only handle a batch size of 6 or 8, and I think I was able to push a 16GB card on Runpod to around 12. My memory is foggy, but hopefully still solid enough to contribute to the conversation.


EDIT: I can't remember if I trained at half precision or full but I strongly suspect full.

roughhewer commented 1 year ago

Thanks @synesthesiam and @Feanix-Fyre

I will try the --max-phoneme-ids argument and keep you posted. The longest audio file in the dataset is about 10 words of speech, and 90% of the files are under 5 words.

When I was running on Google Colab with 12GB of GPU RAM on 1,000 files (a truncated version of the 18K dataset) and a batch size of 4 (all it could handle), each epoch took a consistent 2-3 minutes, so running just 100 epochs took almost 4 hours. @Feanix-Fyre you are right about batch sizes at the low end of VRAM: on 12-16GB you can get a maximum batch size of about 8-12.

I was wondering how long each epoch took you on a 24GB or 48GB VRAM GPU.

Roughly how long do you think it will take if I run my set of 18K three-second files on a 24GB VRAM GPU with a batch size of 32 for 3,000 epochs at high quality?
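
For scale, my back-of-envelope step count for that run, assuming steps per epoch is roughly dataset size divided by batch size:

    # 18,000 files at batch size 32 -> optimizer steps per epoch
    echo $(( (18000 + 31) / 32 ))            # ~563 steps per epoch
    # 3,000 epochs at that rate -> total steps
    echo $(( (18000 + 31) / 32 * 3000 ))     # ~1,689,000 steps total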

Feanix-Fyre commented 1 year ago

I'm not too certain, but I'd start at batch size 128, let it OOM, then reduce and restart until I find the sweet spot (see the sketch below). I just spent an hour trying to do that myself, but ran into issues because my dataset wasn't in the right format, I couldn't find the original files in the disorganization, and once I finally did, the piper preprocess stage was taking longer than I wanted, so I gave up and gave you this half-baked advice (batch size 128? really...) instead.
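
Roughly what I mean is something like the loop below. It's only a sketch: the paths, the batch-size ladder, and the 10-minute probe window are placeholders, and it leans on the fact that a CUDA OOM usually crashes within the first few batches.

    # Probe progressively smaller batch sizes; keep the first one that survives a short run.
    for bs in 128 96 64 48 32 24 16; do
        echo "Trying --batch-size $bs"
        timeout 10m python3 -m piper_train \
            --dataset-dir /path/to/training_dir \
            --accelerator gpu \
            --devices 1 \
            --batch-size "$bs" \
            --quality high \
            --max-phoneme-ids 400 \
            --precision 32 \
            --max_epochs 10000
        # timeout exits 124 when the probe ran the full 10 minutes without crashing (no OOM).
        if [ $? -eq 124 ]; then
            echo "Batch size $bs fits; restart real training with it."
            break
        fi
    done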

StoryHack commented 1 year ago

I'm currently about 1,350 epochs into training the LJSpeech dataset at high quality from scratch. I'm using an RTX 3060, which isn't the fastest, but it is chugging along; I get about 100 epochs a day.

The settings I landed on to avoid out-of-memory errors are a batch size of 16 with --max-phoneme-ids set to 350.

SeymourNickelson commented 10 months ago

@StoryHack How did your model come out training on the RTX 3060 with those parameters?

I tried training on Colab, but it won't let me finish preprocessing on higher-end GPUs like the A100 (it gets stuck every time).

I am able to get going with the T4 runtime, but training is pretty slow there. With the batch size set to 32 and --max-phoneme-ids set to 350, I'm using 14GB out of the 15GB of available VRAM. If this were my own hardware I'd patiently wait without worrying, but I know Colab will kill my session well before this can finish.

I think the T4 has more memory than the RTX 3060, so I was wondering how your model turned out when it finished training on the RTX 3060.

StoryHack commented 10 months ago

It turned out okay. You can hear my results at: https://brycebeattie.com/files/tts/

I would like to get something set up to train on Runpod, but I haven't had time to figure it out.