rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Some improvements / bug fixes #476

Open · mitchelldehaven opened this pull request 3 weeks ago

mitchelldehaven commented 3 weeks ago

Thanks for the repo, it's really awesome to use!

Using it, I noticed a few problems and had some personal preferences, so I made changes to address them. I thought I would create a PR with these changes:

  • 379bee1: The way the callbacks are set results in the PyTorch Lightning progress bar not showing during training (not sure if this is intentional).
  • 555e8aa: This is just a personal preference, but early stopping is nice: starting from a checkpoint, training starts to plateau after about 100-200 epochs, so stopping early saves compute.
  • d8655a1: Also a personal preference, but since early stopping isn't used and the model can start to overfit substantially after hundreds of epochs, saving the best result on the validation set is nice.
  • e6af4c6: Having the progress bar clued me into what was making training so slow in my case. I was using 200 test samples, and as currently set up, the code iterates one by one through the test samples and generates audio for each validation batch. With 10 validation batches per test run, this happens 10 times. This commit simply moves the generation to on_validation_end so it only runs once. Looking at the code, the test audio generation could probably also be batched, but this already saves quite a bit of time as is.
  • de98f64: abs needs to be taken before max so that the maximum negative amplitude is taken into account; currently the abs is essentially a no-op. This is the cause of the warnings about amplitudes being clipped (a minimal sketch of the fix follows the next paragraph).
  • 8eda759: This allows 16 to be used as a precision. From what I can tell, it's only marginally faster for some reason (usually I see ~2x improvement in speed), so there may be more going on here. Also, PyTorch 1.13 didn't have the best support for bf16, so it is actually slower than using full precision.

If you want some of these changes pulled in but not others, I can drop the commits you don't want (the only commits that fix actual bugs are 379bee117d937857d3529b39600df1d3e3c9130e, e6af4c6dd223064e443b42de3e6345da4e903535, and de98f64cc726cc3e6119679c01fd70f944f4668e).
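
For the de98f64 item above, a minimal sketch of the difference (made-up values, not the actual piper code):

```python
import torch

audio = torch.tensor([0.2, -0.9, 0.5])

# Buggy order: max() first picks 0.5 and drops the larger negative peak (-0.9),
# so the subsequent abs() is effectively a no-op and the scale factor is wrong.
bad_peak = audio.max().abs()    # tensor(0.5000)

# Fixed order: abs() first, so the largest magnitude (0.9) is used and negative
# peaks can no longer clip after normalization.
good_peak = audio.abs().max()   # tensor(0.9000)

normalized = audio / good_peak
```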

Below is the reported time difference on a small 10-epoch fine-tuning run using 200 test samples and ~200 validation samples; the improvement is likely all due to e6af4c6dd223064e443b42de3e6345da4e903535 (a generic sketch of that change follows the timing output).

# 10 epoch finetune
time python -m piper_train --dataset-dir path/to/my/data --accelerator 'gpu' --devices 1 --batch-size 16 --validation-split 0.1 --num-test-examples 200 --max_epochs 6689 --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" --checkpoint-epochs 1 --precision 32 --max-phoneme-ids 400
...
real    23m31.791s
user    21m30.965s
sys     1m35.897s

# Fixed problem finetune 10 epochs
time python -m piper_train --dataset-dir path/to/my/data --accelerator 'gpu' --devices 1 --batch-size 16 --validation-split 0.1 --num-test-examples 200 --max_epochs 6689 --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" --checkpoint-epochs 1 --precision 32 --max-phoneme-ids 400
...
real    9m31.955s
user    8m33.749s
sys     1m1.055s
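
As a rough illustration of what e6af4c6 changes (generic PyTorch Lightning hooks with hypothetical helper methods, not piper_train's actual module), the test-audio generation moves out of the per-batch validation path and into on_validation_end, so it runs once per validation epoch rather than once per validation batch:

```python
import pytorch_lightning as pl


class TTSModule(pl.LightningModule):  # hypothetical module for illustration only
    def validation_step(self, batch, batch_idx):
        # Per-batch work is limited to computing and logging the validation loss.
        loss = self._compute_loss(batch)      # assumed helper
        self.log("val_loss", loss)
        return loss

    def on_validation_end(self):
        # Called once per validation epoch, so the held-out test utterances are
        # synthesized a single time instead of once for every validation batch.
        for sample in self._test_samples:     # assumed list of test samples
            audio = self._synthesize(sample)  # assumed inference helper
            self._write_audio(sample, audio)  # assumed audio writer
```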

EDIT: I don't have a multi-GPU setup, so I didn't test any of this with DP or DDP; specifically, 8eda7596f3857b6e160284603bbc06cb3ce77a31 could cause issues for multi-GPU setups. I also haven't tested training without a validation set, but I can test to make sure that still works fine.

bzp83 commented 3 weeks ago

This should be merged ASAP! Edit: it breaks preprocessing, see https://github.com/rhasspy/piper/pull/476#issuecomment-2075033495

I'm already using this PR for a local training run.

@mitchelldehaven, could you explain more about the usage of --patience, please? I didn't know about EarlyStopping... still learning :)

bzp83 commented 3 weeks ago

I don't know where it should be fixed, but it would look better to have a space or line break between '2' and 'Metric'. The log messages are printed right after the progress bar, like:

8/48 [00:23<00:00, 2.06it/s, loss=26.1, v_num=2Metric val_loss improved by 1.927 >= min_delta = 0.0. New best score: 51.278

and I also don't see a matching ']' for the '[' that opens after '8/48'. Not sure if this is intended.

bzp83 commented 3 weeks ago

One more update... this PR breaks preprocessing; it failed with: RuntimeError: Currently, AutocastCPU only support Bfloat16 as the autocast_cpu_dtype

mitchelldehaven commented 3 weeks ago

@bzp83

The --patience flag essentially sets how many epochs can elapse without an improvement in the best validation loss before training is terminated. I was using a patience of 10, so if your best validation loss occurs at epoch 50, training will continue until at least epoch 60; if none of those last 10 epochs achieves a better loss than epoch 50, training terminates.

The idea is that if you reach a low validation loss and cannot beat it for several successive epochs, your validation loss has likely plateaued and further training will not yield improvements. You could try higher values, maybe 20 or so; however, I only have ~2000 training samples, so overfitting happens fairly quickly when starting from a checkpoint.
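
For reference, a minimal sketch of how a patience-based stop is typically wired up with PyTorch Lightning's EarlyStopping callback (the metric name and values here are illustrative, not necessarily what the PR uses):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop training once the monitored validation loss has gone 10 consecutive
# epochs without improving on its best value so far.
early_stop = EarlyStopping(
    monitor="val_loss",  # assumed metric key
    mode="min",          # lower validation loss is better
    patience=10,         # e.g. best at epoch 50 -> train until at least epoch 60
    min_delta=0.0,       # any improvement counts
    verbose=True,
)

trainer = Trainer(callbacks=[early_stop])
```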

I can take a look at the preprocessing issue after my workday today; likely the fix is simply changing enabled=True to enabled=False in the autocast contexts.
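
For context, the kind of change being described looks roughly like this (a sketch, not the actual piper_train preprocessing code):

```python
import torch

x = torch.randn(8, 80)

# On PyTorch 1.13, CPU autocast only supports bfloat16, which is what triggers
# the RuntimeError quoted above when float16 is requested. Passing enabled=False
# keeps the context manager in place but runs the block in full float32.
with torch.autocast(device_type="cpu", enabled=False):
    mel = torch.log(torch.clamp(x.abs(), min=1e-5))
```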

rmcpantoja commented 3 weeks ago

Hi, good job, these seem like significant improvements! In my fork I also made improvements to the trainer; see the audio results and the improved checkpointing there. These improvements have been updated over time, but what caught my attention was the audio generation during validation.
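
For reference, saving only the best validation checkpoint (the behavior d8655a1 describes, and similar in spirit to the checkpointing changes mentioned above) can be expressed with Lightning's ModelCheckpoint callback roughly like this (values are illustrative, not the PR's exact configuration):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the single checkpoint with the lowest validation loss instead of only
# the most recent epoch, so late-stage overfitting cannot overwrite the
# best-performing weights.
best_ckpt = ModelCheckpoint(
    monitor="val_loss",  # assumed metric key
    mode="min",
    save_top_k=1,
    filename="best-{epoch}-{val_loss:.3f}",
)

trainer = Trainer(callbacks=[best_ckpt])
```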