lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0

train loss of custom data #133

Open · Wangzhen-kris opened this issue 1 year ago

Wangzhen-kris commented 1 year ago

Hi,

I tried to train on my dataset, but I seem to have an abnormal loss curve. Do you have any suggestions? Thanks.

AR loss: https://drive.google.com/file/d/1-gZJX-mwYZ-2vkKTl0dTwBcp1A8MHrmV/view?usp=drive_link
NAR loss: https://drive.google.com/file/d/1-9L_AQZyyAgDRqKPpx06w6M99ZPSUIhe/view?usp=drive_link

RuntimeRacer commented 1 year ago

Hi @Wangzhen-kris, what kind of data does your dataset consist of? Does it by any chance contain very diverse speakers, or even multiple languages? Also, is it organized into separate cut sets that were combined for training?

While trying to train on Mozilla Common Voice I ran into similar graphs. I found out that using the Lhotse dynamic samplers leads to a static CutSet order: language C is always trained after B, which is always trained after A. This also biases the model heavily towards the CutSet it was trained on last. For example, all my inference tests at the end of one epoch had a French dialect.

I found a workaround: randomize the CutSet contents before training. It is quite memory-intensive on a large dataset (~60 GB needed for almost the complete Common Voice 13) and also quite slow, since it's a single-threaded process; it takes about 10 minutes on my AI server. I still want to improve it, for example by having it reshuffle after each epoch (currently it runs once at training start, and only if there is no randomized file already). But you could have a look at my branch; maybe it's helpful for you:

https://github.com/lifeiteng/vall-e/compare/main...RuntimeRacer:vall-e:cuts_randomizer
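For a rough idea of what such a pre-shuffle looks like, here is a minimal sketch (not the code from the branch above; the manifest paths and the seed are placeholders) that materializes the combined CutSet, shuffles it globally, and writes the randomized manifest back out so the trainer reads that file instead:

```python
# Minimal sketch: shuffle the combined CutSet once before training so that
# languages/speakers are no longer grouped in concatenation order.
# Paths and the seed are placeholders, not taken from the linked branch.
import random
from lhotse import CutSet

rng = random.Random(42)

# Load the combined training manifest (eager shuffling needs enough RAM to hold it).
cuts = CutSet.from_file("data/manifests/cuts_train.jsonl.gz")

# Materialize the (possibly lazy) manifest and shuffle it globally.
shuffled = cuts.to_eager().shuffle(rng=rng)

# Persist the randomized manifest so the slow, single-threaded shuffle runs only once;
# point the trainer at this file instead of the original one.
shuffled.to_file("data/manifests/cuts_train_shuffled.jsonl.gz")
```

Re-running this with a different seed before each epoch would approximate the per-epoch resampling mentioned above.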

I also attached a screenshot of how this stabilized my training; the arrows point to where it was applied, after 2 epochs without this pre-processing:

[screenshot of training loss curves]

MajoRoth commented 8 months ago

I'm facing the same issue and trying to debug it. What causes the Lhotse dynamic samplers to load cuts in order? I'm shuffling the files in the tokenization step and using shuffle=True, but I'm still getting odd loss graphs that indicate something is wrong:

[screenshots of loss graphs]

This pattern recurs every epoch... any clues?
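For reference, a minimal sketch of the kind of sampler setup being described (the path and numeric values are placeholders, not taken from this thread), assuming recent Lhotse behaviour where shuffle=True is a buffered streaming shuffle whose seed is combined with the epoch passed to set_epoch():

```python
# Minimal sketch of a Lhotse dynamic-sampler setup (paths/values are placeholders).
# With shuffle=True the sampler randomizes within an internal buffer, so the
# resulting order still depends on the order of the underlying manifest and on
# calling set_epoch() so the shuffle differs from one epoch to the next.
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/manifests/cuts_train.jsonl.gz")

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=120.0,  # seconds of audio per batch (illustrative value)
    num_buckets=10,
    shuffle=True,
    drop_last=False,
)

for epoch in range(10):
    sampler.set_epoch(epoch)  # otherwise every epoch replays the same shuffled order
    for batch_cuts in sampler:
        pass  # each iteration yields the cuts for one batch; run a training step here
```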