konstmish / prodigy

The Prodigy optimizer and its variants for training neural networks.
MIT License

Question on convergence #18

Open ppbrown opened 7 months ago

ppbrown commented 7 months ago

Disclaimer: I'm very new at all this. I'm coming from the perspective of an inference user: with a good model, a nice prompt, and a good sampler, there tends to be a number of steps past which it will always converge to a nice, stable image for a given seed.

I think I've read that there is a similar effect in model training (SDXL specifically): if you are doing things right, the resulting model will "converge" over the course of training. I take that to mean I would see something similar in the per-epoch sample images; they would converge to something decent over the course of the epochs.

But... I'm not seeing that happen. For example, if I run a 100-epoch training over 80 images, I will see something reasonable congeal around maybe epoch 25... then it gets mushy for a while... and then things come back into focus around epoch 80.

I tried turning on "safeguard_warmup" and "bias correction"... but the overall effect of those combined seemed to be to just stretch out the training cycle. Now the things that happened at epoch 80 happen at 2x80=160 epochs (literally almost the same images).

Are my expectations off? Is it reasonable to believe that there ARE settings that will not only converge, but converge on something sane-looking, given a good input dataset?

I'm using OneTrainer with the following settings at present:

scheduler: constant
learning rate: 1
warmup steps: 200
learning rate cycles: 1
epochs 100
batchsize: 1
gradient accumulation steps: 1
ema: GPU - ema decay 0.999 - update step interval: 5

beta1: 0.9
beta2: 0.999
d0: 3e-06
d_coefficient: 1
decouple: true
eps: 1e-08
relative step: false
safeguard: (true/false)
bias_correction: (true/false)

weight decay: 0.01
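
For reference, here's a rough sketch of how I understand those settings map onto the Prodigy constructor from the prodigy-optimizer package (argument names taken from that package; exactly how OneTrainer wires them up internally is an assumption on my part):

```python
import torch
from prodigyopt import Prodigy  # pip install prodigy-optimizer

model = torch.nn.Linear(16, 1)  # placeholder model just to make the sketch self-contained

optimizer = Prodigy(
    model.parameters(),
    lr=1.0,                     # learning rate: 1 (Prodigy expects lr=1)
    betas=(0.9, 0.999),         # beta1 / beta2
    eps=1e-8,
    d0=3e-6,                    # initial step-size estimate
    d_coef=1.0,                 # "d_coefficient" above
    weight_decay=0.01,
    decouple=True,              # decoupled (AdamW-style) weight decay
    use_bias_correction=False,  # toggled true/false across runs
    safeguard_warmup=False,     # toggled true/false across runs
)
```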
riffmaster-2001 commented 3 months ago

Have you tried COSINE in OneTrainer? The paper also recommends cosine annealing, and OneTrainer's COSINE scheduler is equivalent (reference: https://github.com/Nerogar/OneTrainer/issues/214)

Also try increasing your d_coefficient to 2.0 or 3.0, which will let it jump to higher LRs as it searches for the right learning rate. Watch TensorBoard to see where it's jumping to.
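
Something like this is roughly what I mean (a sketch only, not OneTrainer's internals; I'm assuming the prodigyopt package and that the current step-size estimate lives under `param_groups[0]["d"]`):

```python
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 1)  # stand-in for the real network
total_steps = 10_000

# lr stays at 1.0; the cosine schedule scales it down over the run,
# while d_coef=2.0 lets Prodigy's internal estimate jump higher.
optimizer = Prodigy(model.parameters(), lr=1.0, d_coef=2.0, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % 500 == 0:
        # assumed key name: this is the value worth sending to TensorBoard
        print(step, optimizer.param_groups[0]["d"])
```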

ppbrown commented 3 months ago

I'm confused. I thought the "COSINE" stuff actually reset the LR and made for MORE variance, which intuitively makes me think it's for the opposite?

"Cosine Annealing is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again"

konstmish commented 2 weeks ago

Sorry for the late response, I realize you might not be interested in it anymore but posting since others might find this issue.

In theory, the images might keep changing over the course of training without ever converging to anything specific: as long as the model is training, its parameters keep changing. This effect should also be visible with Adam and its variants (Prodigy itself is a variant of Adam with on-the-fly estimation of the learning rate). As mentioned, cosine annealing with safeguard_warmup might force the optimizer to slow down eventually, though the exact images you'll see depend on other settings, especially the length of the training.
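
To make the "slow down eventually" part concrete: a plain cosine schedule (no warm restarts), which is what CosineAnnealingLR applies on top of the base learning rate, only decays from 1 toward a minimum over the run and never jumps back up. A minimal sketch of that multiplier:

```python
import math

def cosine_multiplier(step, total_steps, min_mult=0.0):
    # plain cosine annealing (no warm restarts): decays monotonically from 1 to min_mult
    return min_mult + 0.5 * (1 - min_mult) * (1 + math.cos(math.pi * step / total_steps))

print([round(cosine_multiplier(s, 100), 2) for s in range(0, 101, 25)])
# [1.0, 0.85, 0.5, 0.15, 0.0]
```

So towards the end of a run the effective steps shrink, which is what eventually pins the samples down.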

I'd also like to add that we don't really know why images get mushy or what makes them converge to any particular distribution. What we should expect in general is that the loss goes down given enough trainable parameters in the model, but how that affects things down the line is usually unclear.

ppbrown commented 2 weeks ago

Thanks for the reply. Right now I'm playing with dadapt-lion, since it has lower memory requirements. With that one I'm seeing it increase for a while, then plateau... then increase again eventually. For that one, I read that it was actually developed with a cosine scheduler in mind, so it almost makes sense. But I didn't think Prodigy was like that too? Both of them seem to do super well with shorter runs, e.g. 10k steps or less, but if you get up to 20k, it feels like it overtrains.
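
In case it matters, this is roughly the setup I mean (a sketch, assuming the dadaptation package's DAdaptLion class; my actual runs go through OneTrainer):

```python
import torch
import dadaptation

model = torch.nn.Linear(16, 1)  # stand-in for the real network

# Like Prodigy, D-Adaptation optimizers expect lr=1.0 and estimate the step size themselves;
# a cosine schedule on top then tapers that estimate over the run.
optimizer = dadaptation.DAdaptLion(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```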

konstmish commented 2 weeks ago

Prodigy is based on Adam/AdamW, which works better with cosine annealing.

Overtraining is something that's very hard to prevent with an optimizer. Weight decay might help with that, but it also matters whether you use data augmentation, how big/diverse your dataset is, etc.

Btw, dadapt-lion is probably the most heuristic optimizer among those with learning rate estimation, and its behaviour is not well understood.