konstmish / prodigy

The Prodigy optimizer and its variants for training neural networks.
MIT License

Question on convergence #18

Open ppbrown opened 6 months ago

ppbrown commented 6 months ago

Disclaimer: I'm very new at all this. I'm coming from the perspective of an inference user: on a good model, with a nice prompt and sampler, there tends to be a number of steps past which it will always converge to a nice, stable image for a given seed.

I think I've read that there is a similar effect in model training (SDXL specifically): if you are doing things right, the resulting model will "converge" over the course of training. I take that to mean I would see something similar in the per-epoch sample images; they would settle on something decent over the course of the epochs.

But... I'm not seeing that happen. For example, if I run a 100-epoch training over 80 images, I will see something reasonable congeal around maybe epoch 25... then it gets mushy for a while... and then things come back into focus around epoch 80.

I tried turning on "safeguard_warmup" and "bias correction", but the overall effect of those combined seemed to be just stretching out the training cycle. Now the things that happened at epoch 80 happen at 2x80 = 160 epochs (literally almost the same images).

Are my expectations off? Is it reasonable to believe that there ARE settings that will not only converge, but converge on something sane-looking, given a good input dataset?

I'm using OneTrainer with the following settings at present (a rough sketch of how they map onto Prodigy's own arguments follows the list):

scheduler: constant
learning rate: 1
warmup steps: 200
learning rate cycles: 1
epochs: 100
batch size: 1
gradient accumulation steps: 1
ema: GPU - ema decay 0.999 - update step interval: 5

beta1: 0.9
beta2: 0.999
d0: 3e-06
d_coefficient: 1
decouple: true
eps: 1e-08
relative step: false
safeguard: (true/false)
bias_correction: (true/false)

weight decay: 0.01
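
For reference, here's a minimal sketch of how the optimizer settings above map onto the prodigyopt package's Prodigy constructor (argument names are from prodigyopt rather than OneTrainer's UI, and `model` is a placeholder for the network being trained):

```python
from prodigyopt import Prodigy

optimizer = Prodigy(
    model.parameters(),
    lr=1.0,                      # learning rate: 1 (Prodigy expects an lr of ~1)
    betas=(0.9, 0.999),          # beta1 / beta2
    eps=1e-8,
    weight_decay=0.01,
    decouple=True,               # decoupled (AdamW-style) weight decay
    d0=3e-6,                     # initial step-size estimate
    d_coef=1.0,                  # "d_coefficient" above
    safeguard_warmup=False,      # the two toggles being experimented with
    use_bias_correction=False,
)
```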
riffmaster-2001 commented 2 months ago

Have you tried COSINE in OneTrainer? The paper also says to use cosine annealing, and OneTrainer's COSINE schedule is equivalent (reference: https://github.com/Nerogar/OneTrainer/issues/214).

Also try increasing your d_coefficient to 2.0 or 3.0, which will let it jump to higher LRs as it searches for the right LR. Watch TensorBoard to see where it's jumping to.
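
Roughly something like this, in plain PyTorch terms (just a sketch, not OneTrainer's actual code; `model`, `compute_loss`, and `num_steps` are placeholders):

```python
import torch
from prodigyopt import Prodigy

# Keep Prodigy's lr at 1.0, raise d_coef, and let a cosine schedule
# decay the effective LR once over the whole run.
optimizer = Prodigy(model.parameters(), lr=1.0, d_coef=2.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    loss = compute_loss()        # one training step's loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()             # cosine decay, no restarts
```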

ppbrown commented 2 months ago

I'm confused. I thought the "COSINE" stuff actually resets the LR and makes for MORE variance, which intuitively makes me think it's for the opposite?

"Cosine Annealing is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again"