konstmish / prodigy

The Prodigy optimizer and its variants for training neural networks.
MIT License

Question #3

Closed DarkAlchy closed 11 months ago

DarkAlchy commented 1 year ago

I am noticing a trend that has happened to both my friend and me as we use CosineAnnealingLR with Prodigy: the model doesn't begin to really learn until training is almost done. I am wondering what is going on, because the number of steps we give it doesn't seem to matter; it always waits until it is almost done to learn. I love what I am seeing from it and from the logs, but this trend says we are doing something wrong.

opt = Prodigy(net.parameters(), lr=1., weight_decay=weight_decay)

We have not set a weight decay, as I only just noticed that part, but in CosineAnnealingLR we set T_max to num_epochs and eta_min to 1e-6 (leaving it at 0 was not nearly as good).
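
Roughly, the setup looks like this (net, num_epochs, and weight_decay are placeholders from my own script):

```python
import torch
from prodigyopt import Prodigy

# rough sketch of the setup described above; `net`, `num_epochs`, and
# `weight_decay` stand in for values from my training script
opt = Prodigy(net.parameters(), lr=1., weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=num_epochs, eta_min=1e-6)
```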

Can you think of anything causing this?

konstmish commented 1 year ago

Thanks for trying out the optimizer! This is not too surprising to me; I think the jump at the very end mostly comes from cosine annealing. I often observe similar behavior with SGD and Adam as well. My intuition is that we want the learning rate to stay sufficiently large for most of training, since that appears to have a regularization effect. For instance, Adam can do a bit better at the beginning if we use a smaller learning rate, but it then converges to a worse test accuracy. With large values of weight decay, this effect is sometimes even more pronounced.

DarkAlchy commented 1 year ago

@konstmish What would you suggest we use, and what value for the weight decay?

konstmish commented 1 year ago

Based on what you wrote, it might be that the estimated stepsize is too large, which you can test by setting d_coef to a value smaller than 1, for instance by passing d_coef=0.2. It's a bit hard for me to say confidently without knowing what kind of network/data you use, but one thing you can check is how much the weights of your network change over time. If the weight norm blows up and keeps changing even with smaller values of d_coef, it is likely that the estimated stepsize is too large. If the weight norm remains more or less the same over time, the issue lies elsewhere.
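
For example, something along these lines would let you track the overall weight norm once per epoch (just a sketch, assuming a standard PyTorch training loop with a model called net):

```python
import torch

def total_weight_norm(model):
    # L2 norm over all parameters, computed without tracking gradients
    with torch.no_grad():
        return torch.sqrt(sum(p.pow(2).sum() for p in model.parameters())).item()

# e.g. once per epoch inside the training loop:
# print(f"epoch {epoch}: weight norm = {total_weight_norm(net):.4f}")
```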

Another approach would be to use a learning rate scheduler with a warmup for a few epochs and pass safeguard_warmup=True to Prodigy. This is what we did for training ViT on ImageNet (AdamW uses warmup as well), although my tests showed it led to roughly the same final test accuracy as no warmup; only the intermediate values were better.
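
A minimal sketch of that idea, using a simple linear warmup via LambdaLR (net and warmup_epochs are placeholders):

```python
import torch
from prodigyopt import Prodigy

# safeguard_warmup=True is recommended when the lr schedule includes a warmup
opt = Prodigy(net.parameters(), lr=1., safeguard_warmup=True)

warmup_epochs = 5  # placeholder, tune for your setup
scheduler = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))
```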

For weight decay, Prodigy sets decouple=True by default, which makes it similar to AdamW, whose default weight decay is 0.01. For this reason, I'd suggest trying weight_decay=0.01 or weight_decay=0.05; these values should work well on most problems. For example, we used 0.05 for ViT training on ImageNet.

Please do let me know if you get more feedback on how the method works, this is very useful!

DarkAlchy commented 1 year ago

So I would pass d_coef in optimizer_args along with weight_decay=0.05?

konstmish commented 1 year ago

You can try setting d_coef and weight_decay either together or separately, as they serve different purposes; it's up to you how much you want to experiment with the optimizer.

Also, do you know the final value of d in the optimizer after the training stops? It should be possible to get it using opt.param_groups[0]['d'].
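
Something like this right after training finishes (sketch):

```python
# read out the final estimated stepsize d
d_final = opt.param_groups[0]['d']
print(f"final d: {d_final:.3e}")
```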

DarkAlchy commented 1 year ago

No idea, but after switching to plain cosine and adding the above, the outcome was exactly the same. For the record, I am using this with Stable Diffusion. For whatever reason, Prodigy seems to be the issue, since switching between cosine, CosineAnnealingLR, and constant didn't change what I am seeing for the exact same training images and steps. As a matter of fact, I did 50 epochs and it learned absolutely nothing.

konstmish commented 1 year ago

Thanks for the detailed information. I can try to look into it if there is a way for me to reproduce your observations. Are you fine-tuning Stable Diffusion? If so, which layers do you fine-tune? If you follow a specific tutorial or there are similar training scripts, it would be nice if you could share them.

DarkAlchy commented 1 year ago

Ahh, no fine-tuning, as that is too much effort for too little gain for me. I am using Kohya's sd-scripts with LyCORIS; my friend is using Kohya_ss (the GUI version) and tried LoRA, LoCon, and LyCORIS, I believe, as well as a Textual Inversion embedding, all with the same results. We are stumped.

DarkAlchy commented 1 year ago

Something is wrong with Prodigy, because I ditched annealing and went with a constant schedule, and again it suddenly began to learn around epoch 80 out of 100. It is really weird that the last 20% (regardless of the max epochs) is when it learns.

konstmish commented 1 year ago

Hmmm, that is actually very weird, because without any scheduler the optimizer is completely agnostic to how long you run it for; it should behave exactly the same after a fixed number of epochs. In other words, the optimizer is not aware of the total number of epochs when there is no scheduler.

DarkAlchy commented 1 year ago

In training, the scheduler is constant:

[screenshot]

DarkAlchy commented 1 year ago

@konstmish I have had a real oddity happen 4 times. I took away annealing and it worked very well, BUT what it learned was almost a duplicate of my training images, ignoring my prompt completely. I tried an AdamW training and it worked. The cool thing is that, with constant instead of annealing, it had learned by the first save (every 2 epochs for a grand total of 20 epochs), but it is as if it never learned the TE, so it just ignored my prompts and gave me the UNet image it had learned. The outputs were almost carbon copies of the training images.

konstmish commented 1 year ago

What were the parameters that you used for AdamW? AdamW sets weight_decay=0.01 by default, so if you didn't manually set it to 0 in AdamW, this might be the difference between Prodigy and AdamW.

DarkAlchy commented 1 year ago

I think we are going off the rails here. When I used Prodigy, I set weight_decay=0.01 so it would be identical to AdamW. I keep as much the same as I can between them when testing.

konstmish commented 1 year ago

Can you also share what learning rate you used for AdamW?

DarkAlchy commented 1 year ago

1e-4 for the UNet and 5e-5 for the TE.
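
In plain PyTorch terms that corresponds to roughly this (sketch; unet and text_encoder stand in for whatever modules the trainer builds):

```python
import torch

# separate learning rates per parameter group, as configured in the trainer
opt = torch.optim.AdamW([
    {"params": unet.parameters(), "lr": 1e-4},
    {"params": text_encoder.parameters(), "lr": 5e-5},
])
```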

DarkAlchy commented 1 year ago

I just tried a lot of parameters for this that I did not even know existed, and it learned far faster: 10 images for 40 epochs the old way vs. Prodigy, and Prodigy learned (it still had more to learn). I want to try restarts next (not annealing, as after a lot of testing my friend and I feel it was either implemented wrong or we were never given all the parameters we need).

What is a good number of restarts to use with cosine (I was using cosine with no restarts)?

konstmish commented 1 year ago

Thanks for trying this out! Can you mention exactly which parameters you found useful to tweak?

To be honest, I've never tried restarts with the method; usually they only improve the performance of early models with no impact on the final performance. The original paper on cosine annealing suggested either using restarts infrequently (for example, every 25% of training) or doubling the length of each period, e.g., doing restarts after epochs 1, 3, 7, 15, 31, 63, etc.
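
In PyTorch, the doubling variant corresponds to something like this (sketch; opt is your Prodigy optimizer):

```python
import torch

# period doubling: restarts after epochs 1, 3, 7, 15, 31, ...
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt, T_0=1, T_mult=2)
```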

DarkAlchy commented 1 year ago

I just tried cosine with restarts, with 1 restart per epoch. So you are saying that for annealing I should make it epochs/2 or epochs/4?

["decouple=True","weight_decay=0.01","d_coef=2","use_bias_correction=True","safeguard_warmup=True"]

I had the first two, and it was failure after failure for both of us, so we gave up. I just tried that for four trainings, and those last three seem to have been what was needed.
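
As far as I can tell, that list corresponds to a direct call roughly like this (sketch; params stands in for the trainable parameters):

```python
from prodigyopt import Prodigy

opt = Prodigy(params, lr=1., decouple=True, weight_decay=0.01, d_coef=2,
              use_bias_correction=True, safeguard_warmup=True)
```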

konstmish commented 1 year ago

Yeah, the paper on cosine annealing didn't do restarts every epoch; it either doubled the gap between restarts or just used epochs/2 or epochs/4.

DarkAlchy commented 1 year ago

There is no way to double the gap, so changing the period is not an option, which leaves /2 or /4. Does this work for regular cosine as well? I have had no success with annealing, and it slows everything down.

konstmish commented 1 year ago

What do you mean by "regular cosine"? What specific scheduler are you using?

DarkAlchy commented 1 year ago

Cosine, the one that has been around for eons, long before this annealing stuff.

konstmish commented 1 year ago

can you give a link?

DarkAlchy commented 1 year ago

It was apparently renamed, but the old cosine function/scheduler (still called just "cosine" in the trainers) is the first one [screenshot], while the newer one is the second [screenshot].

What is funny is that the bottom one is actually just a cosine function without any restarts, which is the only one we had. It gets very confusing when names change or the names used in the trainers are wrong.

DarkAlchy commented 1 year ago

An example is that in the trainers since Stable Diffusion was released (from Dreambooth to now the LoRA types), it went from cosine and cosine_with_restarts to now the additional CosineAnnealingLR.

Poiuytrezay1 commented 1 year ago

Following on from this, I was using the same scheduler and implementation (from Kohya's scripts for Stable Diffusion) as @DarkAlchy and got this LR graph: [screenshot]

This is probably not supposed to happen, but I didn't find any LR graph in the paper to compare it with. Could you show what sort of graph Prodigy is supposed to produce with CosineAnnealingLR?

DarkAlchy commented 1 year ago

Yes, after running that, my friend and I saw this exact same behaviour, which I had not seen before. On one of the runs it quickly shot up, did that weird squiggle, and then was fine. I think it has to do with warmups, though this is new behaviour.

konstmish commented 1 year ago

The moments when the learning rate jumps to a high value are the "restarts" in cosine annealing. I personally never use them; I think it's better to use torch.optim.lr_scheduler.CosineAnnealingLR with the T_max argument set to the total number of epochs. Also make sure that the scheduler.step() call is done only once per epoch, for instance as done in this PyTorch tutorial.
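
A minimal sketch of what I mean, with scheduler.step() called once per epoch (opt, num_epochs, dataloader, and compute_loss are placeholders for your own setup):

```python
import torch

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_epochs)

for epoch in range(num_epochs):
    for batch in dataloader:        # placeholder training loop
        opt.zero_grad()
        loss = compute_loss(batch)  # placeholder loss computation
        loss.backward()
        opt.step()
    scheduler.step()                # once per epoch, not once per batch
```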

DarkAlchy commented 1 year ago

(Quoting konstmish's reply above.)

That is not what we are talking about; as I mentioned above, it is that squiggly-line stuff.

Poiuytrezay1 commented 1 year ago

(Quoting konstmish's reply above.)

So, if I understand correctly, the graph I was showing is correct, and Prodigy has no impact on the LR after restarts, right? In this case, we should definitely use no restarts at all.

PS: Sorry if I come across as clueless, I come from a community with a lot of conflicting information, and it is hard to understand the different components we use. I tried to read the paper, but I'm definitely missing a lot of knowledge to understand the different terms used in the formulas.

DarkAlchy commented 1 year ago

(Quoting konstmish's and Poiuytrezay1's replies above.)

Yep, but it would be nice if we had an answer as to what on earth that squiggly-line nonsense is in the graph you showed. It happens most of the time at the beginning, but not always.

artificialguybr commented 1 year ago

@konstmish Stable Diffusion is divided into a UNet and a Text Encoder. The Text Encoder overfits much faster than the UNet.

Is there any way to set different values with Prodigy?

I can set different LR values with Adaw: one for the UNet and another for the Text Encoder.

Is there any way of doing this? Should I use values below 1.0, or will Prodigy do it automatically? What do you recommend?

DarkAlchy commented 1 year ago

(Quoting artificialguybr's comment above.)

I thought I saw that in one of these comments. I see trainers within the Stable Diffusion world have switched away from Prodigy to Adafactor, myself included, due to it being more memory efficient (so I read, but have not yet tested), so that even 24GB video cards can train SDXL.

How does Adaw (not meaning the now-ancient AdamW or AdamW8bit, but the newer versions with Adam in their names) compare in that respect against Adafactor and Prodigy?

konstmish commented 1 year ago

@jvkap The easiest way to force Prodigy to use a smaller learning rate for one of the networks is to set d_coef=0.5 or any other number smaller than 1, such as d_coef=0.1 or even d_coef=0.01. This should be more effective at slowing down the network's convergence than changing the lr hyperparameter, because Prodigy often compensates for small values of lr by estimating a larger value of d.
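
One way to set this up in practice is a separate Prodigy instance per sub-network; this is just a sketch I haven't tested on Stable Diffusion, with unet and text_encoder as placeholders for your modules:

```python
from prodigyopt import Prodigy

# smaller d_coef on the text encoder to slow its convergence relative to the UNet
opt_unet = Prodigy(unet.parameters(), lr=1., weight_decay=0.01)
opt_te = Prodigy(text_encoder.parameters(), lr=1., weight_decay=0.01, d_coef=0.1)
```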

konstmish commented 1 year ago

@DarkAlchy I think Prodigy's memory complexity could be reduced in a way similar to Adafactor, but frankly we don't have the capacity to study this at the moment; our next iteration is going to be about other aspects of training. I'm also not up to date on what's best in terms of memory efficiency. You can try DoWG or layer-wise DoWG, which require no extra memory, but they seemed a bit worse in my experiments.

If anybody is willing to pick this up and try to make it memory-efficient, I'm happy to help.

DarkAlchy commented 1 year ago

(Quoting konstmish's reply above.)

I do not think those are even an option for training Stable Diffusion; they are not implemented anywhere that I know of, or else I would give them a spin.

sangoi-exe commented 1 year ago

(Quoting konstmish's reply above.)

God, that was exactly what I was looking for! Thanks!

I needed the LR of the TE to be at least 10 times lower than the LR of the UNet, and most of the 'guides' on the internet are based on guesswork.

In fact, the only guide that suggested anything about d_coef said to set it to 2! Totally the opposite of what you advised in your answer.

With d_coef at 1 to 2, Prodigy fries the neural network; the first sample already comes out all deformed.