kohya-ss / sd-scripts

Apache License 2.0

LearningRate-Free Learning Algorithm #181

Open BootsofLagrangian opened 1 year ago

BootsofLagrangian commented 1 year ago

Hi, how about D-adaptation?

This is an algorithm where the end user doesn't need to set a specific learning rate.

In short, D-Adaptation uses a boundedness argument to find a proper learning rate.

So it might be useful to anyone who finds it hard to choose hyperparameters.

Before writing this issue, I implemented the D-Adaptation optimizer (Adam variant) for LoRA training. It works!

Only a little code is needed. But since I don't know the sd-scripts codebase well, some of it is hard-coded.

The only requirements for D-Adaptation are torch>=1.5.1 and pip install dadaptation.

Here are the changes.

In train_network.py, add import torch.optim as optim (used for a plain learning rate scheduler) and import dadaptation, then hard-code the optimizer: replace optimizer = optimizer_class(trainable_params, lr=args.learning_rate) with optimizer = dadaptation.DAdaptAdam(trainable_params, lr=1.0, decouple=True, weight_decay=1.0). Setting decouple=True makes the optimizer behave like AdamW rather than Adam, and weight_decay is the L2 penalty.
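A minimal sketch of that swap, assuming the surrounding train_network.py structure (only the optimizer construction changes):

import torch.optim as optim  # used further down for a plain LambdaLR scheduler
import dadaptation

# before:
# optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
# after (hard-coded):
optimizer = dadaptation.DAdaptAdam(
    trainable_params,
    lr=1.0,            # D-Adaptation expects 1.0 here; it scales the adapted rate up or down
    decouple=True,     # decoupled weight decay, i.e. AdamW behavior instead of Adam
    weight_decay=1.0,  # L2 penalty
)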

The other arguments are probably not something the end user needs to touch.

trainable_params also doesn't need a specific learning rate, so replace trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr) with trainable_params = network.prepare_optimizer_params(None, None).

In sd-scripts, lr_scheduler is returned by the get_scheduler_fix function.

But for reasons I don't understand, using get_scheduler_fix interferes with D-Adaptation,

so I override lr_scheduler with a LambdaLR. Sorry for the hard coding again :)

lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 1, lambda epoch: 1], last_epoch=-1, verbose=False)

For monitoring the d*lr value,

logs['lr/d*lr'] = optimizer.param_groups[0]['d']*optimizer.param_groups[0]['lr']

might be needed (this logs the adapted d times the lr scale factor of the first parameter group, i.e. the effective learning rate actually being applied). That's everything.

(Image: d*lr vs. step graph from a training run with D-Adaptation.)

I trained a LoRA using D-Adaptation; the result is here.

Thank you!

AI-Casanova commented 1 year ago

@BootsofLagrangian Would you be able to fork the repo and commit your changes so it would be easier for plebs like me to follow your changes?

BootsofLagrangian commented 1 year ago

@BootsofLagrangian Would you be able to fork the repo and commit your changes so it would be easier for plebs like me to follow your changes?

Sorry, I'm not familiar with GitHub. It would take me a while to make a fork or repo.

Also, these changes include hard-coded parts; is that a problem?

AI-Casanova commented 1 year ago

When you fork, the code becomes your own, and you can hard code changes into your own copy. But that's ok. Maybe I'll do it, and @ you if I have any problems.

(Also, unless you're on mobile, forking is fast and easy, click fork, click done, and boom)

BootsofLagrangian commented 1 year ago

When you fork, the code becomes your own, and you can hard code changes into your own copy. But that's ok. Maybe I'll do it, and @ you if I have any problems.

(Also, unless you're on mobile, forking is fast and easy, click fork, click done, and boom)

I made a fork.

I just changed train_network.py and requirements.txt

bmaltais commented 1 year ago

Cool stuff

bmaltais commented 1 year ago

This should be added as an official feature to the project. Like it!

bmaltais commented 1 year ago

@BootsofLagrangian I see that both the TE LR and UNet LR are no longer specified. Do you know if D-Adaptation sets both to the same value? And if it does, do you know if it is possible to set them to different values? For LoRA it used to be that setting the TE to a smaller LR than the UNet was better. Not sure how this handles each of them.

AI-Casanova commented 1 year ago

@bmaltais wouldn't you have to process it twice, with lr=1.0 for the UNet and <1 for the TE? Since in essence you have two different training problems going on at once?

From the source repo: "Set the LR parameter to 1.0. This parameter is not ignored; rather, setting it larger or smaller will directly scale up or down the D-Adapted learning rate." Sounds like 1.0 and 0.5 would match the commonly used settings (1e-4 and 5e-5).

And maybe D-Adaptation is best suited for the UNet, since underfitting the text encoder is often desirable.

AI-Casanova commented 1 year ago

@BootsofLagrangian you're awesome! Can't wait to play with it!

BootsofLagrangian commented 1 year ago

@BootsofLagrangian I see that both the TE LR and UNet LR are no longer specified. Do you know if D-Adaptation sets both to the same value? And if it does, do you know if it is possible to set them to different values? For LoRA it used to be that setting the TE to a smaller LR than the UNet was better. Not sure how this handles each of them.

Yes, using the lr argument can make the TE LR and UNet LR different. @AI-Casanova's comment is also right.

I'm not sure, but using the get_scheduler_fix function in train_network.py properly would be the right way to apply different LRs.

Or directly: lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1], last_epoch=-1, verbose=False)
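For reference, a lightly annotated version of that override; mapping the first parameter group to the text encoder and the second to the UNet is an assumption about how prepare_optimizer_params orders the groups, so verify it against your copy:

lr_scheduler = optim.lr_scheduler.LambdaLR(
    optimizer=optimizer,
    lr_lambda=[
        lambda epoch: 0.5,  # first param group (assumed text encoder): scale the adapted LR by 0.5
        lambda epoch: 1.0,  # second param group (assumed UNet): keep the adapted LR unchanged
    ],
    last_epoch=-1,
    verbose=False,
)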

bmaltais commented 1 year ago

I also discovered there are two other adaptive methods. I was shocked at how high the SGD method ramped the LR up (1.03e+00), but the results were still good. My god!

Sample from SGD training:

(image: grid-0432)

bmaltais commented 1 year ago

Link to python module for reference: https://pypi.org/project/dadaptation/

AI-Casanova commented 1 year ago

I intuitively knew that there must be a way of adjusting the learning rate in a context-dependent manner, but I knew I was far too uninformed to come up with one. This is definitely cool stuff.

bmaltais commented 1 year ago

Quick comparison results from DAdaptAdam with TE:0.5 and UNet:1.0:

DAdaptAdam-1-1: loss 0.125, dlr 4.02e-05
DAdaptAdam-0.5-1: loss 0.124, dlr 4.53e-05

DAdaptAdam-1-1: (image: grid-0436)

DAdaptAdam-0.5-1: (image: grid-0434)

I think the winner is clear. The TE LR needs to be half of the UNet LR... but there might be more optimal settings.

Optimizer config for both was: optimizer = dadaptation.DAdaptAdam(trainable_params, lr=1.0, decouple=True, weight_decay=0, d0=1e-6)

I will redo the same test but with an optimizer config of: optimizer = dadaptation.DAdaptSGD(trainable_params, lr=1.0, weight_decay=0, d0=1e-6)

AI-Casanova commented 1 year ago

@bmaltais how did you implement the split learning rate? Or did you run it twice?

bmaltais commented 1 year ago

@AI-Casanova I did it with

lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1], last_epoch=-1, verbose=False)

AI-Casanova commented 1 year ago

@bmaltais awesome! I should have pulled on that thread, but my self-taught learning rate for all things Python and ML is already through the roof. 😅

bmaltais commented 1 year ago

Here is an interesting finding. For DAdaptSGD, having the TE and UNet lambdas both at 1 is better than 0.5/1...

DAdaptSGD-1-1: (image: grid-0438)

DAdaptSGD-0.5-1: (image: grid-0437)

I wonder if having a weaker UNet with DAdaptSGD might be even better... like DAdaptSGD-1-0.5

Also, I have not been able to get anything out of DAdaptAdaGrad yet.

bmaltais commented 1 year ago

And here are the results of DAdaptSGD-1-0.5:

(image: grid-0439)

I think DAdaptSGD-1-1 is still the best config for that method.

Well... I am looking at the results and I am not so sure anymore... Maybe DAdaptSGD-1-0.5 is better...

AI-Casanova commented 1 year ago

SGD is stochastic gradient descent right? Is that the same concept as SGD=(batch=1)?

Or is SGD scheduling about not having a weight decay like Adam?

Is batch=1 even SGD with Adam?

Primary sources are impenetrable and secondary sources are so unreliable on this stuff.

bmaltais commented 1 year ago

Good question... I don't really know. But DAdaptAdam-0.5-1 appears to produce the best likeness of all the methods... so I might stick with that for now...

bmaltais commented 1 year ago

Published 1st model made with this new technique: https://civitai.com/models/8337/kim-wilde-1980s-pop-star

AI-Casanova commented 1 year ago

I'm experiencing what I think is a way overtrained TE, even at 0.5. All styling goes out the window before my UNet catches up.

I have to figure out how to log what the learning rates are independently.

AI-Casanova commented 1 year ago

So @BootsofLagrangian was outputting the TE learning rate to the progress bar and logs, which means that what I thought was a suspiciously high UNet LR was actually an insanely high TE LR.

Dropped my scales to 0.25/0.5 and am trying again.

AI-Casanova commented 1 year ago

Unfortunately, it's starting to look like I've replaced one grid search with another, with the scaling factor taking the place of the LR.

BootsofLagrangian commented 1 year ago

@AI-Casanova, you might need another learning rate scheduler. My fork uses only LambdaLR (identity or scalar scaling).

This is a problem caused by not using the get_scheduler_fix function from sd-scripts.

Usually, Transformer models use a warmup LR scheduler.

According to the dadaptation repo, applying the LR scheduler you used before also works fine.

AI-Casanova commented 1 year ago

@BootsofLagrangian basically what I was seeing was very good likenesses being made, but they were very inflexible.

I think I might have hit the sweet spot at 0.125/0.25 though.

It still adjusts to my datasets, and is in a similar range as before.

Now I'm gonna add a few other ideas to this fork.

tsukimiya commented 1 year ago

@BootsofLagrangian I have tried the fork and it seems to misbehave when the value of network_alpha is not equal to the value of network_dim. Is it expected behavior that the smaller the network_alpha, the higher the learning rate?

With network_dim=128 and network_alpha=1, the outputs were destroyed after about 50 steps.

BootsofLagrangian commented 1 year ago

@BootsofLagrangian I have tried the fork and it seems to misbehave when the value of network_alpha is not equal to the value of network_dim. Is it expected behavior that the smaller the network_alpha, the higher the learning rate?

With network_dim=128 and network_alpha=1, the outputs were destroyed after about 50 steps.

D-Adaptation uses the inverse of the model's subgradient. If you want the equations, the details are in the dadaptation paper.

A LoRA model is the product of two low-rank (r) matrices, B and A.

In the LoRA paper, alpha and rank appear as external multiplicative factors on the model update.

Alpha multiplies the update and rank divides it.

So the alpha/rank ratio acts directly and sensitively on the subgradient.

In the destroyed case, alpha=1 and rank=128, so the alpha/rank ratio is 1/128. This makes the subgradient smaller.

Now, back to D-Adaptation: a small subgradient makes the adapted learning rate higher, and a high learning rate blows the model up.

Therefore, it is highly recommended to set alpha and rank to the same value, especially when using a large rank.
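A self-contained toy sketch of this effect (illustrative names only, not sd-scripts internals): the LoRA update is scaled by alpha/rank before it is added to the frozen layer's output, so the gradients reaching the LoRA matrices carry the same factor, and a small ratio shrinks the subgradient norms that D-Adaptation measures.

import torch

in_dim, out_dim, rank, alpha = 320, 320, 128, 1
scale = alpha / rank                                   # 1/128 in the destroyed case

x = torch.randn(4, in_dim)
target = torch.randn(4, out_dim)

frozen = torch.nn.Linear(in_dim, out_dim, bias=False)  # base weight, not trained
frozen.requires_grad_(False)
lora_down = torch.nn.Linear(in_dim, rank, bias=False)  # A
lora_up = torch.nn.Linear(rank, out_dim, bias=False)   # B

out = frozen(x) + lora_up(lora_down(x)) * scale        # LoRA-style forward pass
loss = (out - target).pow(2).mean()
loss.backward()

# The alpha/rank factor multiplies the gradients on A and B, so a small ratio
# (1/128) shrinks the subgradient norms D-Adaptation measures, and the adapted
# d*lr it picks grows correspondingly -- large enough to blow up training.
print(lora_down.weight.grad.norm(), lora_up.weight.grad.norm())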

Thank you for the comment and the experiments! :)

tsukimiya commented 1 year ago

Now, back to D-Adaptation: a small subgradient makes the adapted learning rate higher, and a high learning rate blows the model up.

Therefore, it is highly recommended to set alpha and rank to the same value, especially when using a large rank.

Understood. In that case, it would be better to emit a warning when the alpha option is set small, etc., when the code is actually incorporated.

Thanks for your reply!

rockerBOO commented 1 year ago

(TensorBoard screenshots attached)

Tried out @BootsofLagrangian's fork; it works really well IMO. Green is D-Adaptation and orange is a 1e-4 learning rate with 5e-5 for the text encoder. Also added anime regularization images to the green runs. Shown below after 2000 steps, with --noise_offset=0.1. (image: grid-0246)

shirayu commented 1 year ago

Note: some posts related to learning:

BootsofLagrangian commented 1 year ago

The usage of my fork has changed.

Now, passing the --use_dadaptation_optimzer arg activates dadaptation.

The learning rate, UNet LR, and TE LR args are still available, but they do not act as ordinary LRs.

Single-digit float values are appropriate for the LR, UNet LR, and TE LR, e.g. 1.0, 1.0, 0.5.

kohya-ss commented 1 year ago

D-Adaptation optimizer is finally implemented. Thank you to @BootsofLagrangian for the PR and thank you all for the great research!

davidhsuUI commented 1 year ago

So glad I can ignore setting the LR now, thanks!! @BootsofLagrangian @kohya-ss I had to look a bit, but it looks like it is under the "optimizer" setting. (screenshot attached)

So does this mean that if I set this option, I don't need to set the LR value / it ignores it?

bmaltais commented 1 year ago

Glad you found where the option was. It is indeed nice not to have to specify the LR.

davidhsuUI commented 1 year ago

Really appreciate your efforts; however, I seem to have more success with the original post than the current one. With the original one the results were great, and I'm still tweaking it. Do the above posts imply that the LR increases when applying a dampening factor (like network alpha), rather than decreasing like it normally does?

With the new one, the LR seems really low and can't produce results resembling the input images, even using 5 times the steps I normally use. But I may have mis-tweaked it; have you had success?

bmaltais commented 1 year ago

Interesting. I had a feeling the new D-Adaptation was different from the one in the branch... Some day I will see if there is a way to enable the original method instead of using the old branch for the task.

rockerBOO commented 1 year ago

I have also been having issues with the D-Adaptation implementation. I originally used it in PR form and it was working well, but I tried it recently in various tests and I couldn't keep it from exploding and causing loss=nan. Learning also seems very slow, even though it has a decent dlr (lower average magnitude/average strength). Tried upping the unet_lr and text_encoder_lr to 2, 1.15, 1.25, or lowering to 0.75 (which I know isn't a multiplier), but still had poor results. Tried optimizer_args of "decouple=True" and/or weight_decay from 0.2, 0.1, 0.01, 1e-4, 1e-5, 1e-6, and nothing seemed to help.

I will try some tests on the code from BootsofLagrangian's branch to compare. I can also try to compare the code and see if there is something that stands out to me.

BootsofLagrangian commented 1 year ago

@rockerBOO I found some changes between the first fork and the current D-Adaptation.

The current version only uses one learning rate for the UNet and the text encoder.

I think this is the correct method, following the reference.

So it is recommended to use a lower UNet learning rate (e.g. 0.5) and the optimizer args "decouple=True" and "weight_decay=1.0".

Because D-Adaptation relies on boundedness, it will push the dlr as high as it can (it behaves as if it is using the maximal learning rate).

Anyway, try "weight_decay=1.0" in optimizer_args and a lower coefficient for the UNet/TE LR (e.g. 0.5).

rockerBOO commented 1 year ago

@BootsofLagrangian thanks for taking a look!

Settings:

network_dim=16
network_alpha=8
unet_lr=0.5
text_encoder_lr=0.5
optimizer_type="DAdaptation"
optimizer_args=["decouple=True", "weight_decay=1.0"]
lr_scheduler="constant"
min_snr_gamma=5

Tried weight decay 0.5 and 1.5 as well.

(TensorBoard screenshots attached)

And the last, light-blue run is without min_snr_gamma but has the same problem:

(TensorBoard screenshot attached)

Once it starts going up, it starts producing noise, climbs fast, and never recovers. Using the same dataset and settings with AdamW (changing only the learning rate, with low or no weight decay) produces good results, so the problem lies within the settings reported here.

Example of the noise: (sample images attached)

Edit: also noting that I'm running batch size 2, gradient accumulation steps 24 in these tests. Maybe this is affecting how it works?

BootsofLagrangian commented 1 year ago

@rockerBOO

First, rank (dimension) and alpha should be the same value with D-Adaptation. The α/r ratio has a direct impact on the learning rate and the weights (the model). d*lr will increase when α/r decreases. So controlling the α and r values is an important and sensitive matter.

Second, I don't have any experiments with min_snr_gamma, but I think min_snr_gamma accelerates training, and D-Adaptation does too. Combining the two methods makes the model explode at an earlier step. (There is also some math needed to understand the assumptions behind D-Adaptation: it supposes the model is a kind of Lipschitz function, but the SD model isn't. So mathematically, D-Adaptation does not guarantee that the automatically chosen LR leads the model to convergence, and D-Adaptation combined with another speed-up method can make the model blow up.)

Third, the LR scheduler may matter. Most Transformer models (including Stable Diffusion) use a learning rate scheduler with warmup or restarts. It helps the model update with small weight changes (ΔW) and find a global minimum. You might want to consider using an LR scheduler with warmup and restarts (I recommend lr_scheduler=cosine_with_restarts and lr_warmup at 5-10% of total steps).

rockerBOO commented 1 year ago

Thanks for these suggestions @BootsofLagrangian . I am still working through the different permutations and having varying results. Trying to isolate it to specific parameters that may be having a larger impact. I will try to assess and report back.

network_dim=16
network_alpha=16 # match dim
unet_lr=0.5 # the highest value of these will be the learning rate
text_encoder_lr=0.5 # the highest value of these will be the learning rate
optimizer_type="DAdaptAdam"
optimizer_args=["decouple=True", "weight_decay=1.0"] # weight decay may not be necessary, can help with overfitting, play with different values and look up for more info
lr_scheduler="cosine_with_restarts"
lr_warmup_steps=350 # 5-10% of total steps

In my initial findings, 0.5 LR, matching rank/alpha, no min_snr_gamma (mostly to remove a variable), and using warmup with cosine_with_restarts seemed to work a lot better. But it isn't consistently better with these options, and I'm trying other options as well, so I haven't pinned anything down.

I would say a warmup is ideal with D-Adaptation in my experimentation, as it tamps the dynamic learning rate down. Too long or too short a warmup can drastically affect the dynamic learning rate, in my limited experience (needs more testing). The cycling learning rate also seems to help tamp down, or sometimes let expand, the dynamic learning rate.
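A rough sketch of warmup plus cosine-with-restarts scheduling on top of a D-Adaptation optimizer, using the Hugging Face transformers helper (sd-scripts wires its schedulers up through get_scheduler_fix, so this is only an approximation of what the settings above do; the step counts are illustrative, based on the numbers mentioned in this thread):

import torch
import dadaptation
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

params = torch.nn.Linear(8, 8).parameters()   # stand-in for trainable_params
optimizer = dadaptation.DAdaptAdam(params, lr=1.0, decouple=True, weight_decay=1.0)
lr_scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=350,      # roughly 5-10% of total steps, as suggested above
    num_training_steps=5000,   # illustrative total step count
    num_cycles=3,              # number of restarts
)

# During training, call optimizer.step() then lr_scheduler.step() each iteration,
# so the warmup ramps the scale factor up while D-Adaptation adapts d underneath it.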

DarksealStudios commented 1 year ago

Noob here... Do you see better results at lower rates? Setting 0.5, 0.5, 0.25? ... 0.25, 0.25, 0.125? I feel like I do. What about doing 5 epochs of 1, 1, 0.5, then stopping and continuing at half the rate, and repeating? I am training in kohya and am unfamiliar with writing my own step-down code, so I am doing it manually. Any thoughts on this process? Is it a placebo, or are the results really better?

BootsofLagrangian commented 1 year ago

@DarksealStudios

Reducing the learning rate after some epochs is a useful method and not a placebo effect. Most learning rate schedulers do exactly that. If you are interested in the effect, search for the keywords 'local minimum', 'learning rate scheduling', and 'decaying learning rate'.
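To make that concrete, here is a minimal sketch of scheduled learning-rate decay using PyTorch's built-in StepLR, which simply halves the rate every few epochs instead of requiring a manual stop-and-restart (the model and numbers are illustrative only):

import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(8, 8)                          # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)  # halve the LR every 5 epochs

for epoch in range(15):
    # ... forward/backward passes and optimizer.step() for this epoch ...
    scheduler.step()                   # decay the LR at the end of the epoch
    print(epoch, scheduler.get_last_lr())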

DarksealStudios commented 1 year ago

I only asked @BootsofLagrangian because of the language used in kohya when the training begins. Its wording confused me... that, and once I read up on the schedulers, it seemed like they already do what I was trying to mimic manually. Thank you for letting me know kohya is not overruling the settings (right?). For example, when I set 1, 0.5, 1... the text learning rate is 0.5, but the wording makes it sound like all the settings were changed to 0.5... I'll have to copy it next time, but I'm sure you know the text I'm talking about, something about using only the "first" setting. Anyway, thank you!

phasiclabs commented 1 year ago

Hi all - I was wondering how you are specifying different LRs for the UNet and the text encoder? Unless I specify the same values in the UI, I just get the following error:

RuntimeError: Setting different lr values in different parameter groups is only supported for values of 0

Was this something that was changed in a recent update? There are some other recent reports of this, eg. https://github.com/kohya-ss/sd-scripts/issues/555

AI-Casanova commented 1 year ago

@phasiclabs I do believe that the newest version of Dadaptation can only be set to 0/1 for each TE and UNet.

This was the original implementation that allowed for a scalar

phasiclabs commented 1 year ago

Ah, OK, thanks for the info - just found this post too: https://github.com/kohya-ss/sd-scripts/issues/274 But I'm not seeing that (more informative) message!

wuliebucha commented 11 months ago

(quoting rockerBOO's settings and noise results from the comment above)

Have you solved this issue? I'm running into the same thing.