kohya-ss / sd-scripts

Differential Output Preservation loss for LoRA #1710

Closed kohya-ss closed 1 month ago

kohya-ss commented 1 month ago

A loss that brings the output when LoRA is applied without a trigger word closer to the output when LoRA is not applied.

FurkanGozukara commented 1 month ago

How do we use this? Can you provide some more info? @kohya-ss

FurkanGozukara commented 1 month ago

@bmaltais can we add this to the GUI? Many people want to test it.

kohya-ss commented 1 month ago

can we add this to the GUI? Many people want to test it.

This is not tested yet, please wait a while until it is merged into the sd3 branch.

dxqbYD commented 1 month ago

  • Specify a large value for the --prior_loss_weight option (not in the dataset config). We recommend 10-1000.
  • Set the weight so that the loss when training without regularization images is close to the loss when training with DOP.

I doubt that this is a good config recommendation.

all my proof-of-concept samples were at weight 1.

Yes, the loss of the reg steps is then very low compared to the train steps, but that is a function of the reg-step prediction already being very close to the target.

You can try weight 10 or even 1000, but I'd expect the regularization to then overwhelm the training steps, so the model doesn't learn anything anymore.
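
To make the weight's role concrete: --prior_loss_weight scales the loss of the regularization (DOP) samples relative to the ordinary training samples before everything is averaged into one update. A rough sketch of that weighting (made-up variable names; not the actual sd-scripts code):

import torch

def weighted_loss(per_sample_loss: torch.Tensor,
                  is_reg: torch.Tensor,
                  prior_loss_weight: float = 1.0) -> torch.Tensor:
    # per_sample_loss: one MSE value per sample in the batch.
    # is_reg: boolean mask marking the regularization / DOP samples.
    weights = torch.where(is_reg,
                          torch.full_like(per_sample_loss, prior_loss_weight),
                          torch.ones_like(per_sample_loss))
    # A large prior_loss_weight makes the (typically small) DOP term dominate
    # the gradient; too large and the LoRA may stop learning the new concept.
    return (per_sample_loss * weights).mean()

So at weight 1 the DOP term contributes in proportion to its raw (small) loss, which is the setting the samples above were produced with.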

recris commented 1 month ago

A loss that brings the output when LoRA is applied without a trigger word closer to the output when LoRA is not applied.

Do you have a link to a research paper (or anything) explaining the concept behind this feature? I'd like to understand this a bit more.

dxqbYD commented 1 month ago

A loss that brings the output when LoRA is applied without a trigger word closer to the output when LoRA is not applied.

Do you have a link to a research paper (or anything) explaining the concept behind this feature? I'd like to understand this a bit more.

I proposed this method and am not aware of any paper. People have noted that the idea is close to the regularization proposed in the original DreamBooth paper, but it's not the same.

Here you can find a proof-of-concept implementation that is easier to understand than the full code: https://pastebin.com/3eRwcAJD

And the SimpleTuner implementation with samples here: https://www.reddit.com/r/StableDiffusion/comments/1g2i13s/simpletuner_v112_now_with_masked_loss_training/
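
In essence, for samples captioned without the trigger word, the training target is replaced by what the base model predicts with the LoRA disabled, so the adapter's output on those prompts is pulled back toward the untouched model. A minimal sketch of that idea, assuming a toy interface where the adapter can be toggled on and off (this is not the pastebin or sd-scripts code):

import torch
import torch.nn.functional as F

def diff_output_preservation_loss(model, noisy_latents, timesteps, class_cond):
    # Target: the frozen base model's prediction with the LoRA disabled,
    # for the same noised input and the class prompt (no trigger word).
    with torch.no_grad():
        model.disable_lora()   # assumed helper, for illustration only
        target = model(noisy_latents, timesteps, class_cond)
        model.enable_lora()
    # Prediction: the same input with the LoRA applied.
    pred = model(noisy_latents, timesteps, class_cond)
    # Pull "LoRA on, no trigger word" toward "LoRA off, no trigger word".
    return F.mse_loss(pred, target)

This is also where the extra cost comes from: each regularization sample needs a second forward pass through the base model to produce its target.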

kohya-ss commented 1 month ago

all my proof-of-concept samples were at weight 1.

Thank you for your valuable insight. That text is old; the latest recommendation is 10-100.

Have you tested it with Flux? I tested it at 100 on SDXL and it seemed to work better than 1, but I tested only once. It seems that Flux needs a smaller value; 10 seems to work.

dxqbYD commented 1 month ago

all my proof-of-concept samples were at weight 1.

Thank you for your valuable insight. That text is old; the latest recommendation is 10-100.

Have you tested it with Flux? I tested it at 100 on SDXL and it seemed to work better than 1, but I tested only once. It seems that Flux needs a smaller value; 10 seems to work.

I only tested (successfully) with Flux, so everything I said above applies to Flux.

I tried SDXL a few months ago, before Flux, but did not get any useful results - maybe this is the reason.

recris commented 1 month ago

Interesting approach. I will give it a go, using training data with adjusted captions. Though having to double the forward passes makes this a questionable trade-off for my use case.

FurkanGozukara commented 1 month ago

all my proof-of-concept samples were at weight 1.

Thank you for your valuable insight. That text is old; the latest recommendation is 10-100.

Have you tested it with Flux? I tested it at 100 on SDXL and it seemed to work better than 1, but I tested only once. It seems that Flux needs a smaller value; 10 seems to work.

Hopefully I will do full research once @bmaltais adds it to the GUI.

recris commented 1 month ago

@kohya-ss I am running into trouble when I set is_reg and conditioning_data_dir (loss masks) in the same [dataset]:

voluptuous.error.MultipleInvalid: extra keys not allowed @ data['datasets'][0]['subsets'][1]['is_reg']

It doesn't have to be in the same subset; for example, this gives me the error:

[[datasets]]
resolution = 512
batch_size = 4
enable_bucket = true

[[datasets.subsets]]
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_default"
conditioning_data_dir = "/home/recris/sd_train/test/export/samples_default/mask"

[[datasets.subsets]]
is_reg = true
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_reg"
custom_attributes.diff_output_preservation = true

UPDATE: this whole PR actually broke training with loss masks for me, I had to rollback to an earlier commit.

kohya-ss commented 1 month ago

UPDATE: this whole PR actually broke training with loss masks for me, I had to rollback to an earlier commit.

I've updated to fix the dataset with conditioning_data_dir.

@kohya-ss I am running into trouble when I set is_reg and conditioning_data_dir (loss masks) in the same [dataset]:

A subset with conditioning_data_dir and a subset with is_reg cannot coexist in the same dataset. This has not changed since before. There are three subset types: DreamBooth, fine-tuning, and ControlNet, and they must be in separate datasets. I'll add some documentation and some checks to give appropriate error messages.

The following should work now:

[[datasets]]
resolution = 512
batch_size = 4
enable_bucket = true

[[datasets.subsets]]
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_default"
conditioning_data_dir = "/home/recris/sd_train/test/export/samples_default/mask"

[[datasets]]
resolution = 512
batch_size = 4
enable_bucket = true

[[datasets.subsets]]
is_reg = true
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_reg"
custom_attributes.diff_output_preservation = true

kohya-ss commented 1 month ago

I only tested (successfully) with Flux, so everything I said above applies to Flux.

I tried SDXL a few months ago, before Flux, but did not get any useful results - maybe this is the reason.

Thanks for the clarification! I've updated the README.

aria1th commented 2 weeks ago

Hello, the initial idea seems to be very similar to our paper: https://arxiv.org/abs/2404.07554 and I totally agree with the idea. The key idea is that "adapter applied, but no token" has to be the same as or similar to "no adapter, no token".

One obvious limitation of this method is that it is not applicable when the vanilla model is not capable, or is 'poisoned' or 'overfitted', which causes a loss-divergence phenomenon (the OFT paper described this as 'hyperspherical energy divergence'), so it might have to be used with caution.

dxqbYD commented 1 week ago

Hello, the initial idea seems to be very similar to our paper: https://arxiv.org/abs/2404.07554 and I totally agree with the idea. The key idea is that "adapter applied, but no token" has to be the same as or similar to "no adapter, no token".

One obvious limitation of this method is that it is not applicable when the vanilla model is not capable, or is 'poisoned' or 'overfitted', which causes a loss-divergence phenomenon (the OFT paper described this as 'hyperspherical energy divergence'), so it might have to be used with caution.

This is the same idea. Thanks for pointing this out, I didn't know about your paper. I think this method has even more potential than just preserving prior knowledge on a separate concept ("no token" in your paper). I've written a few sentences about this here: https://github.com/Nerogar/OneTrainer/pull/505#issuecomment-2474763609 but it's mostly just unpublished experiments at this point.

FurkanGozukara commented 1 week ago

@bmaltais currently do we have to manually edit the toml file to be able to use this?

Did you add it to the GUI? I want to try: custom_attributes.diff_output_preservation = true # Add this

@kohya-ss is this the same as the usual DreamBooth training with reg images, except we only add custom_attributes.diff_output_preservation = true to the toml, or where does it go?

@aria1th any way you could make a pull request to kohya for your paper implementation?

kohya-ss commented 1 week ago

@kohya-ss is this the same as the usual DreamBooth training with reg images, except we only add custom_attributes.diff_output_preservation = true to the toml, or where does it go?

Just add custom_attributes.diff_output_preservation = true and it works. From my understanding it should be based on the same theory as in that paper.

FurkanGozukara commented 1 week ago

@kohya-ss is this the same as the usual DreamBooth training with reg images, except we only add custom_attributes.diff_output_preservation = true to the toml, or where does it go?

Just add custom_attributes.diff_output_preservation = true and it works. From my understanding it should be based on the same theory as in that paper.

Started trainings for testing, thank you.

aria1th commented 1 week ago

@FurkanGozukara I think the current implementation is more stable than our original implementation, since ours takes 2x VRAM (it effectively forces batch size 2+, always one sample with the condition and one without). The current implementation in kohya-ss lets you use the batch size dynamically; however, we observed that asynchronous / separated updates do not ensure the knowledge-preservation behavior.

But yeah, the token-removing function seems to be missing from the implementation; I guess that should be added to correctly specify what has to be preserved as contrastive data.

We are preparing another paper on this phenomenon. I'll prepare to upload results and push a PR within a few weeks with some intensive experiment results.

dxqbYD commented 1 week ago

But yeah, the token-removing function seems to be missing from the implementation; I guess that should be added to correctly specify what has to be preserved as contrastive data.

Isn't the token removal just using different prompts for regular training steps and regularization steps? In kohya and other trainers that's done by configuration, so it isn't as obvious in the code, but it's still there.

If you configure kohya exactly as you propose in your paper, to train with batch size 2 where 1 image is regularization and 1 image is training, the conditioning of the call to the transformer will use 2 different text embeddings - one "with token", one "without token".

however, we observed that asynchronous / separated updates do not ensure the knowledge-preservation behavior.

Have you actually observed this in experiments, not only in theory? The author of SimpleTuner, who also implemented this feature, has reported the opposite: that it's better to have it in separate batches. I've always used separate batches in my experimental implementation in OneTrainer, but I think both should work equally well over the long run. It'll even itself out stochastically; the parameter updates are averaged by the optimizer EMA anyway.

aria1th commented 1 week ago

@dxqbYD Yes, our paper aims at continual learning (for some more specific personalization), so the concept has to be controllable with a class token; but in general, Flux / large-DiT tuning with the current implementation means not losing the original knowledge, so that's the difference. And yes, when tuning the Illustrious model in large-batch experiments, we observed batch-size 1x32 updates behaving worse than 2x32 updates (same LR) or 2x16 updates (double LR). (We experimented with 256/512/4096, though.) But gradient accumulation is okay, which is why batching (providing reg images as a dataset) is the more stable and better way to handle this.

Having them in the same batch (without class-token condition separation) could, however, lead to worse results, something like the 'hyperspherical energy divergence'; it gets really unstable.

With EMA, theoretically it should be okay either way.

(We're still experimenting and checking more stuff, but we are fairly sure about this.)

dxqbYD commented 1 week ago

Thanks for your reply. To translate this for users: If you are using any optimizer with beta1 > 0, which is basically everything except Adafactor, it shouldn't matter much how the batches are organized.

I'm having good experimental results with Adafactor on Flux too, but I'm not doubting that there might be cases where it works less well with SGD-only updates when regularization is in a separate batch.
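
To illustrate the momentum argument: with beta1 > 0 the optimizer keeps an exponential moving average of the gradients, so alternating training-only and regularization-only batches ends up steering the weights in roughly the same direction as putting both kinds of samples in one batch. A tiny numerical sketch (illustrative only, not trainer code):

import torch

torch.manual_seed(0)
g_train = torch.randn(4)   # stand-in gradient from a training batch
g_reg = torch.randn(4)     # stand-in gradient from a regularization batch
beta1 = 0.9

m = torch.zeros(4)
for step in range(200):
    g = g_train if step % 2 == 0 else g_reg   # alternate separate batches
    m = beta1 * m + (1 - beta1) * g           # Adam-style first-moment EMA

print(m)                       # hovers close to the mean of the two gradients
print((g_train + g_reg) / 2)   # what a mixed batch would contribute directly

With pure SGD (beta1 = 0) each step follows only the current batch's gradient, which is where a separate regularization batch could in principle behave differently.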

FurkanGozukara commented 1 week ago

My results failed completely for some reason - I used the fast branch, but I think it already has this code merged.

I used the dataset config below and 5200 real man images as regularization images (collected from Unsplash), with 1 repeat.

I used 28 training images, set to 200 repeats.

So I did 10800 steps in total; I assume 5200 were for the reg images and 5600 for the training images.

I didn't use any other captions, tokens, etc., only:

ohwx and man

(image: dataset config)

1: reduced likeness - the output without reg images is way better and works as expected

2: zero improvement in bleeding

3: all of the dataset still turned into photos of me

kohya-ss commented 1 week ago

Thank you everyone for sharing your valuable insights.

I hadn't considered the batch composition at all, but if it doesn't have an impact when using standard optimizers, it seems this shouldn't be an issue.

Let me add one note: in the current implementation of sd-scripts, the batch composition of datasets with regard to samples with and without trigger tokens is completely random. This means that with a batch size of 2, you could have any combination: "with/with", "with/without", or "without/without". This is to simplify the dataset implementation, which includes features like aspect ratio bucketing.

Based on the discussion so far, I believe this random batch composition should not cause any problems.

kohya-ss commented 1 week ago

@FurkanGozukara Have you set --prior_loss_weight to a large value? The default of 1 has little effect on DOP. Please try 100 or so.

FurkanGozukara commented 1 week ago

@FurkanGozukara Have you set --prior_loss_weight to a large value? The default of 1 has little effect on DOP. Please try 100 or so.

I tried 1; it was Flux.

So do you still say 100? Or 10?

OK, started testing.

1 vs 10 vs 100 :) also testing LoKr

dxqbYD commented 1 week ago

You can try higher values, but my guess is that something else must be wrong. I've always used 1 on Flux, and the effect is so strong it is undeniable. The SimpleTuner author has published one of my samples here; this was with weight 1: https://www.reddit.com/r/StableDiffusion/comments/1g2i13s/simpletuner_v112_now_with_masked_loss_training

Look at the second column and compare it with the rest.

FurkanGozukara commented 1 week ago

Yesterday I started trainings with prior_loss_weight 1, 10, and 100.

prior_loss_weight 10 is definitely preventing the bleeding issue, but my character (myself) is no longer learned properly - still training though, at epoch 100 right now.

And prior_loss_weight 100 is like the model didn't learn anything at all.

I am testing LoKr as well, but it sacrifices quality - I didn't test the reg-images method with LoKr.

heinrichI commented 3 days ago

The following should work now:

[[datasets]]
resolution = 512
batch_size = 4
enable_bucket = true

[[datasets.subsets]]
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_default"
conditioning_data_dir = "/home/recris/sd_train/test/export/samples_default/mask"

[[datasets]]
resolution = 512
batch_size = 4
enable_bucket = true

[[datasets.subsets]]
is_reg = true
num_repeats = 1
image_dir = "/home/recris/sd_train/test/export/samples_reg"
custom_attributes.diff_output_preservation = true

This doesn't work for me. datasetMeMaskReg.toml:

[general]
shuffle_caption = false
caption_extension = '.txt'
keep_tokens = 1

[[datasets]]
batch_size = 2
enable_bucket = true
resolution = [1024, 1024]

  [[datasets.subsets]]
  image_dir = 'j:\Train'
  num_repeats = 1
  conditioning_data_dir = 'j:\TrainMask'

[[datasets]]
batch_size = 2
enable_bucket = true
resolution = [1024, 1024]

  [[datasets.subsets]]
  is_reg = true
  image_dir = 'j:\RegulizationPeople'
  num_repeats = 2
  custom_attributes.diff_output_preservation = true

And the output in the process:

num train images * repeats / 学習画像の数×繰り返し回数: 102
num reg images / 正則化画像の数: 178
num batches per epoch / 1epochのバッチ数: 55
num epochs / epoch数: 100
batch size per device / バッチサイズ: 2, 2
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 5500

As you can see, the number of batches equals half of the training images, and the regularization images are completely ignored.