kohya-ss / sd-scripts

Apache License 2.0

sdxl_train_network.py: Lora trains poorly and produces terrible results #820

Closed. drimeF0 closed this issue 11 months ago.

drimeF0 commented 1 year ago

I tried 10 times to train a LoRA on Kaggle and Google Colab, and each time the training results were terrible, even after 5000 training steps on 50 images. I use this sequence of commands:

%cd /content/kohya_ss/finetune
!python3 merge_captions_to_metadata.py --caption_extention .txt /content/dataset /content/dataset.json
!python3 prepare_buckets_latents.py --bucket_reso_steps 64 --min_bucket_reso 1024 --max_bucket_reso 1024 /content/dataset /content/dataset.json /content/dataset_lat.json "femboysLover/blue_pencil-fp16-XL"
%%writefile /content/prompt.txt
masterpiece, best quality, one eye closed, solo, 1girl --w 1024 --h 1024 --d 123
%cd /content/kohya_ss
!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json" --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="16"  --network_dim "32" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="0.0001" --lowram --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="5000" --save_every_n_steps="100" --mixed_precision="fp16" --save_precision="fp16" --seed="12345"  --optimizer_type="AdamW8bit" --min_snr_gamma=5 --mem_eff_attn --gradient_checkpointing --full_fp16 --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --sample_every_n_steps="50" --network_train_unet_only --cache_text_encoder_outputs

I noticed this in the sdxl_train_network execution log:

number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (256, 1024), count: 8
bucket 1: resolution (512, 512), count: 42
mean ar error (without repeats): 0.0

Could this break LoRA? And if so, how to fix it?

RockTheCosmos commented 1 year ago

You don't want to train SDXL with 256x1024 and 512x512 images; those are too small. You should use 1024x1024 resolution for 1:1 aspect ratio and 512x2048 for 1:4 aspect ratio. Did you disable upscaling bucket resolutions?
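
For context, both of those suggested sizes (1024x1024 and 512x2048) sit on the roughly one-megapixel budget SDXL is trained around, while the 256x1024 and 512x512 buckets in the log above fall far below it. A minimal sketch of that idea (illustrative only, not the actual bucketing code in sd-scripts):

```python
# Illustrative only: enumerate bucket shapes that keep roughly the
# 1024*1024 pixel budget, stepping each side in multiples of 64.
TARGET_PIXELS = 1024 * 1024
STEP = 64

for w in range(512, 2048 + 1, STEP):
    h = (TARGET_PIXELS // w) // STEP * STEP  # largest multiple of 64 within the budget
    if 512 <= h <= 2048:
        print(f"{w}x{h}  (~{w * h / 1e6:.2f} MP, AR {w / h:.2f})")
# outputs include 512x2048, 768x1344, 832x1216, 1024x1024, 1216x832, 2048x512
```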

drimeF0 commented 1 year ago

> You don't want to train SDXL with 256x1024 and 512x512 images; those are too small. You should use 1024x1024 resolution for 1:1 aspect ratio and 512x2048 for 1:4 aspect ratio. Did you disable upscaling bucket resolutions?

I fixed the bucket resolution.

%cd /content/kohya_ss/finetune
!python3 prepare_buckets_latents.py --bucket_reso_steps 64 --min_bucket_reso 1024 --max_bucket_reso 1024 --max_resolution 1024,1024 /content/dataset /content/dataset.json /content/dataset_lat.json "femboysLover/blue_pencil-fp16-XL"
%cd /content/kohya_ss
!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json"
--network_weights "/content/out/last-step00000050.safetensors"
--pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL"
--train_data_dir="/content/dataset" --resolution="1024,1024"
--output_dir="/content/out" --network_alpha="16" --network_dim "32"
--save_model_as=safetensors --network_module=networks.lora
--output_name="last" --no_half_vae --learning_rate="5e-5" --lowram
--lr_scheduler="constant" --train_batch_size="4" --max_train_steps="5000"
--save_every_n_steps="50" --mixed_precision="fp16" --save_precision="fp16"
--seed="12345" --optimizer_type="AdamW8bit" --min_snr_gamma=5
--mem_eff_attn --gradient_checkpointing --full_fp16 --xformers
--sample_sampler=euler_a --sample_prompts="/content/prompt.txt"
--sample_every_n_steps="50" --network_train_unet_only
--cache_text_encoder_outputs

The result is still terrible: with a high LR the output turns into some kind of mess, and with a low LR the model simply degrades slowly and the style changes. I have tried different datasets; currently, as a test, I'm using 48 images from safebooru for the one_eye_closed tag.

drimeF0 commented 1 year ago

Here is an example image at 600 training steps: [image]

%cd /content/kohya_ss
!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json"  --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="32"  --network_dim "64" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="1.0" --scale_weight_norms=1 --lr_scheduler="adafactor" --lr_scheduler_num_cycles="1" --train_batch_size="4" --max_train_steps="5000" --save_every_n_steps="50" --mixed_precision="fp16" --save_precision="fp16" --seed="12345"  --optimizer_type="adafactor" --mem_eff_attn --gradient_checkpointing --full_fp16 --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --sample_every_n_steps="50" --network_train_unet_only --cache_text_encoder_outputs --lowram

RockTheCosmos commented 1 year ago

Can you provide an example of what it looks like when you train at .0001 learning rate?

drimeF0 commented 1 year ago

At a learning rate of .0001:

50 steps: [image]

200 steps: [image]

%cd /content/kohya_ss
!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json"  --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="32"  --network_dim "64" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="0.0001" --scale_weight_norms=1 --lr_scheduler="adafactor" --lr_scheduler_num_cycles="1" --train_batch_size="4" --max_train_steps="5000" --save_every_n_steps="50" --mixed_precision="fp16" --save_precision="fp16" --seed="12345"  --optimizer_type="adafactor" --mem_eff_attn --gradient_checkpointing --full_fp16 --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --sample_every_n_steps="50" --network_train_unet_only --cache_text_encoder_outputs --lowram

RockTheCosmos commented 1 year ago

I've never used the full_fp16 or lowram settings before, so I don't know if those could be negatively affecting your results. Are you using the "#_trigger class" naming convention in your image inputs folder?

drimeF0 commented 1 year ago

> I've never used the full_fp16 or lowram settings before, so I don't know if those could be negatively affecting your results. Are you using the "#_trigger class" naming convention in your image inputs folder?

No, I don't use it; as far as I know that is for DreamBooth training, and I train the LoRA based on tags from safebooru.

kohya-ss commented 1 year ago

Could you please test with the basic settings, such as AdamW optimizer, constant scheduler, network_alpha=1, learning rate=1e-4?

drimeF0 commented 1 year ago

> Could you please test with the basic settings, such as AdamW optimizer, constant scheduler, network_alpha=1, learning rate=1e-4?

NaN loss and black images: [screenshot]

%cd /content/kohya_ss
!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json"  --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="1"  --network_dim "64" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="1e-4" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="5000" --save_every_n_steps="50" --mixed_precision="fp16" --save_precision="fp16" --seed="12345"  --optimizer_type="adamw" --mem_eff_attn --gradient_checkpointing --full_fp16 --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --sample_every_n_steps="50" --network_train_unet_only --cache_text_encoder_outputs --lowram

kohya-ss commented 1 year ago

> NaN loss and black images

fp16 training (mixed_precision, save_precision and full_fp16) seemed to cause the NaN issue. Please use bf16 instead of fp16. accelerate config is also needed.
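
A rough sketch of what that change could look like against the command above (illustrative only; verify the flags against your version of the scripts, and note it assumes a GPU that actually supports bf16). The precision flags are switched to bf16, --full_fp16 is dropped, and the script is launched through accelerate after writing a bf16 config:

!accelerate config default --mixed_precision bf16
%cd /content/kohya_ss
!accelerate launch "./sdxl_train_network.py" --in_json "/content/dataset_lat.json" --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="1" --network_dim "64" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="1e-4" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="5000" --save_every_n_steps="50" --mixed_precision="bf16" --save_precision="bf16" --seed="12345" --optimizer_type="adamw" --mem_eff_attn --gradient_checkpointing --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --sample_every_n_steps="50" --network_train_unet_only --cache_text_encoder_outputs --lowram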

drimeF0 commented 1 year ago

> bf16

It seems Google Colab with a T4 does not support bf16: [screenshots]

drimeF0 commented 1 year ago

What about --scale_weight_norms=1? I've heard that this can solve the NaN loss problem. I tested it with different values and it didn't work.

DKnight54 commented 1 year ago

I'm guessing you are using this implementation in Colab.

A couple of things I'd like to check:

  1. What is the original resolution of the images you are using?
  2. You seem to have selected "cache_text_encoder_outputs"; have you tried training without that?
  3. You mentioned Dreambooth training, but AFAIK that needs regularization images, which I haven't been able to get working yet. (If you did, TEACH ME SEMPAI)
  4. Can you paste the output from the "Bucketing and Latents Caching" step?
  5. Have you tried setting a VAE?
  6. You mentioned testing a few datasets; have you tried different models?
  7. I think you are training on the model you'd like to use. Perhaps try training on the base SDXL model, then apply the LoRA on that model?
  8. It seems like you are trying to train the "one eye closed" concept. If that's the case, I'm under the impression that for best results you'd need a sample size larger than 50 images, and to carefully tag everything about the images. (I like to refer to [this resource](https://rentry.org/59xed3) as a guide)

I've been getting decent results on my first try; here is my bucketing and latent caching output (ignore the total numbers, I'm obsessed and went overboard):

Found 379 images.
Creating a new metadata file
Merging tags and captions into metadata json.
100% 379/379 [00:24<00:00, 15.23it/s]
No captions found for any of the 379 images
All 379 images have tags
Cleaning captions and tags.
100% 379/379 [00:00<00:00, 3465.12it/s]
Writing metadata: /content/LoRA/meta_clean.json
Done!
found 379 images.
loading existing metadata: /content/LoRA/meta_clean.json
load VAE: /content/vae/sdxl_vae.safetensors
100% 379/379 [00:15<00:00, 24.25it/s]
bucket 0 (448, 1024): 20
bucket 1 (512, 1024): 2
bucket 2 (576, 1024): 39
bucket 3 (640, 1024): 1
bucket 4 (704, 1024): 51
bucket 5 (768, 1024): 92
bucket 6 (832, 1024): 1
bucket 7 (1024, 448): 1
bucket 8 (1024, 576): 2
bucket 9 (1024, 704): 7
bucket 10 (1024, 768): 21
bucket 11 (1024, 1024): 142
mean ar error: 0.01277014918625594
writing metadata: /content/LoRA/meta_lat.json
done!

And Training Config output

[sdxl_arguments]
cache_text_encoder_outputs = false
no_half_vae = true
min_timestep = 0
max_timestep = 1000
shuffle_caption = true
lowram = true

[model_arguments]
pretrained_model_name_or_path = "stabilityai/stable-diffusion-xl-base-1.0"
vae = "/content/vae/sdxl_vae.safetensors"

[dataset_arguments]
debug_dataset = false
in_json = "/content/LoRA/meta_lat.json"
train_data_dir = "/content/drive/MyDrive/LoRA/train_2"
dataset_repeats = 5
keep_tokens = 0
resolution = "1024,1024"
color_aug = false
token_warmup_min = 1
token_warmup_step = 0

[training_arguments]
output_dir = "/content/drive/MyDrive/kohya-trainer/output/Blue_Waifu"
output_name = "Blue_Waifu"
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 5
max_token_length = 225
mem_eff_attn = false
sdpa = true
xformers = false
max_train_epochs = 3
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
gradient_checkpointing = true
gradient_accumulation_steps = 1
mixed_precision = "fp16"

[logging_arguments]
log_with = "tensorboard"
logging_dir = "/content/LoRA/logs"
log_prefix = "Blue_Waifu"

[sample_prompt_arguments]
sample_every_n_epochs = 1
sample_sampler = "euler_a"

[saving_arguments]
save_model_as = "safetensors"

[optimizer_arguments]
optimizer_type = "AdaFactor"
learning_rate = 0.0001
max_grad_norm = 0
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100

[additional_network_arguments]
no_metadata = false
network_module = "networks.lora"
network_dim = 32
network_alpha = 16
network_args = []
network_train_unet_only = true

[advanced_training_config]
save_state = false
save_last_n_epochs_state = false
multires_noise_iterations = 6
multires_noise_discount = 0.3
caption_dropout_rate = 0
caption_tag_dropout_rate = 0.1
caption_dropout_every_n_epochs = 0
min_snr_gamma = 5

[prompt]
negative_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, "
width = 1024
height = 1024
scale = 12
sample_steps = 28

[[prompt.subset]]
prompt = "masterpiece, best quality, face focus, cute, 1girl, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck"

drimeF0 commented 1 year ago

> I'm guessing you are using this implementation in Colab.
>
> A couple of things I'd like to check:
>
> 1. What is the original resolution of the images you are using?
> 2. You seem to have selected "cache_text_encoder_outputs"; have you tried training without that?
> 3. You mentioned Dreambooth training, but AFAIK that needs regularization images, which I haven't been able to get working yet. (If you did, TEACH ME SEMPAI)
> 4. Can you paste the output from the "Bucketing and Latents Caching" step?
> 5. Have you tried setting a VAE?
> 6. You mentioned testing a few datasets; have you tried different models?
> 7. I think you are training on the model you'd like to use. Perhaps try training on the base SDXL model, then apply the LoRA on that model?
> 8. It seems like you are trying to train the "one eye closed" concept. If that's the case, I'm under the impression that for best results you'd need a sample size larger than 50 images, and to carefully tag everything about the images. (I like to refer to [this resource](https://rentry.org/59xed3) as a guide)

> […]

1. Here are the image resolutions from the dataset and their counts:

(2002, 3508): 1,
 (1100, 1100): 1,
 (2160, 3840): 1,
 (2564, 3624): 1,
 (1278, 1278): 1,
 (512, 512): 1,
 (1270, 945): 1,
 (992, 1403): 1,
 (1254, 1771): 2,
 (2432, 3200): 1,
 (1448, 2048): 1,
 (1920, 2560): 1,
 (5787, 5785): 1,
 (1000, 1432): 1,
 (652, 990): 1,
 (1024, 1024): 2,
 (1542, 2041): 1,
 (2300, 3800): 1,
 (1329, 1873): 1,
 (888, 1013): 1,
 (844, 1341): 1,
 (690, 930): 1,
 (775, 1100): 1,
 (900, 1200): 1,
 (2353, 4093): 1,
 (826, 1200): 1,
 (562, 800): 1,
 (700, 1050): 1,
 (2083, 3542): 1,
 (1250, 1761): 1,
 (800, 1225): 1,
 (2591, 3624): 1,
 (1200, 1389): 1,
 (1576, 3745): 1,
 (1600, 2400): 1,
 (581, 841): 1,
 (1507, 1653): 1,
 (1530, 2047): 1,
 (1447, 2047): 1
  2. Yes.
  3. I tried DreamBooth without regularization; the result is the same as training with tags.
  4. /content/kohya_ss/finetune
    2023-09-21 07:51:49.881718: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    found 41 images.
    loading existing metadata: /content/dataset.json
    load VAE: femboysLover/blue_pencil-fp16-XL
    exception occurs in loading vae: femboysLover/blue_pencil-fp16-XL does not appear to have a file named config.json.
    retry with subfolder='vae'
    Downloading (…)main/vae/config.json: 100% 602/602 [00:00<00:00, 3.40MB/s]
    Downloading (…)ch_model.safetensors: 100% 167M/167M [00:00<00:00, 201MB/s]
    The config attributes {'force_upcast': True} were passed to AutoencoderKL, but are not expected and will be ignored. Please verify your config.json configuration file.
    83% 34/41 [00:37<00:07,  1.10s/it]/usr/local/lib/python3.10/dist-packages/PIL/Image.py:996: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
    warnings.warn(
    100% 41/41 [00:43<00:00,  1.07s/it]
    bucket 0 (768, 1024): 31
    bucket 1 (896, 1024): 2
    bucket 2 (960, 1024): 1
    bucket 3 (1024, 768): 1
    bucket 4 (1024, 1024): 6
    mean ar error: 0.05769353115186938
    writing metadata: /content/dataset_lat.json
    done!
  5. Yes.
  6. I also tried DreamShaperXL, animagine-xl, counterfeitXL, DeepBlue.
  7. I'll try, but most likely the result will also be terrible.
  8. I tried both 100 and 200 images, but in the end the LoRA always learned to break the generated image rather than the desired concept.

FurkanGozukara commented 1 year ago

I am using the same settings as in this video of mine - all 1024x1024, very easy training with 13 pictures of myself.

Repeating 40 and training up to 8 epochs, all epochs came out overtrained.

The results are super overtrained and look nothing like the training subject.

Something is seriously broken with SDXL LoRA.

Attached the training config as txt: best_settings_32_rank_lora.txt

Doing more tests to figure out the issue. DreamBooth training, on the other hand, is working amazingly well.

Here is a tweet where I compared results: https://twitter.com/GozukaraFurkan/status/1704625590437814424

Become A Master Of SDXL Training With Kohya SS LoRAs - Combine Power Of Automatic1111 & SDXL LoRAs


FurkanGozukara commented 1 year ago

Something is very broken with LoRA training.

5e-5 learning rate:

https://twitter.com/GozukaraFurkan/status/1704867030761984091

I am going to test the commit from my tutorial and let you know.

DKnight54 commented 1 year ago

@drimeF0, One thing I'm noticing in your bucket is that the images are heavily skewed towards 768 x 1024 with 31 images and only 6 for the 1024 x 1024, and the samples you are generating are 1024 x 1024. This may be resulting in the 1024 x 1024 images being undertrained with insufficient samples.

Can you try generating images with 768 x 1024 dimensions? I suspect the results there may be more decent. One thing I've been doing in a personal copy of the colab has been modifying the code so that max_bucket_reso is 1536 and min_bucket_reso is 640, to try to get a bucket distribution that's more in line with what is recommended for SDXL. I'm reluctant to share my version for now because it's a hacked-together version that does dreambooth with regularization and there are parts that are still rough and inelegant, mainly because I want it to work on the free tier. (TL;DR: in order to get things to run without overrunning the usage time restrictions, bucketing and latent processing is done as a manual, folder-by-folder step.)
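
For reference, a sketch of the bucketing step with those limits, reusing the paths and flags from the commands earlier in this thread (values are illustrative):

%cd /content/kohya_ss/finetune
!python3 prepare_buckets_latents.py --bucket_reso_steps 64 --min_bucket_reso 640 --max_bucket_reso 1536 --max_resolution 1024,1024 /content/dataset /content/dataset.json /content/dataset_lat.json "femboysLover/blue_pencil-fp16-XL"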

One thing I did for my dataset was cropping and resizing the majority of my images into the recommended SDXL training image sizes, cribbed from this tutorial by our good friend @FurkanGozukara.

Since in the training process images will only be trained against images in the same bucket, one thing I did was crop and rescale the same image into multiple sizes corresponding to the respective aspect ratios. You will note that I got lazy towards the end and skipped that for several images, resulting in buckets with 1 image in them.

If the images generated at 768 x 1024 are acceptable (you might have to retrain the LoRA at the standard learning rate of 1e-4 in order to avoid overcooked results), then a quick and dirty way to get the results you want may be to crop and resize most of your images to 1024 x 1024 and use that as your dataset.
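
For the quick-and-dirty route, a minimal Python sketch of that crop-and-resize pass (the paths are placeholders, not anything from this thread; remember to also keep the matching .txt caption files next to the resized images):

```python
# Center-crop every image to a square and resize to 1024x1024 so that
# everything lands in the 1024x1024 bucket.
from pathlib import Path
from PIL import Image

src = Path("/content/dataset")       # assumed input folder
dst = Path("/content/dataset_1024")  # assumed output folder
dst.mkdir(exist_ok=True)

for p in src.iterdir():
    if p.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    img = Image.open(p).convert("RGB")
    w, h = img.size
    side = min(w, h)                              # side length of the square crop
    left, top = (w - side) // 2, (h - side) // 2  # center the crop
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((1024, 1024), Image.LANCZOS)
    img.save(dst / (p.stem + ".png"))
```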

DKnight54 commented 1 year ago

Oh, one more thing. I'm not exactly sure, but it seems like the model you are using for base training does not have a built-in VAE. Not sure if that will affect the quality of the latents.
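
If you want to rule the VAE out, a rough sketch (the VAE path is an assumption borrowed from DKnight54's config above; check that your copy of the scripts accepts these arguments) would be to pass the standalone SDXL VAE when caching latents and again at train time:

%cd /content/kohya_ss/finetune
!python3 prepare_buckets_latents.py --bucket_reso_steps 64 --min_bucket_reso 640 --max_bucket_reso 1536 --max_resolution 1024,1024 /content/dataset /content/dataset.json /content/dataset_lat.json "/content/vae/sdxl_vae.safetensors"

and then add --vae "/content/vae/sdxl_vae.safetensors" to the sdxl_train_network.py command as well.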

DKnight54 commented 1 year ago

@FurkanGozukara Maybe something for you to test, but my experience with training on 379 images with 4 to 5 repeats actually got me pretty accurate results (and maybe slightly overtrained, as the images all became photorealistic unless I reduced the weight of the word "photo") within 3 to 5 epochs.

FurkanGozukara commented 1 year ago

I found my error.

I was using the SDXL 0.9 beta release, which has different weights and is FP32 :)

Ignore my message.

drimeF0 commented 1 year ago

> @drimeF0, One thing I'm noticing in your bucket is that the images are heavily skewed towards 768 x 1024 with 31 images and only 6 for the 1024 x 1024, and the samples you are generating are 1024 x 1024. This may be resulting in the 1024 x 1024 images being undertrained with insufficient samples.
>
> […]

I already tried bucket size 1024x1024; the results are the same.

DKnight54 commented 1 year ago

@drimeF0, I have to admit I'm stumped. I set out to try to recreate your issue in this colab implementation, following the majority of its default settings, and was able to get results like this: [sample image] prompt = "masterpiece, best quality, one eye closed, solo, 1girl"

I would note that I did modify the buckets to have a min_bucket_reso of 640 and a max_bucket_reso of 1536, but aside from that, the only differences I can note are the following:

  1. This implementation defaults to AdaFactor optimizer which gave me good results with the default settings.
  2. This implementation follows the repeats x epoch method of doing training, while I think I see you using steps instead.

For further reference, this result came from training on 98 images scraped from safebooru with the tags "one_eye_closed, 1girl, absurdres", using the scraped tags as captions and the clean_caption option at step 3.4, Bucketing and Latents Caching.

The results are from the 4th epoch, with the training images repeated 10 times each epoch at batch size 5.
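
(For rough numbers, that is about 98 images × 10 repeats ÷ batch size 5 ≈ 196 optimizer steps per epoch, so around 780 steps by the 4th epoch, ignoring bucket rounding.)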

I think... in order for me to try troubleshooting more, I probably need to look at how you are implementing the training in your environment. And even then, I may not be able to help if it involves tweaking Kohya's code.

(Edited to include sample prompt)

DKnight54 commented 1 year ago

Side note, I have similar issues where the LoRA keeps outputting both eyes closed. I believe that in order to fix this issue, we would need to expand the training data set to include "eyes_closed" images where both eyes are closed, and images where both eyes are open, for the LoRA to learn the difference.

drimeF0 commented 12 months ago

> Side note, I have similar issues where the LoRA keeps outputting both eyes closed. I believe that in order to fix this issue, we would need to expand the training data set to include "eyes_closed" images where both eyes are closed, and images where both eyes are open, for the LoRA to learn the difference.

Hi, sorry for taking so long to respond. Here's the notebook: https://colab.research.google.com/drive/14BnL_yiyVGs8WFovba1t3UjcjnZiVrpi?usp=sharing

DKnight54 commented 12 months ago

@drimeF0 I made a copy and modified it with the following changes:

  1. added recursive metadata and bucketing to support multiple concepts
  2. Changed min_bucket_reso to 640 and max_bucket_reso to 1536
  3. used the SDXL VAE for latents and training
  4. changed from steps to using repeats+epoch

I'm still running my initial test with three separate concepts on this modified version. An earlier attempt with only eyes_closed and one_eye_closed is still getting me both eyes closed @@ [image]

  1. eyes_open: -one_eye_closed, -eyes_closed, solo, 1girl , highres
  2. eyes_closed: eyes_closed, solo, 1girl , highres
  3. one_eye_closed: one_eye_closed, solo, 1girl , highres

https://colab.research.google.com/drive/1j8y5AvtdiXl4_8CHk3wd7xxre0u5Vsxf?usp=sharing

I'm getting the impression that this notebook was originally meant to run on Kaggle, since on the Google Colab free tier, unless you are monitoring closely, all the work is lost when Google happily disconnects me.

DKnight54 commented 12 months ago

While I wasn't successful at training the concept (which may require training the text encoder as well, possibly better tagging of the dataset, maybe trying SDP instead of xformers, just more training, or, as mentioned below, replacing the keyword tag with wink), I managed to hit step 579 (Epoch 3) without the model collapsing.

As a matter of fact, it managed to maintain a very coherent style! (Not sure if this is the normal style expected from this model, as I haven't really played with it before.)

Epoch 1 (step 193): [image]

Epoch 2 (step 386): [image]

Epoch 3 (step 579): [images] prompt: masterpiece, best quality, one eye closed, solo, 1girl --w 768 --h 1024 --d 123

At step 579, I also generated a couple more samples, and they still seem to carry the default styling. [images] prompt: masterpiece, best quality, one eye closed, solo, 1girl --w 1024 --h 1024

[images] prompt: masterpiece, best quality, one eye closed, solo, 1girl --w 832 --h 1216

I'm not likely to get more out of the training at this point due to running it on free Colab, and while I can't think of how to better help you train the specific one eye closed concept (hrm... maybe replace one eye closed with wink? it might be a more natural concept to train), I hope that I've at least helped you troubleshoot the model collapse issue.

drimeF0 commented 12 months ago

> While I wasn't successful at training the concept (which may require training the text encoder as well, possibly better tagging of the dataset, maybe trying SDP instead of xformers, just more training, or, as mentioned below, replacing the keyword tag with wink), I managed to hit step 579 (Epoch 3) without the model collapsing.
>
> […]

Thank you very much for your help with the model collapse during training. I will later try to move the training code to Kaggle and run it at a lower LR for 5000 training steps; in addition, I will try various datasets.

DKnight54 commented 12 months ago

@drimeF0, I suspect that the root of the issue may have something to do with using steps instead of epoch/repeats for training. If I can make a suggestion, instead of using 5000 training steps, try increasing the number of epochs or repeats.

To make the changes, you can edit the --max_train_epochs "8" and --dataset_repeats "5"

I would recommend setting dataset_repeats to 10, and increasing the epochs until you get the results you desire.
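
A rough sketch of that change applied to the training command from the top of the thread: the step flags are swapped for epoch/repeat flags, and everything else (including the fp16 precision flags, which you may want to revisit per the earlier discussion) is carried over only for illustration:

!python3 "./sdxl_train_network.py" --in_json "/content/dataset_lat.json" --pretrained_model_name_or_path="femboysLover/blue_pencil-fp16-XL" --train_data_dir="/content/dataset" --resolution="1024,1024" --output_dir="/content/out" --network_alpha="16" --network_dim "32" --save_model_as=safetensors --network_module=networks.lora --output_name="last" --no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="4" --dataset_repeats "10" --max_train_epochs "8" --save_every_n_epochs "1" --mixed_precision="fp16" --save_precision="fp16" --seed="12345" --optimizer_type="AdamW8bit" --gradient_checkpointing --xformers --sample_sampler=euler_a --sample_prompts="/content/prompt.txt" --network_train_unet_only --cache_text_encoder_outputs

With 48 images, 10 repeats and batch size 4, that is roughly 120 steps per epoch, or about 960 steps over 8 epochs.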

drimeF0 commented 11 months ago

> @drimeF0, I suspect that the root of the issue may have something to do with using steps instead of epoch/repeats for training. If I can make a suggestion, instead of using 5000 training steps, try increasing the number of epochs or repeats.
>
> To make the changes, you can edit the --max_train_epochs "8" and --dataset_repeats "5"
>
> I would recommend setting dataset_repeats to 10, and increasing the epochs until you get the results you desire.

With learning_rate set to 0.0002 the LoRA learns very well: [images]

thank you again for your help with the LoRA model training script