FurkanGozukara opened 1 year ago
I manually set train text encoder to true and added --stop_text_encoder_training 999999
But the LoRA extractor still says the Text Encoder is the same
I could reproduce the issue with the same and some other settings.
I also trained with the previous version, tag v0.6.6, and the Text Encoder is trained. train_db.py is almost identical in both versions, so I think the most likely cause is one or some of the dependent libraries. I will look into it soon. However, since it means that there is probably nothing wrong with train_db.py, it may take some time to find the cause.
Thank you so much, looking forward to a solution. I am pretty sure one of transformers, diffusers, or accelerate is broken
You are doing an incredible job
I hope so too, but if there is something wrong with my script, I apologize.
SDXL text encoder is also not trained
sadly no version of SDXL is training the text encoder :(
I couldn't find a working version with bmaltais/kohya_ss
edit: the 3-month-old sdxl branch is working for some reason
Just add --train_text_encoder as an extra parameter and it will train the TE. I think this is the intended behavior. As for extracting the LoRA from the Dreambooth model: if the TE has been trained enough to be different, it will be extracted, but you can force the extraction by changing the value here (Kohya GUI, but you can specify it on the command line, no worries).
You'll then get this and it will be extracted as expected
i tested with 0.01 and 0.004 both same
learning rate 1e-5 4160 steps
still same
when i make it 0.0001 it shows a very tiny difference, but this seems wrong to me
I will test with adding the train_text_encoder command too, ty
by the way, the difference for Stable Diffusion 1.5 is also very small, any ideas?
it is 0.0009 - 4160 steps 1e-6 LR
i am using adafactor
I am testing realistic vision 2 on ShivamShriraoDreamBooth colab
I wonder how much text encoder difference it will have
very low LR 4e-7 - 2080 steps
I have tested with my dataset, the AdamW 8bit optimizer, and various learning rates. I found:
So I believe the scripts and the libraries are fine. However, I don't know why the same settings as before would produce different training results for the Text Encoder.
I wrote another script to compare Text Encoder weights. You will find that embeddings.token_embedding and some norm weights and biases have a larger difference than the attention layers. The LoRA extraction script only takes care of the attn layers, so it determines the two Text Encoders are the same.
import argparse
import torch
from safetensors.torch import load_file

parser = argparse.ArgumentParser()
parser.add_argument("model1", help="path to model1")
parser.add_argument("model2", help="path to model2")
parser.add_argument("--rtol", type=float, default=1e-8, help="relative tolerance")
parser.add_argument("--atol", type=float, default=1e-6, help="absolute tolerance")
parser.add_argument("--bf16", action="store_true", help="use bf16 instead of fp32")
args = parser.parse_args()

model1_path = args.model1
model2_path = args.model2

# Load safetensors or checkpoint from each model path
print("loading models...")
if model1_path.endswith(".safetensors"):
    model1_sd = load_file(model1_path)
else:
    model1_sd = torch.load(model1_path)
if model2_path.endswith(".safetensors"):
    model2_sd = load_file(model2_path)
else:
    model2_sd = torch.load(model2_path)

if "state_dict" in model1_sd:
    model1_sd = model1_sd["state_dict"]
if "state_dict" in model2_sd:
    model2_sd = model2_sd["state_dict"]

# Compare the weights of each model
prefix_to_compare = "cond_stage_model"

print("comparing weights...")
print(f"key,\tall_close,\tmax_diff,\tmean_diff,\tmax_value1,\tmin_value1")
for key in model1_sd.keys():
    if key.startswith(prefix_to_compare):
        if key not in model2_sd:
            print(f"*** Key {key} not found in model2")
            continue
        if model1_sd[key].dtype == torch.long:
            # doesn't compare position ids
            # diff = torch.sum(model1_sd[key] != model2_sd[key])
            # print(f"*** {key}: long, {diff} different values")
            continue

        model1_value = model1_sd[key]
        model2_value = model2_sd[key]
        if args.bf16:
            model1_value = model1_value.to(torch.bfloat16)
            model2_value = model2_value.to(torch.bfloat16)
        model1_value = model1_value.to(torch.float32)
        model2_value = model2_value.to(torch.float32)

        all_close = torch.allclose(model1_value, model2_value, rtol=args.rtol, atol=args.atol)
        diff = torch.abs(model1_sd[key] - model2_sd[key])
        print(
            f"{key},\t{all_close},\t{torch.max(diff)},\t{torch.mean(diff)},\t{torch.max(model1_sd[key])},\t{torch.min(model1_sd[key])}"
        )
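For reference, the comparison script above can be run from the command line like this (assuming it is saved as compare_te.py; the file name and checkpoint paths are only placeholders):

python compare_te.py model_before.safetensors model_after.safetensors
# optionally compare at bf16 precision with a looser tolerance
python compare_te.py model_before.safetensors model_after.safetensors --bf16 --atol 1e-4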
@kohya-ss thank you so much
can we say that setting higher text encoder learning rate can be more beneficial in this case?
can we already give a different LR for the text encoder when doing SD 1.5 or SDXL training?
afaik it doesn't have a way to specify an LR for the TE.
I may have found the problem, which can be divided into two parts:
1. The initial loss values of SD1.5 training are different, which is related to line 1047 in library\model_util.py. If we change
# logging.set_verbosity_error()  # don't show annoying warning
# text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
# logging.set_verbosity_warning()
# print(f"config: {text_model.config}")
cfg = CLIPTextConfig(
    vocab_size=49408,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=77,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-05,
    dropout=0.0,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
    model_type="clip_text_model",
    projection_dim=768,
    torch_dtype="float32",
)
text_model = CLIPTextModel._from_config(cfg)
back to
logging.set_verbosity_error()  # don't show annoying warning
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
logging.set_verbosity_warning()
print(f"config: {text_model.config}")
# cfg = CLIPTextConfig(
#     vocab_size=49408,
#     hidden_size=768,
#     intermediate_size=3072,
#     num_hidden_layers=12,
#     num_attention_heads=12,
#     max_position_embeddings=77,
#     hidden_act="quick_gelu",
#     layer_norm_eps=1e-05,
#     dropout=0.0,
#     attention_dropout=0.0,
#     initializer_range=0.02,
#     initializer_factor=1.0,
#     pad_token_id=1,
#     bos_token_id=0,
#     eos_token_id=2,
#     model_type="clip_text_model",
#     projection_dim=768,
#     torch_dtype="float32",
# )
# text_model = CLIPTextModel._from_config(cfg)
then the initial values will be the same.
2. The training process of SD1.5 is different, which is related to line 228 in train_network.py. If we delete the following two lines, the training process will be the same:
if torch.__version__ >= "2.0.0":  # this can be used with xformers that supports PyTorch 2.0.0 and later
    vae.set_use_memory_efficient_attention_xformers(args.xformers)
I had to use a 0.000015 LR for it to show differences in about 8k steps, so it's very slow, but the extracted lora had a working TE and behaved as expected.
Can you provide the commit hash for the working branch?
i think i was mistaken but not sure. i will do more research
this is the branch : https://github.com/bmaltais/kohya_ss/tree/sdxl-dev
can we say that setting higher text encoder learning rate can be more beneficial in this case?
I don't think so. I think the learning rate for Text Encoder should be lower than the learning rate for U-Net in general.
can we already give a different LR for the text encoder when doing SD 1.5 or SDXL training?
Unfortunately, it is not possible for SD 1.5. For SDXL, we can use the --block_lr option. It specifies 23 learning rate values, one for each U-Net block, like --block_lr 1e-4,2e-4,3e-4,4e-4,5e-4,6e-4,7e-4,8e-4,9e-4,0e-4,1e-5,2e-5,3e-5,4e-5,5e-5,6e-5,7e-5,8e-5,9e-5,0e-4,1e-4,2e-4,3e-4.
So if we set this option, the default learning rate is used for the Text Encoder.
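As a side note, a minimal sketch for building such a list without typing 23 values by hand (the 1e-5 value below is purely illustrative, not a recommended setting):

# build a comma-separated string of 23 identical per-block U-Net learning rates
block_lrs = ",".join(["1e-5"] * 23)
print(f"--block_lr {block_lrs}")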
@kohya-ss ty
my text encoder enabled training is about to be completed for SDXL with
--train_text_encoder
with this command it is using exactly the same VRAM, is this expected?
but it is slower, like 32%
1 more question
the DreamBooth extension of Automatic1111 had a "use EMA during training" option - this was significantly increasing VRAM usage but also quality
You don't have that feature?
It sounds like you'd enjoy this repo for training more, it has adjustable lr for text/unet, EMA, masked training, etc. https://github.com/Nerogar/OneTrainer
thanks i should experiment and compare
@kohya-ss any way to set the LR for the text encoder?
it gets cooked super fast :D
https://twitter.com/GozukaraFurkan/status/1710416135747748150
@kohya-ss Not sure if you've noticed, but I just tried extracting a lora from 2 models, which I know for sure have different trained text encoders, and I still got the above "Text Encoder is same" message. I can furthermore confirm the text encoders are different, because each will produce a different image when loaded in comfyUI, see: https://i.imgur.com/xoQpxWo.png
Therefore I think the most likely issue lies simply with extract_lora_from_models.py erroneously thinking the two models are the same.
Edit: More testing; I have edited extract_lora_from_models.py to always pass true for text encoder different.
# Text Encoder might be same
#if not text_encoder_different and torch.max(torch.abs(diff)) > MIN_DIFF:
text_encoder_different = True
print(f"Forcing use of text encoder. {torch.max(torch.abs(diff))} > {MIN_DIFF}")
The resulting lora works way better than before: https://i.imgur.com/VChzcw6.jpeg Left is with skipped TE extract, right with the above modification. The right image is way closer to the style of the trained model.
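For anyone reproducing this, here is a rough sketch of how the extraction script is usually invoked after making such an edit; the flag names below are assumptions based on the script's argparse options, so check python networks/extract_lora_from_models.py --help for the exact arguments:

python networks/extract_lora_from_models.py \
  --model_org base_model.safetensors \
  --model_tuned dreambooth_trained.safetensors \
  --save_to extracted_lora.safetensors \
  --dim 64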
my text encoder enabled training is about to be completed for SDXL with
--train_text_encoder
with this command it is using exactly the same VRAM, is this expected?
but it is slower, like 32%
The --train_text_encoder option should increase VRAM usage. But I have less experience with training Text Encoders, so the results need to be checked.
the DreamBooth extension of Automatic1111 had a "use EMA during training" option - this was significantly increasing VRAM usage but also quality
You don't have that feature?
Unfortunately, there is no EMA feature currently. I would like to support it, but I think other tasks have higher priority. Of course you can use another trainer :)
@kohya-ss any way to set the LR for the text encoder?
it gets cooked super fast :D
As I mentioned on X, we can use the --block_lr option to set LRs for each U-Net block. The default learning rate is used for the Text Encoder.
More testing; I have edited extract_lora_from_models.py to always pass true for text encoder different.
I increased MIN_DIFF in a previous change, but it seems to be too large. I will add an option to set MIN_DIFF soon.
@kohya-ss
I used --block_lr and it works. the text encoder is not cooked anymore. here are some comparisons
https://twitter.com/GozukaraFurkan/status/1710580153665925179
https://twitter.com/GozukaraFurkan/status/1710582243742142532
https://twitter.com/GozukaraFurkan/status/1710609957626810825
I used --block_lr and it works. the text encoder is not cooked anymore. here are some comparisons
That's nice! I don't know the prompt for the images, but I feel the right image might represent the prompt well, for example the style and the background.
I found it difficult to follow the dialogue, because several other things are being discussed as well. Has the Text Encoder problem been fixed for SD 1.5 or not?
The issue I tested still exists in the new version, and the trained LoRA cannot be used. You can try the modifications I mentioned earlier; they may be useful to you.
I compared the config, and there was only one line of difference: torch_dtype="float32" instead of torch_dtype=null.
I guess the part in train_network.py is because of torch 2.0, and that's why it was changed from null to float32 in the config. I don't have any other idea, because I guess the two are related.
I'm now using version 21.8.4 of the GUI, which @FurkanGozukara claims still had good training for SD 1.5 (and I did make good LoRAs with it), and it already had the parameters you describe, so it's more likely that the bug is elsewhere.
I’m not sure where the problem lies, but you might be right.
For me, the so-called correctness is to reproduce the SD1.5 training results from before SDXL was introduced. I found that when the author does not load "openai/clip-vit-large-patch14", the initial training loss is different. And when the author later introduces
if torch.__version__ >= "2.0.0":  # this can be used with xformers that supports PyTorch 2.0.0 and later
    vae.set_use_memory_efficient_attention_xformers(args.xformers)
the trained SD1.5 lora will be completely damaged.
As for what you said about torch_dtype="float32": at this point we have already abandoned the reference to "openai/clip-vit-large-patch14", and the training results are already different from before.
i am not sure but SDXL training is far superior atm
here you can see my pictures : i shared 180+ : https://civitai.com/user/SECourses
best config : https://www.patreon.com/posts/89213064
quick tutorial : https://www.youtube.com/watch?v=EEV8RPohsbw
@AIEXAAA I looked at this link and matched the parameters to the ones in the .py file and only that one line is different. Since torch 2.x has been made the default in the new kohya versions, I assume there is a correlation. https://huggingface.co/openai/clip-vit-large-patch14/blob/main/config.json
@FurkanGozukara thanks, but I want to train SD 1.5, not SDXL.
for sd 1.5 i am still in research
my older tutorial still working great though since it has EMA support too
@AIEXAAA I looked at this link and matched the parameters to the ones in the .py file and only that one line is different. Since torch 2.x has been made the default in the new kohya versions, I assume there is a correlation. https://huggingface.co/openai/clip-vit-large-patch14/blob/main/config.json
I think I roughly understand what you're saying: when print(f"config: {text_model.config}") is displayed, the value is consistent with the author's default cfg, but the problem still results in different outcomes.
As for the PyTorch issue, even if I update to 2.0 or 2.0.1, or even update this training program to the latest version, as long as I modify it in the way I mentioned earlier, the results of SD1.5 lora training are consistent with those from before SDXL was introduced. Therefore, it's hard to assert that it is related to PyTorch 2.0.
I think this issue is already solved, but #890 seems to exist. I will work on #890.
I'm glad you found the source of the problem. Looking forward to the fix! :)
Hello everyone, I am pretty glad to see someone finally was able to identify this issue. I am the creator and founder of Team Crystal Clear. Some of you might be familiar with the name; to others it may be new. This is an issue I myself brought up in August when I first trained Crystal Clear XL. Unfortunately, at the time, everyone I mentioned it to dismissed the fact that I was unable to properly train on kohya due to faulty text encoders and told me this was a me-related issue. So, presented with no other choice, my team and I got to work and fixed the issue so we could properly train Crystal Clear XL. It's not in my nature to make broken releases available, given how most of the work we do is commissions for game developers, the automotive industry, stable diffusion service providers, the brands and apparel industry, Instagram and OnlyFans influencers and models, and many other businesses. This means that pretty much every checkpoint other than CCXL has been trained on broken text encoders since August. Now, I don't have the time to go through all these comments to know whether the issue is fixed or not, but I'm looking forward to seeing how the future changes compare to the ones we made. And as for kohya, you might not remember, but I did bring this up to you on the civitai discord back in August.
I am also doing training for companies. So far I am only using UNET training. Results are great, but with the text encoder working I am hoping we will get even better results
It's great to meet you Furkan. I've always found the research you do and the dedication you have towards stable diffusion, nothing short of outstanding. You are a wonderful content maker and I fully support and recommend your work.
I have been struggling with the faulty text encoder for the last few weeks, and was hoping that it would be fixed with the November 11 [v21.1.1] update, but that does not seem to be the case. I am still getting "Text encoder is same. Extract U-Net only." when extracting LoRAs. Is anyone else having this problem? Found workarounds? Know when it will be fixed?
i am using the bmaltais GUI dev2 branch and SDXL training is working great
SD 1.5 TE is still not good for LoRA training. Yesterday I tried the same training under the 21.8.4 GUI and 22.1.1 (with the updated kohya script) and got completely different results; in the latest version it was overcooked by the third epoch, while in 21.8.4 I got a perfect LoRA.
Are you using Dreambooth or Finetune in the dev2 branch?
I trained a loha on sdxl with the last two updates and tried various parameters, but always had a hard time getting satisfactory results; it could be due to a couple of things.
error: unrecognized arguments: --train_text_encoder. Apparently Kohya has removed this for 1.5 training, and when the Dreambooth model is only 2GB you know it does not have the TE, since the model it was trained from is 4.7GB.
sd 1.5 trains the TE by default
It didn't for me, but I haven't used 1.5 since 2.0 was released; I just had to use it to help LyCORIS test something.
Here is the executed command:
accelerate launch --num_cpu_threads_per_process=2 "./train_db.py" --pretrained_model_name_or_path="/workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V5.1.safetensors" --train_data_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion/img" --reg_data_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion/reg" --resolution="768,768" --output_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion/model" --logging_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion/log" --save_model_as=safetensors --full_bf16 --output_name="me_1e7" --lr_scheduler_num_cycles="4" --max_data_loader_n_workers="0" --learning_rate="1e-07" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="4160" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.0
When the text encoder is not trained, it is supposed to print
Text Encoder is not trained.
This message is not printed either.
So how do I know the text encoder was not trained? Because I extracted a LoRA and it says the text encoder is the same.
I did 30 trainings and so many of them were wasted because of this bug :/
@kohya-ss