huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

ComfyUI example for train_dreambooth_lora_sdxl_advanced.py #8794

Closed Heimdall-Nss closed 2 months ago

Heimdall-Nss commented 3 months ago

Can you provide a ComfyUI example for a model trained with the train_dreambooth_lora_sdxl_advanced.py script?

I used the train_dreambooth_lora_sdxl_advanced.py script to train and run inference, and got good results. However, when I apply the trained LoRA and embedding in ComfyUI, the results are very poor. Following the inference process in https://github.com/huggingface/diffusers/tree/main/examples/advanced_diffusion_training, my local inference script is as follows:

import torch
from diffusers import DiffusionPipeline
from safetensors.torch import load_file

pretrain_model = "./pretrained_models/SDXL"
local_weights_path = "./cat1_sdxl_lora/checkpoint-200/pytorch_lora_weights.safetensors"
embedding_path = "./cat1_sdxl_lora/checkpoint-100/cat1_sdxl_lora_emb.safetensors"

pipe = DiffusionPipeline.from_pretrained(
    pretrain_model,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# load the trained LoRA weights
pipe.load_lora_weights(local_weights_path)

# load the pivotal-tuning embeddings for both SDXL text encoders
state_dict = load_file(embedding_path)
# embeddings for text encoder 1 (CLIP ViT-L/14)
pipe.load_textual_inversion(state_dict["clip_l"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
# embeddings for text encoder 2 (CLIP ViT-G/14)
pipe.load_textual_inversion(state_dict["clip_g"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

instance_token = "<s0><s1>"
prompt = f"a {instance_token} cat lying in front of the courtyard, in the background of Chinese style"

image = pipe(prompt=prompt, num_inference_steps=25, cross_attention_kwargs={"scale": 1.0}).images[0]
image.save("cats/cat14.jpg")
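
As a quick sanity check, after load_textual_inversion both tokenizers should resolve the new tokens to valid ids; a minimal way to verify that:

# sanity check: the pivotal-tuning tokens should map to real token ids in both tokenizers
for tok in ["<s0>", "<s1>"]:
    print(tok, pipe.tokenizer.convert_tokens_to_ids(tok), pipe.tokenizer_2.convert_tokens_to_ids(tok))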

Below is my workflow. What is wrong? Or could you provide an example for reference?

[Screenshot 2024-07-05 15:19:11: my ComfyUI workflow]

asomoza commented 3 months ago

Hi, can you share the lora and embeddings files? Also can you show an image you consider a good result?

Heimdall-Nss commented 2 months ago

> Hi, can you share the lora and embeddings files? Also can you show an image you consider a good result?

I used 5 cat pictures for training, which looked like this:

[Screenshot 2024-07-08 11:51:25: the training images]

After that, I chose the 100-step checkpoint for testing. The local test script:

import torch
from diffusers import DiffusionPipeline
from safetensors.torch import load_file

pretrain_model = "./pretrained_models/SDXL"
local_weights_path = "./cat1_sdxl_lora/checkpoint-100/pytorch_lora_weights.safetensors"
embedding_path = "./cat1_sdxl_lora/checkpoint-100/cat1_sdxl_lora_emb.safetensors"

torch.cuda.set_device(1)
pipe = DiffusionPipeline.from_pretrained(
    pretrain_model,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# load the trained LoRA weights
pipe.load_lora_weights(local_weights_path)

# load the pivotal-tuning embeddings for both SDXL text encoders
state_dict = load_file(embedding_path)
# embeddings for text encoder 1 (CLIP ViT-L/14)
pipe.load_textual_inversion(state_dict["clip_l"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
# embeddings for text encoder 2 (CLIP ViT-G/14)
pipe.load_textual_inversion(state_dict["clip_g"], token=["<s0>", "<s1>"], text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

instance_token = "<s0><s1>"
prompt = f"a {instance_token} cat running on the beach"

image = pipe(prompt=prompt, num_inference_steps=25, cross_attention_kwargs={"scale": 1.0}).images[0]

The model files:

[Screenshot 2024-07-08 11:54:50: model files]

The generated result:

[Screenshot 2024-07-08 11:55:54: generated result]

After that, I built a simple workflow using the LoRA and embedding in ComfyUI, as follows:

[Screenshot 2024-07-08 11:56:27: ComfyUI workflow]

The corresponding model weights were selected, and the generated results were as follows:

[two generated images]

Note that after adjusting the latent size to 1024 the result is better than at 512, probably because the default input size is 1024, but the picture texture is still different. In other words, can I reproduce the script's results in ComfyUI?

asomoza commented 2 months ago

I'm asking for the lora and embeddings files because I don't have any, and I also don't have the time to train one right now. So if you can't share yours, you'll have to wait until I have the time to train one.

There are a couple of tips I can give you though.

With SDXL you always have to use 1024px (or an equivalent total number of pixels); if you generate at 512px you will almost always get bad results.
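
In the diffusers script that just means requesting the native resolution explicitly (a minimal sketch reusing the pipe and prompt from your script; in ComfyUI it's the size of the Empty Latent Image node):

# SDXL is trained at 1024x1024, so request it explicitly rather than the 512 default of older models
image = pipe(prompt=prompt, width=1024, height=1024, num_inference_steps=25).images[0]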

In ComfyUI the embeddings get loaded automatically by filename, but I don't think it does the same for the token, so you have to use something like this in the prompt: a embedding:cat_epoch100 <s0><s1> cat running on the beach. I'm not completely sure about this because I almost never use TI, so I'll have to test it.

The training you're doing is also too simple: with 4 images of a generic cat it will be very hard to tell whether the TI is loading or not.

For the lora you need to use the training arg that makes it comfyui compatible; I don't know if you're doing that.

All of the above assumes you have the correct filenames in the correct directories; you have to look at the comfyui console output to see whether everything gets loaded correctly or whether there are errors/warnings.

Heimdall-Nss commented 2 months ago

> All of the above assumes you have the correct filenames in the correct directories; you have to look at the comfyui console output to see whether everything gets loaded correctly or whether there are errors/warnings.

Thank you for your replies. I tried to upload the relevant files, but due to company regulations I cannot use external file-sharing services. However, with your tips I seem to have found the problem: I had ignored the log view on the ComfyUI server. As shown in the following figure, the lora failed to load. I will continue to troubleshoot the error.

[Screenshot 2024-07-08 16:07:00: ComfyUI console showing lora load errors]

Heimdall-Nss commented 2 months ago

> So if you can't share yours, you'll have to wait until I have the time to train one.

The lora and emb files are here: https://filetransfer.io/data-package/mlxBCa04#link

asomoza commented 2 months ago

Those errors are because the lora is in the diffusers format, which comfyui doesn't load. To be able to use it in the webuis you'll need to train with the --output_kohya_format arg.
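
For illustration, a training invocation with that flag might look like this (the paths and the other args here are placeholders, not the exact ones from this thread):

accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --pretrained_model_name_or_path="./pretrained_models/SDXL" \
  --instance_data_dir="./cat_images" \
  --output_dir="./cat1_sdxl_lora" \
  --instance_prompt="a TOK cat" \
  --token_abstraction="TOK" \
  --train_text_encoder_ti \
  --output_kohya_format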

I'll test your files now.

asomoza commented 2 months ago

I converted the lora and it seems to work. As I told you before, it's really hard to see whether the TI is working or whether it needs the special tokens, but it seems so:

[Comparison images: just LoRA (ComfyUI_00292_), LoRA + TI (ComfyUI_00293_), LoRA + TI with token (ComfyUI_00294_)]

Heimdall-Nss commented 2 months ago

> I converted the lora and it seems to work.

Thank you very much for your help. I found a similar problem in this issue: https://github.com/comfyanonymous/ComfyUI/issues/1144, and found this conversion script: https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_sdxl_lora_to_webui.py. I will continue to try.
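
If I'm reading that script right, it takes the input lora path positionally, with an optional --output_lora for the destination, so the call would be something like the following; double-check with --help:

python scripts/convert_diffusers_sdxl_lora_to_webui.py \
  ./cat1_sdxl_lora/checkpoint-100/pytorch_lora_weights.safetensors \
  --output_lora ./cat1_sdxl_lora/checkpoint-100/pytorch_lora_weights_webui.safetensors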

asomoza commented 2 months ago

Yeah, that's the script I used to convert your LoRA just now.