huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

SDXL ControlNet training inference output differs from wandb #7080

Closed: brycegoh closed this issue 3 months ago

brycegoh commented 7 months ago

Describe the bug

I am training a ControlNet using the diffusers training script. I have set it to save a checkpoint every 200 steps. However, when I try to use the saved safetensors file for inference, the output is completely different from the one reported in wandb.

The inference code is the same as the one in the README for the SDXL ControlNet training script.

Please advise, thanks!

Reproduction

Both the wandb and inference outputs use the same prompt and conditioning image.

Inference output: image

Wandb reporting output: image

This is the inference code:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, UniPCMultistepScheduler, AutoencoderKL
from diffusers.utils import load_image
import torch

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_path = "username/repo_id"

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16, subfolder="checkpoint-200/controlnet")

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    base_model_path, controlnet=controlnet, vae=vae, torch_dtype=torch.float16
)

# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
# remove following line if xformers is not installed or when using Torch 2.0.
pipe.enable_xformers_memory_efficient_attention()
# memory optimization.
pipe.enable_model_cpu_offload()

control_image = load_image("./conditioning_image").convert('RGB')
prompt = "example prompt"

image = pipe(
    prompt, num_inference_steps=100, image=control_image
).images[0]

The subfolder checkpoint-200/controlnet contains diffusion_pytorch_model.safetensors and config.json.
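
For reference, the pushed Hub repo being loaded above is laid out roughly like this (repo id is the same placeholder as in the snippet; other checkpoint files and folders omitted):

username/repo_id/
├── checkpoint-200/
│   └── controlnet/
│       ├── config.json
│       └── diffusion_pytorch_model.safetensors
└── ...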

Logs

No response

System Info

Training on Runpod with an A40 GPU

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 7 months ago

Did you use the same VAE during training as well? And did you use FP16 during training?

Could you share your training command so that I can reproduce this?

Also, did you observe similar things when using the toy example shown in the README?

brycegoh commented 7 months ago

Did you use the same VAE during training as well? And did you use FP16 during training?

Could you share your training command so that I can reproduce this?

Also, did you observe similar things when using the toy example shown in the README?

  1. Yes, I used the same VAE.
  2. Sure, apologies for not providing it initially:
    accelerate launch diffusers/examples/controlnet/train_controlnet_sdxl.py \
    --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
    --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --dataset_name=$DATASET_NAME \
    --mixed_precision="fp16" \
    --resolution=1024 \
    --learning_rate=1e-5 \
    --lr_scheduler=cosine \
    --num_train_epochs=2 \
    --validation_image=$VALIDATION_IMG \
    --validation_prompt="$VALIDATION_PROMPT" \
    --validation_steps=$VALIDATION_STEPS \
    --train_batch_size=7 \
    --gradient_accumulation_steps=10 \
    --hub_model_id=$HF_HUB_REPO_ID \
    --report_to="wandb" \
    --tracker_project_name=sdxl_cn \
    --push_to_hub
  3. No, I have not tried it with the example dataset.

sayakpaul commented 7 months ago

Could you try 3?

brycegoh commented 7 months ago

Could you try 3?

Just tried the example command as listed here and I encountered the same issue. I trained for a total of 300 steps for debugging purposes. Here are the final weights and the checkpoint weights as well.

Training command:

accelerate launch diffusers/examples/controlnet/train_controlnet_sdxl.py  \
 --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
 --dataset_name=fusing/fill50k \
 --mixed_precision="fp16" \
 --resolution=1024 \
 --learning_rate=1e-5 \
 --max_train_steps=300 \
 --checkpointing_steps=100 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --validation_steps=100 \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --hub_model_id=brycegoh/sdxl-cn-example \
 --seed=42 \
 --report_to="wandb" \
 --tracker_project_name=sdxl_cn_example \
 --push_to_hub

Inference code:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, UniPCMultistepScheduler, AutoencoderKL
from diffusers.utils import load_image
import torch

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_path = "brycegoh/sdxl-cn-example"

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    base_model_path, controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()

control_image = load_image("./conditioning_image_1.png").convert('RGB')
prompt = "red circle with blue background"

image = pipe(
    prompt, num_inference_steps=100, image=control_image
).images[0]

Inference output: image

Wandb output: image

sayakpaul commented 7 months ago

Very weird thing.

I just created this PR wherein we additionally call the log_validation() function after serializing the ControlNet checkpoint. This is exactly the same behavior as the inference code. This helps validate whether the trained checkpoint is effective enough. I am happy to be proven wrong if that's not the case.

I have also added a note to the ControlNet SDXL training script mentioning that you should ensure you're using the same VAE at inference that you used during training.
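
For context, that post-save validation boils down to reloading the just-serialized ControlNet and re-running the validation prompt through a full pipeline. A simplified sketch of the idea (not the actual PR code; it assumes the controlnet-sdxl output directory and the validation inputs from the commands below):

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image

# reload the ControlNet from the directory the training script just saved it to
controlnet = ControlNetModel.from_pretrained("controlnet-sdxl", torch_dtype=torch.float16)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16
).to("cuda")

# run the same validation prompt/conditioning image pair that gets logged to wandb
validation_image = load_image("./conditioning_image_1.png").convert("RGB").resize((1024, 1024))
image = pipeline(
    "red circle with blue background",
    image=validation_image,
    num_inference_steps=20,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("post_save_validation.png")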

SDXL ControlNet

Command:

accelerate launch train_controlnet_sdxl.py \
 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
 --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
 --output_dir="controlnet-sdxl" \
 --dataset_name=fusing/fill50k \
 --max_train_samples=100 \
 --mixed_precision="fp16" \
 --resolution=1024 \
 --learning_rate=1e-5 \
 --max_train_steps=150 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --validation_steps=50 \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --report_to="wandb" \
 --seed=42 \
 --push_to_hub

Results:

SD ControlNet

Command:

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
 --output_dir="controlnet-sd" \
 --dataset_name=fusing/fill50k \
 --max_train_samples=100 \
 --mixed_precision="fp16" \
 --resolution=1024 \
 --learning_rate=1e-5 \
 --max_train_steps=150 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --validation_steps=50 \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --report_to="wandb" \
 --seed=42 \
 --push_to_hub

Results:

Hopefully, that helps?

rphly commented 7 months ago

@sayakpaul sorry, I don't quite understand what the next steps are from here. Should we try working from your branch, or is there something else?

brycegoh commented 7 months ago

Thanks @sayakpaul for the quick reply. I just tried your final output weights in my inference script and it seems to have the same issue.

I posted the inference notebook that I ran on Kaggle here (Public Repo Link). Please advise if I am getting something wrong.

Inference:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, UniPCMultistepScheduler, AutoencoderKL
from diffusers.utils import load_image
import torch

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_path = "sayakpaul/controlnet-sdxl"

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    base_model_path, vae=vae, controlnet=controlnet, torch_dtype=torch.float16
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()

control_image = load_image("/kaggle/input/base-images/conditioning_image_1.png").convert('RGB')
prompt = "red circle with blue background"

image = pipe(
    prompt, num_inference_steps=100, image=control_image
).images[0]

Output:

image

Expected output based on your wandb report:

image

sayakpaul commented 7 months ago

Could you test the latest changes, i.e., https://github.com/huggingface/diffusers/pull/7096/commits/937c66bf23d91a95113f4da1d1836da757551066?

Additionally, I would suggest matching the inference logic as closely as possible to the one used during logging:

from diffusers import UniPCMultistepScheduler, ControlNetModel, AutoencoderKL, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image, make_image_grid
import torch

pipeline_id = "stabilityai/stable-diffusion-xl-base-1.0"
vae_id = "madebyollin/sdxl-vae-fp16-fix"
controlnet_id = "sayakpaul/controlnet-sdxl"

controlnet = ControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16)
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    pipeline_id, controlnet=controlnet, vae=vae, torch_dtype=torch.float16
).to("cuda")
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)

control_image = load_image("conditioning_image_1.png").convert("RGB").resize((1024, 1024))
prompt = "red circle with blue background"
generator = torch.Generator(device="cuda").manual_seed(42)
images = pipeline(
    prompt=prompt, image=control_image, num_images_per_prompt=4, num_inference_steps=20, generator=generator
).images
make_image_grid([control_image] + images, 1, 5).save("image_grid.png")

yiyixuxu commented 7 months ago

The only issue here is that the control_image isn't resized to 1024x1024; there is nothing wrong with the training script. Let's update the inference code in the README @sayakpaul
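
Concretely, the only change needed in the original inference snippets above is resizing the conditioning image to the training resolution before calling the pipeline, e.g.:

from diffusers.utils import load_image

# resize the conditioning image to the training resolution (1024x1024 here)
control_image = load_image("./conditioning_image_1.png").convert("RGB").resize((1024, 1024))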

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

littletomatodonkey commented 1 month ago

I met the same problem: even with only 16 samples and 500 iterations, it cannot overfit and produce good results: https://github.com/huggingface/diffusers/issues/9179