huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

ControlNet with SDXL infers black images even after rebasing on #4038 #4185

Closed yutongli closed 1 year ago

yutongli commented 1 year ago

Describe the bug

I have been closely following the amazing https://github.com/huggingface/diffusers/pull/4038. I pulled the new code and trained for 10,000 steps; training runs fine, however the validation images are all black. I assumed that with #4038 the black-image issue would be fixed. Any clues?

Reproduction

here's my training config:

export MODEL_DIR="stabilityai/stable-diffusion-xl-base-0.9"
export VAE_DIR="madebyollin/sdxl-vae-fp16-fix"
export OUTPUT_DIR="product_train_output_extract_1stbatch_100k_sdxl0.9"

accelerate launch --mixed_precision="fp16" --multi_gpu train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --pretrained_vae_model_name_or_path=$VAE_DIR \
  --dataset_name=all_training_full_extract \
  --image_column="target" \
  --conditioning_image_column="source" \
  --caption_column="prompt" \
  --resolution=768 \
  --learning_rate=2e-5 \
  --validation_image "./val1_extract_source.jpg" "./val2_extract_source.jpg" "./val3_extract_source.jpg" "./popchange.png" \
  --validation_prompt "a white trash can sitting on a table next to a plant" "a bottle of liquid with flower in it" "a rack with a bunch of shoes on it" "a doll in galaxy" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=8 \
  --tracker_project_name="product_train_output_extract_1stbatch_100k_sdxl0.9" \
  --num_train_epochs=20 \
  --report_to=wandb \
  --resume_from_checkpoint="latest"

Results:

  1. Validation images during training turn black after about 500 steps.
  2. I then continued training to 10,000 steps and ran inference with a checkpoint model, using the example code at https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md#inference (see the sketch after this list); however, the inference images are also black.
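
For reference, a minimal inference sketch along the lines of the README example, assuming the final ControlNet weights were written to the output directory used above. The checkpoint path, conditioning image, and prompt below are placeholders taken from the repro config, not the exact code that was run:

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image

# Hypothetical location of the trained ControlNet weights
# (train_controlnet_sdxl.py writes the final weights to --output_dir).
controlnet = ControlNetModel.from_pretrained(
    "product_train_output_extract_1stbatch_100k_sdxl0.9", torch_dtype=torch.float16
)
# Same fp16-fixed VAE that was used during training.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

conditioning = load_image("./val1_extract_source.jpg")  # one of the validation conditioning images
image = pipe(
    "a white trash can sitting on a table next to a plant",
    image=conditioning,
    num_inference_steps=30,
).images[0]
image.save("out.png")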

Logs

No response

System Info

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 1 year ago

What happens when the resolution is changed to 1024 from 768?

yutongli commented 1 year ago

> What happens when the resolution is changed to 1024 from 768?

Sure, let me try that. Just kicked off a new run with 1024; will update later.

sayakpaul commented 1 year ago

Cool. Additionally, I'd also recommend experimenting with learning rates. I don't know what kind of dataset you're using, though, so I can't comment exhaustively.

patrickvonplaten commented 1 year ago

I'd recommend using learning_rate=1e-05. For me the following command works pretty well:

train_controlnet_webdatasets.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-0.9 --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix --output_dir=controlnet-0-9-canny --mixed_precision=fp16 --resolution=1024 --learning_rate=1e-5 --max_train_steps=30000 --max_train_samples=3000000 --dataloader_num_workers=4 --validation_image ./c_image_0.png ./c_image_1.png ./c_image_2.png ./c_image_3.png ./c_image_4.png ./c_image_5.png ./c_image_6.png ./c_image_7.png --validation_prompt "two birds" "a snowy mountain" "a lake with clouds" "a woman using her phone" "a couple getting married" "a wedding" "a house at a lake" "a boat in nature" --train_shards_path_or_url "pipe:aws s3 cp s3://muse-datasets/laion-aesthetic6plus-data/{00000..01208}.tar -" --eval_shards_path_or_url "pipe:aws s3 cp s3://muse-datasets/laion-aesthetic6plus-data/{01209..01210}.tar -" --proportion_empty_prompts 0.5 --validation_steps=1000 --train_batch_size=12 --gradient_checkpointing --use_8bit_adam --enable_xformers_memory_efficient_attention --gradient_accumulation_steps=1 --report_to=wandb --seed=42 --push_to_hub

See some intermediate results here: https://wandb.ai/patrickvonplaten/sd_xl_train_controlnet/runs/7by0en10?workspace=

I strongly recommend using xformers as well.
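
For inference, memory-efficient attention can also be switched on directly on a pipeline. A minimal sketch, assuming xformers is installed (e.g. via pip install xformers); this is illustrative rather than the exact setup used in this run:

import torch
from diffusers import StableDiffusionXLPipeline

# Load any SDXL pipeline in fp16 and enable xformers attention on it.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()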

yutongli commented 1 year ago

> What happens when the resolution is changed to 1024 from 768?

> Sure, let me try that. Just kicked off a new run with 1024; will update later.

I have the results now. Previously, at 768, black images started to appear around 2,000 steps; now, at 1024, they start to appear around 4,000 steps. Let me try a different learning rate.

yutongli commented 1 year ago

> I'd recommend using learning_rate=1e-05. For me the following command works pretty well:
>
> train_controlnet_webdatasets.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-0.9 --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix --output_dir=controlnet-0-9-canny --mixed_precision=fp16 --resolution=1024 --learning_rate=1e-5 --max_train_steps=30000 --max_train_samples=3000000 --dataloader_num_workers=4 --validation_image ./c_image_0.png ./c_image_1.png ./c_image_2.png ./c_image_3.png ./c_image_4.png ./c_image_5.png ./c_image_6.png ./c_image_7.png --validation_prompt "two birds" "a snowy mountain" "a lake with clouds" "a woman using her phone" "a couple getting married" "a wedding" "a house at a lake" "a boat in nature" --train_shards_path_or_url "pipe:aws s3 cp s3://muse-datasets/laion-aesthetic6plus-data/{00000..01208}.tar -" --eval_shards_path_or_url "pipe:aws s3 cp s3://muse-datasets/laion-aesthetic6plus-data/{01209..01210}.tar -" --proportion_empty_prompts 0.5 --validation_steps=1000 --train_batch_size=12 --gradient_checkpointing --use_8bit_adam --enable_xformers_memory_efficient_attention --gradient_accumulation_steps=1 --report_to=wandb --seed=42 --push_to_hub
>
> See some intermediate results here: https://wandb.ai/patrickvonplaten/sd_xl_train_controlnet/runs/7by0en10?workspace=
>
> I strongly recommend using xformers as well.

Thank you. My current torch version is 2.0.1. Would using xformers require a torch version < 2.0?

sayakpaul commented 1 year ago

> Thank you. My current torch version is 2.0.1. Would using xformers require a torch version < 2.0?

No, it won't.

yutongli commented 1 year ago

I started a new run with the updated parameters from https://github.com/huggingface/diffusers/issues/4185#issuecomment-1645789362; I will update my results later. Thanks!

yutongli commented 1 year ago

> I started a new run with the updated parameters from #4185 (comment); I will update my results later. Thanks!

The results started to show black images after 1,500 steps, and the loss also increased drastically thereafter: https://wandb.ai/intuitivemachine/product_train_output_extract_1stbatch_100k_sdxl0.9_1024_lr1/runs/ditejr6q

yutongli commented 1 year ago

Does the following log output (bold in the original), which appears at the beginning of each validation, look expected?

07/23/2023 03:43:55 - INFO - main - Running validation...
{'controlnet'} was not found in config. Values will be initialized to default values.
Loaded scheduler as EulerDiscreteScheduler from scheduler subfolder of stabilityai/stable-diffusion-xl-base-0.9.
Loaded text_encoder_2 as CLIPTextModelWithProjection from text_encoder_2 subfolder of stabilityai/stable-diffusion-xl-base-0.9.
Loaded tokenizer_2 as CLIPTokenizer from tokenizer_2 subfolder of stabilityai/stable-diffusion-xl-base-0.9.
Loaded tokenizer as CLIPTokenizer from tokenizer subfolder of stabilityai/stable-diffusion-xl-base-0.9.
Loaded text_encoder as CLIPTextModel from text_encoder subfolder of stabilityai/stable-diffusion-xl-base-0.9.
Loading pipeline components...: 100%|████████| 7/7 [00:01<00:00, 4.27it/s]
{'lower_order_final', 'predict_x0', 'disable_corrector', 'solver_order', 'solver_type', 'solver_p'} was not found in config. Values will be initialized to default values.

sayakpaul commented 1 year ago

That should not be a problem.

Could you take an intermediate checkpoint of the ControlNet and run inference with it to see if those images are also black? This comment (https://github.com/huggingface/diffusers/pull/4038#issue-1798660497) has an example of how to do this.

If this config were a problem, I think we would have gotten black images from the start, not after 4,000 steps as you mentioned here: https://github.com/huggingface/diffusers/issues/4185#issuecomment-1645786463.

yutongli commented 1 year ago

I did run inference separately with a saved checkpoint after I saw that validation during training generated black images, but no luck. Yes, it also confused me that the black images only started to appear after some number of steps, as mentioned in (1) https://github.com/huggingface/diffusers/issues/4185#issuecomment-1645786463 and (2) https://github.com/huggingface/diffusers/issues/4185#issuecomment-1646329874. I have since checked out a commit after https://github.com/huggingface/diffusers/pull/4038 and before current main (which was broken: https://github.com/huggingface/diffusers/issues/4206#issue-1816657535), and it seems to be working with the same config I used above. I am monitoring.

sayakpaul commented 1 year ago

I wonder what changed.

Anyway, if it's working (which is great news, of course), could we maybe close this issue?

AmericanPresidentJimmyCarter commented 1 year ago

OK, I hit this bug when training in bfloat16. If you train in bfloat16, the VAE still gets run in fp16, which breaks it. To work around this, use the corrected fp16 VAE (madebyollin/sdxl-vae-fp16-fix) until this issue is fixed. Extremely annoying.
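
A quick way to confirm this failure mode is to check whether the VAE round-trip stays finite in half precision; NaN/Inf activations are what eventually decode to all-black images. The sketch below is illustrative only (the helper name and image path are made up), not code from this thread:

import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

def vae_roundtrip_is_finite(vae_id, image_path, dtype, subfolder=None):
    # Encode and decode one image, and report whether the result stays finite.
    vae = AutoencoderKL.from_pretrained(vae_id, subfolder=subfolder, torch_dtype=dtype).to("cuda")
    img = Image.open(image_path).convert("RGB").resize((1024, 1024))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0).to("cuda", dtype=dtype)
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()
        recon = vae.decode(latents).sample
    return torch.isfinite(recon).all().item()

# The stock SDXL VAE tends to overflow in half precision; the fixed one should not.
print(vae_roundtrip_is_finite("stabilityai/stable-diffusion-xl-base-0.9", "input.png", torch.float16, subfolder="vae"))
print(vae_roundtrip_is_finite("madebyollin/sdxl-vae-fp16-fix", "input.png", torch.float16))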