RuntimeError: Input type (c10::Half) and bias type (float) mismatch in train_text_to_image_lora_sdxl.py

Describe the bug

I'm encountering the same error as described in the closed issue #4478.

I'm running the train_text_to_image_lora_sdxl.py script, and the VAE gives me the following error:

RuntimeError: Input type (c10::Half) and bias type (float) should be the same

See "Reproduction", "Logs", and "System Info" for all the details.
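
A minimal sketch of the kind of mismatch the error message describes, assuming the VAE's parameters are in float32 while the input batch has been cast to fp16. This is an assumption, not something confirmed from the script. The `conv_in` layer and `try_forward` helper below are hypothetical stand-ins, not code from the training script:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the VAE encoder's first layer, with float32 parameters.
conv_in = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Stand-in for a batch of pixel values that was cast to fp16.
pixel_values = torch.randn(1, 3, 16, 16).to(torch.float16)


def try_forward(layer, x):
    """Return (ok, message); ok is False if the forward pass raised RuntimeError."""
    try:
        layer(x)
        return True, ""
    except RuntimeError as err:
        return False, str(err)


# Half input into float32 parameters: expected to fail with a dtype mismatch.
mismatch_ok, mismatch_msg = try_forward(conv_in, pixel_values)

# Casting the input to the layer's own parameter dtype avoids the mismatch.
param_dtype = next(conv_in.parameters()).dtype
fixed_ok, _ = try_forward(conv_in, pixel_values.to(param_dtype))

print(mismatch_ok, mismatch_msg)
print(fixed_ok)
```

If the same thing is happening in the script, casting `pixel_values` to the VAE's dtype before `vae.encode(...)` might be one way around it, but I have not verified that.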

Any idea why? Do you need more details, or would you like me to run other experiments?

Thanks!

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --caption_column="text" \
  --resolution=1024 \
  --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=500 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --dataloader_num_workers=0 \
  --report_to="wandb" \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-sdxl-txt" \
  --train_text_encoder \
  --validation_prompt="cute dragon creature" \
  --mixed_precision="fp16" \
  --rank=4

Logs

08/15/2023 18:41:26 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'variance_type'} was not found in config. Values will be initialized to default values.
wandb: Currently logged in as: mnslarcher. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in /home/mnslarcher/ai/sd-xl-hands/wandb/run-20230815_184142-flioaupp
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wobbly-resonance-5
wandb: ⭐️ View project at https://wandb.ai/mnslarcher/text2image-fine-tune
wandb: 🚀 View run at https://wandb.ai/mnslarcher/text2image-fine-tune/runs/flioaupp
08/15/2023 18:41:46 - INFO - __main__ - ***** Running training *****
08/15/2023 18:41:46 - INFO - __main__ -   Num examples = 833
08/15/2023 18:41:46 - INFO - __main__ -   Num Epochs = 2
08/15/2023 18:41:46 - INFO - __main__ -   Instantaneous batch size per device = 1
08/15/2023 18:41:46 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/15/2023 18:41:46 - INFO - __main__ -   Gradient Accumulation steps = 1
08/15/2023 18:41:46 - INFO - __main__ -   Total optimization steps = 1666
Steps:   0%|          | 0/1666 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mnslarcher/ai/sd-xl-hands/train_text_to_image_lora_sdxl.py", line 1281, in <module>
    main(args)
  File "/home/mnslarcher/ai/sd-xl-hands/train_text_to_image_lora_sdxl.py", line 1008, in main
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 242, in encode
    h = self.encoder(x)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/vae.py", line 110, in forward
    sample = self.conv_in(sample)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run wobbly-resonance-5 at: https://wandb.ai/mnslarcher/text2image-fine-tune/runs/flioaupp
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230815_184142-flioaupp/logs
Traceback (most recent call last):
  File "/home/mnslarcher/anaconda3/envs/hands/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mnslarcher/anaconda3/envs/hands/bin/python', 'train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--dataset_name=lambdalabs/pokemon-blip-captions', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--num_train_epochs=2', '--gradient_accumulation_steps=1', '--checkpointing_steps=500', '--learning_rate=1e-04', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--dataloader_num_workers=0', '--report_to=wandb', '--seed=42', '--output_dir=sd-pokemon-model-lora-sdxl-txt', '--train_text_encoder', '--validation_prompt=cute dragon creature', '--mixed_precision=fp16', '--rank=4']' returned non-zero exit status 1.
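
To narrow down which module has parameters in an unexpected precision, a small diagnostic along these lines could be run on the VAE just before the failing `vae.encode(...)` call. This is only a sketch: `param_dtypes` is a hypothetical helper, and the toy model stands in for the real VAE:

```python
from collections import Counter

import torch
import torch.nn as nn


def param_dtypes(module: nn.Module) -> Counter:
    """Count the parameters of `module` by dtype."""
    return Counter(p.dtype for p in module.parameters())


# Toy model with one float32 layer and one float16 layer (stand-in for the VAE).
toy = nn.Sequential(nn.Conv2d(3, 4, 3), nn.Conv2d(4, 4, 3).half())

counts = param_dtypes(toy)
print(counts)
```

Comparing the result for the VAE with `pixel_values.dtype` should show whether the input and the parameters disagree.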

System Info

OS Name: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090

diffusers-cli env:

- `diffusers` version: 0.19.3
- Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Huggingface_hub version: 0.16.4
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: NO

environment.yml (conda):

name: myenv
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
    - accelerate==0.21.0
    - datasets==2.14.4
    - diffusers==0.19.3
    - ftfy==6.1.1
    - Jinja2==3.1.2
    - tensorboard==2.14.0
    - torch==2.0.1
    - torchvision==0.15.2
    - transformers==4.31.0
    - wandb==0.15.8

default_config.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Who can help?

@sayak

huggingface / diffusers

RuntimeError: Input type (c10::Half) and bias type (float) mismatch in train_text_to_image_lora_sdxl.py #4619