dtype error when using controlnet fp32 and mainpipe bf16

PromeAIpro commented 1 month ago

Describe the bug

A dtype error occurs when the controlnet is loaded as fp32 and the main pipeline is loaded as bf16.

Reproduction

import torch
from diffusers.utils import load_image
from diffusers.pipelines.flux.pipeline_flux_controlnet import FluxControlNetPipeline
from diffusers.models.controlnet_flux import FluxControlNetModel

base_model = 'black-forest-labs/FLUX.1-dev'
controlnet_model = 'promeai/FLUX.1-controlnet-lineart-promeai'
controlnet = FluxControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.float32)
pipe = FluxControlNetPipeline.from_pretrained(base_model, controlnet=controlnet, torch_dtype=torch.bfloat16)
pipe.to("cuda")

control_image = load_image("https://huggingface.co/promeai/FLUX.1-controlnet-lineart-promeai/resolve/main/images/example-control.jpg")
prompt = "cute anime girl with massive fluffy fennec ears and a big fluffy tail blonde messy long hair blue eyes wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere"
image = pipe(
    prompt, 
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
    num_inference_steps=28, 
    guidance_scale=3.5,
).images[0]
image.save("./image.jpg")

Logs

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12.20it/s]
Loading pipeline components...:  86%|███████████████████████████████████████████████████████████████████████████████▋             | 6/7 [00:00<00:00,  5.94it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.64it/s]
  0%|                                                                                                                                    | 0/28 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data3/home/srchen/PROJECT/flux_controlnet_annotator/test_controlnet.py", line 14, in <module>
    image = pipe(
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data3/home/srchen/test_diffusers/diffusers/src/diffusers/pipelines/flux/pipeline_flux_controlnet.py", line 849, in __call__
    controlnet_block_samples, controlnet_single_block_samples = self.controlnet(
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data3/home/srchen/test_diffusers/diffusers/src/diffusers/models/controlnet_flux.py", line 270, in forward
    hidden_states = self.x_embedder(hidden_states)
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/srchen/miniconda3/envs/train_flux_controlnet/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float
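
The same failure can be reproduced without the pipeline. A minimal sketch, assuming only plain PyTorch on CPU: a bf16 input is fed into an fp32 `nn.Linear`, which stands in for `x_embedder`. The layer sizes and the cast to the parameter dtype are illustrative assumptions, not the fix adopted in diffusers.

```python
import torch
import torch.nn as nn

# Stand-in for the controlnet's x_embedder: a linear layer kept in fp32.
x_embedder = nn.Linear(8, 4)  # parameters default to torch.float32

# Stand-in for the latents produced by the bf16 main pipeline.
hidden_states = torch.randn(2, 8, dtype=torch.bfloat16)

# Mixed dtypes make the matmul inside F.linear raise a RuntimeError.
try:
    x_embedder(hidden_states)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"RuntimeError: {err}")

# Casting the input to the layer's parameter dtype avoids the mismatch.
out = x_embedder(hidden_states.to(x_embedder.weight.dtype))
print(out.dtype)
```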

System Info

🤗 Diffusers version: 0.31.0.dev0
Platform: Linux-5.10.0-28-amd64-x86_64-with-glibc2.31
Running on Google Colab?: No
Python version: 3.10.14
PyTorch version (GPU?): 2.4.1+cu121 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Huggingface_hub version: 0.24.7
Transformers version: 4.44.2
Accelerate version: 0.34.2
PEFT version: not installed
Bitsandbytes version: not installed
Safetensors version: 0.4.5
xFormers version: not installed
Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB NVIDIA A100-SXM4-80GB, 81920 MiB
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul @yiyixuxu

PromeAIpro commented 1 month ago

Inference runs without the dtype error when wrapped in autocast_ctx, but the output is a black image.

import torch
from diffusers.utils import load_image
from diffusers.pipelines.flux.pipeline_flux_controlnet import FluxControlNetPipeline
from diffusers.models.controlnet_flux import FluxControlNetModel

base_model = 'black-forest-labs/FLUX.1-dev'
controlnet_model = 'promeai/FLUX.1-controlnet-lineart-promeai'

controlnet = FluxControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.float32)
pipe = FluxControlNetPipeline.from_pretrained(base_model, controlnet=controlnet, torch_dtype=torch.bfloat16)
pipe.to("cuda")

autocast_ctx = torch.autocast('cuda')

control_image = load_image("https://huggingface.co/promeai/FLUX.1-controlnet-lineart-promeai/resolve/main/images/example-control.jpg")
prompt = "cute anime girl with massive fluffy fennec ears and a big fluffy tail blonde messy long hair blue eyes wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere"
with autocast_ctx:
    image = pipe(
        prompt, 
        control_image=control_image,
        controlnet_conditioning_scale=0.6,
        num_inference_steps=28, 
        guidance_scale=3.5,
    ).images[0]
    image.save("./image.jpg")

PromeAIpro commented 1 month ago

This is because T5 does not work correctly under autocast_ctx and outputs NaN.
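
One possible explanation, not confirmed in this thread: `torch.autocast('cuda')` with no `dtype` argument runs eligible ops in float16, whose largest finite value is 65504, so large activations can overflow to `inf` and then turn into NaN. The sketch below only illustrates that numeric behavior with plain tensors; the value 70000 is an arbitrary assumption, not a measured T5 activation.

```python
import torch

# An arbitrary value above the float16 maximum of 65504 (assumption).
big = torch.tensor(70000.0)

as_fp16 = big.to(torch.float16)   # overflows to inf
as_bf16 = big.to(torch.bfloat16)  # stays finite, bf16 shares fp32's exponent range

print(torch.isinf(as_fp16).item())
print(torch.isfinite(as_bf16).item())

# Once an inf appears, ordinary arithmetic such as inf - inf yields NaN.
nan_result = as_fp16 - as_fp16
print(torch.isnan(nan_result).item())
```

If this is the cause, passing `dtype=torch.bfloat16` to `torch.autocast` may be worth trying.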

sayakpaul commented 1 month ago

I think this is only relevant when training and running intermediate validation, right?

PromeAIpro commented 1 month ago

Yes, it happens when running intermediate validation.

PromeAIpro commented 1 month ago

In other training scripts we use autocast_ctx, but I don't know why T5 outputs NaN. Maybe we should fix this.

sayakpaul commented 1 month ago

Cc: @ArthurZucker from the transformers team. Have you seen this issue, i.e., that inference with T5 under autocast is unstable?

sayakpaul commented 1 month ago

On the other hand, if it's a training-only thing (and only for intermediate validation), I think we should try to handle it in the training script instead. Could we try that?

PromeAIpro commented 1 month ago

Is there a way for diffusers to clone the controlnet? We are considering cloning a copy and converting it to bf16 for validation, or just loading the whole pipeline in fp32 (which costs a lot of memory). @sayakpaul
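
The cloning idea can be sketched with generic PyTorch, since a controlnet is an `nn.Module`. This assumes `copy.deepcopy` is acceptable for the model in question; the small `nn.Sequential` below is a hypothetical stand-in for the real controlnet, and this is not a dedicated diffusers API.

```python
import copy

import torch
import torch.nn as nn

# Tiny stand-in for the fp32 controlnet being trained (hypothetical module).
controlnet = nn.Sequential(nn.Linear(8, 8), nn.SiLU(), nn.Linear(8, 4))

# Deep-copy first so the cast does not touch the training weights,
# then convert only the copy to bf16 for validation.
val_controlnet = copy.deepcopy(controlnet).to(torch.bfloat16)
val_controlnet.eval()

train_dtypes = {p.dtype for p in controlnet.parameters()}
val_dtypes = {p.dtype for p in val_controlnet.parameters()}
print(train_dtypes, val_dtypes)
```

The copy costs extra memory for the duration of validation, so it may make sense to delete it afterwards.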

PromeAIpro commented 1 month ago

It works if we precompute the text embeddings and then run the rest of the pipeline under autocast.
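
A rough sketch of that workaround follows. It assumes the pipeline's `encode_prompt` returns a `(prompt_embeds, pooled_prompt_embeds, text_ids)` tuple and that `__call__` accepts `prompt_embeds` and `pooled_prompt_embeds`, which is how the Flux pipelines appear to behave; check this against the installed diffusers version. The function name `validate_with_precomputed_embeds` and its parameters are hypothetical.

```python
import torch


def validate_with_precomputed_embeds(pipe, prompt, control_image,
                                     device_type="cuda",
                                     autocast_dtype=torch.bfloat16,
                                     **pipe_kwargs):
    """Encode the prompt outside autocast, then denoise inside it."""
    # The text encoders (CLIP and T5) run in their own dtype here,
    # because no autocast context is active yet.
    with torch.no_grad():
        prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt
        )

    # Only the remaining forward passes run under autocast.
    with torch.autocast(device_type, dtype=autocast_dtype):
        result = pipe(
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            control_image=control_image,
            **pipe_kwargs,
        )
    return result.images[0]
```

It could be called as `validate_with_precomputed_embeds(pipe, prompt, control_image, num_inference_steps=28, guidance_scale=3.5)` with the `pipe` from the scripts above.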

ArthurZucker commented 1 month ago

Yep, training with T5 under autocast or a different precision is hard; see https://github.com/huggingface/transformers/issues/20287. There are lots of linked issues about how hard training and post-training are with T5 in particular.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 3 weeks ago

I guess this issue is now resolved?
