huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

FluxPipeline - Multi-GPU Issue - When you define transformer= you get "Expected all tensors to be on the same device" #9450

Open CrackerHax opened 1 week ago

CrackerHax commented 1 week ago

Describe the bug

When I load the text_encoder like this:

model_id = "black-forest-labs/FLUX.1-schnell"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16
)
pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                         text_encoder_2=text_encoder,
                                         device_map="balanced", max_memory={0:"11GiB", 1:"11GiB", "cpu":"20GiB"})
pipe.vae.enable_tiling()

Everything works fine. But when I try to define the transformer the same way:

from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                         transformer=transformer,
                                         device_map="balanced", max_memory={0:"11GiB", 1:"11GiB", "cpu":"20GiB"})
pipe.vae.enable_tiling()

I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

Some consistency between the two cases would be expected here.

Reproduction

import torch
from diffusers import DiffusionPipeline, FluxPipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "black-forest-labs/FLUX.1-schnell"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16
)

from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                         text_encoder_2=text_encoder,
                                         transformer=transformer,
                                         device_map="balanced", max_memory={0:"11GiB", 1:"11GiB", "cpu":"20GiB"})
pipe.vae.enable_tiling()

Logs

No response

System Info

Who can help?

No response

riteshrm commented 1 week ago

It seems like the text_encoder is being loaded onto the GPU, while the transformer is running on the CPU.

yiyixuxu commented 1 week ago

Hi @CrackerHax, thanks for the issue! First, the BitsAndBytesConfig you imported from the transformers library won't work with the Flux transformer (it is a diffusers model; similar bnb support will be added in this PR: https://github.com/huggingface/diffusers/pull/9213). Second, because the transformer is larger than 11 GiB, it was placed on the CPU. You can find out where each component is placed with pipe.hf_device_map.
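
For example (a minimal sketch, assuming the `pipe` object from the reproduction above; the printed mapping is only illustrative):

# Inspect where device_map="balanced" placed each pipeline component.
print(pipe.hf_device_map)
# Illustrative output only: a component that does not fit the per-GPU
# max_memory budget, such as the Flux transformer here, can end up on "cpu":
# {'transformer': 'cpu', 'text_encoder_2': 0, 'text_encoder': 1, 'vae': 1}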

sayakpaul commented 1 week ago

Please refer to this doc to learn how to do this more appropriately: https://huggingface2.notion.site/How-to-split-Flux-transformer-and-run-inference-aa1583ad23ce47a78589a79bb9309ab0.

CrackerHax commented 1 week ago

This post isn't about doing it appropriately; it's about consistency. One would intuitively expect it to work the same way in both cases. If it is appropriate to do it the way you suggest, that should be made clear in the documentation.

CrackerHax commented 1 week ago

I ended up working around the problem by replacing the schnell transformer with the custom one in the transformer folder, running it as a full pipeline mapped to GPU 1, and running text_encoder_2 mapped to GPU 2.

sayakpaul commented 1 week ago

Yeah my bad. We're working on the documentation (cc: @stevhliu).

Pipeline and model-level device mapping are different because unifying them complicates the resource allocation process (as a diffusion system is a series of models and not just a single model). Since this is still fairly new, the documentation hasn't included it yet, but I do expect that to change very soon.

If you're using two GPUs, I would highly recommend using the approach from https://huggingface2.notion.site/How-to-split-Flux-transformer-and-run-inference-aa1583ad23ce47a78589a79bb9309ab0.
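
For context, the approach in that doc is a two-stage pattern: first encode the prompt with a pipeline that holds only the text encoders, then denoise with a transformer loaded under a model-level device map. Below is a rough sketch of that pattern (not the doc verbatim); the schnell checkpoint, the 11GiB memory budgets, and the sampler settings are assumptions carried over from this thread, and model-level device_map="auto" requires a diffusers version in which FluxTransformer2DModel defines _no_split_modules:

import gc
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

model_id = "black-forest-labs/FLUX.1-schnell"
prompt = "a photo of a cat holding a sign that says hello world"

# Stage 1: load only the text encoders (no transformer/VAE) and encode the prompt.
text_pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=None,
    vae=None,
    device_map="balanced",
    max_memory={0: "11GiB", 1: "11GiB"},
    torch_dtype=torch.bfloat16,
)
with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=256
    )

# Free the text encoders before loading the transformer.
del text_pipe
gc.collect()
torch.cuda.empty_cache()

# Stage 2: load the transformer with a model-level device map and denoise to latents.
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "11GiB", 1: "11GiB"},
    torch_dtype=torch.bfloat16,
)
denoise_pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    torch_dtype=torch.bfloat16,
)
latents = denoise_pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=4,
    guidance_scale=0.0,
    output_type="latent",
).images
# The packed latents still need to be unpacked and decoded with the VAE in a
# third stage, as shown in the linked doc.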

UmutAlihan commented 6 days ago

I checked your recommended approach. However, this line of code raises the error below :/

transformer = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "12GB", 1: "12GB"},
    torch_dtype=torch.bfloat16,
    cache_dir="models"
)

ValueError: FluxTransformer2DModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

This happens even though the tutorial says it is implemented. Why is this happening? Any ideas?

sayakpaul commented 6 days ago

Can you provide the code snippet you're using?

_no_split_modules is defined here: https://github.com/huggingface/diffusers/blob/aa73072f1f7014635e3de916cbcf47858f4c37a0/src/diffusers/models/transformers/transformer_flux.py#L225
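
If your installed diffusers predates that line, the attribute is missing and model-level device mapping will fail with the error above. A quick check (a minimal sketch, assuming only that diffusers is importable):

# Print the installed diffusers version and whether FluxTransformer2DModel
# defines _no_split_modules (required for model-level device_map="auto").
import diffusers
from diffusers import FluxTransformer2DModel

print(diffusers.__version__)
print(getattr(FluxTransformer2DModel, "_no_split_modules", None))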