Open CrackerHax opened 1 week ago
It seems like the text_encoder is being loaded onto the GPU, while the transformer is running on the CPU.
hi @CrackerHax
thanks for the issue!
First, the BitsAndBytesConfig you imported from the transformers library won't work with the Flux transformer (it is a diffusers model, and similar bnb support will be added in this PR: https://github.com/huggingface/diffusers/pull/9213).
Second, because the transformer is larger than 11 GB, it was placed on the CPU. You can find out where each component is placed with pipe.hf_device_map.
Please refer to this doc to know how to do this more appropriately: https://huggingface2.notion.site/How-to-split-Flux-transformer-and-run-inference-aa1583ad23ce47a78589a79bb9309ab0.
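To make the pipe.hf_device_map check concrete, here is a minimal sketch of a helper that lists which components were left off the GPUs. The helper name components_on_cpu and the example map are hypothetical. The sketch assumes the map is a plain dict from component name to a device given as a GPU index, a string such as "cuda:1", "cpu", or "disk"; the exact values on a real pipeline may differ.

```python
def components_on_cpu(device_map):
    """Return the sorted names of components placed on CPU or disk.

    `device_map` is assumed to be a dict shaped like `pipe.hf_device_map`:
    component name -> device (int GPU index, or a string such as "cpu").
    """
    offloaded = []
    for name, device in device_map.items():
        # Normalize to a string so int GPU indices and device strings compare uniformly.
        if str(device) in ("cpu", "disk"):
            offloaded.append(name)
    return sorted(offloaded)


# Hypothetical map for illustration only, not output captured from a real run.
example_map = {
    "text_encoder": 0,
    "text_encoder_2": 1,
    "vae": 0,
    "transformer": "cpu",
}

print(components_on_cpu(example_map))
```

With a real pipeline you would pass pipe.hf_device_map instead of example_map. Any name in the result is a candidate for the "cpu and cuda" device mismatch reported in this issue.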
This post isn't about doing it appropriately, it's about consistency. One would intuitively expect it to work the same way in both cases. If the way you suggest is the appropriate one, that should be made clear in the documentation.
I ended up working around the problem by replacing the schnell transformer with the custom one in the transformer folder, running it as a full pipeline mapped to GPU 1 and running text_encoder_2 mapped to GPU 2.
Yeah my bad. We're working on the documentation (cc: @stevhliu).
Pipeline and model-level device mapping are different because unifying them complicates the resource allocation process (as a diffusion system is a series of models and not just a single model). Since this is still fairly new, the documentation hasn't included it yet, but I do expect that to change very soon.
If you're using two GPUs, I would highly recommend using the approach from https://huggingface2.notion.site/How-to-split-Flux-transformer-and-run-inference-aa1583ad23ce47a78589a79bb9309ab0.
I checked your recommended approach. However, this call raises an error :/
transformer = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "12GB", 1: "12GB"},
    torch_dtype=torch.bfloat16,
    cache_dir="models",
)
ValueError: FluxTransformer2DModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
This happens even though the tutorial describes it as implemented. Why is this happening? Any ideas?
Can you provide your code snippet that you're using?
_no_split_modules is defined here: https://github.com/huggingface/diffusers/blob/aa73072f1f7014635e3de916cbcf47858f4c37a0/src/diffusers/models/transformers/transformer_flux.py#L225
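One quick way to tell whether the installed model class carries that attribute is to inspect it directly. The sketch below is hedged: supports_auto_device_map is a hypothetical helper, the two stand-in classes exist only for the demonstration, and it assumes (going by the error message above) that the attribute being missing or None is what triggers the ValueError.

```python
def supports_auto_device_map(model_cls):
    """Return True if `model_cls` defines a non-None `_no_split_modules`.

    Assumption: a missing or None attribute is what leads to the
    "does not support `device_map='auto'`" error quoted in this thread.
    """
    return getattr(model_cls, "_no_split_modules", None) is not None


# Stand-in classes for illustration only; these are not diffusers models.
class ModelWithSupport:
    _no_split_modules = ["SomeBlock"]


class ModelWithoutSupport:
    _no_split_modules = None


print(supports_auto_device_map(ModelWithSupport))
print(supports_auto_device_map(ModelWithoutSupport))
```

In a real environment you would pass FluxTransformer2DModel itself. If the check comes back False, the installed diffusers is likely older than the commit linked above, and comparing diffusers.__version__ against the version the tutorial uses would be the next step.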
Describe the bug
When I load the text_encoder like this:
Everything works fine. But when I try to define the transformer the same way:
I get this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
It seems some consistency would be expected here.