huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Inconsistent results from Flux Model when loaded differently #9439

Open emil-malina opened 2 weeks ago

emil-malina commented 2 weeks ago

Describe the bug

I've observed strange behavior when loading the Flux.1-dev model. There are two ways to load the model that produce different results when run with the same seed: one comes from the HF diffusers docs, the other is inspired by the ai-toolkit repo.

Reproduction

First option: use from_pretrained on FluxPipeline. Second option: load the pipeline piece by piece.

Common initialization:

import gc

import torch

seed = 139
generator = torch.Generator(device="cpu")
base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16

def flush():
    # Release Python references first, then return cached CUDA memory.
    gc.collect()
    torch.cuda.empty_cache()

Option 1:

from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline

txt2img_pipe = FluxPipeline.from_pretrained(
    base_model_path,
    torch_dtype=dtype,
).to("cuda")

# Swap in a freshly loaded scheduler (same class and config as the default one).
txt2img_pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    base_model_path, subfolder="scheduler"
)

Option 2:

from diffusers import (
    AutoencoderKL,
    FlowMatchEulerDiscreteScheduler,
    FluxPipeline,
    FluxTransformer2DModel,
)
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast

transformer = FluxTransformer2DModel.from_pretrained(
    base_model_path,
    subfolder="transformer",
    torch_dtype=dtype,
)
transformer.to("cuda", dtype=dtype)
flush()

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(base_model_path, subfolder="scheduler")

print("Loading vae")
vae = AutoencoderKL.from_pretrained(base_model_path, subfolder="vae", torch_dtype=dtype)
vae.to("cuda", dtype=dtype)
flush()

print("Loading t5")
tokenizer_2 = T5TokenizerFast.from_pretrained(base_model_path, subfolder="tokenizer_2")
text_encoder_2 = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2", torch_dtype=dtype
)
text_encoder_2.to("cuda", dtype=dtype)
flush()

print("Loading clip")
text_encoder = CLIPTextModel.from_pretrained(base_model_path, subfolder="text_encoder", torch_dtype=dtype)
tokenizer = CLIPTokenizer.from_pretrained(base_model_path, subfolder="tokenizer")
text_encoder.to("cuda", dtype=dtype)

# Build the pipeline with placeholders, then attach the large modules afterwards.
txt2img_pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=None,
).to("cuda")
txt2img_pipe.text_encoder_2 = text_encoder_2
txt2img_pipe.transformer = transformer

The inference code:

generator.manual_seed(seed)
num_inference_steps = 28
max_sequence_length = 256
num_outputs = 1
guidance_scale = 3.5

prompt = "Pink Tweed Crystal Embellished Cropped Jacket with Point Collar, Sparkling Buttoned Placket, and Chest Pockets"

args = {
  "prompt": [f"a product photo of {prompt} on even background"] * num_outputs,
  "guidance_scale": guidance_scale,
  "generator": generator,
  "num_inference_steps": num_inference_steps,
  "max_sequence_length": max_sequence_length,
  "output_type": "pil",
  "height": 1024,
  "width": 1024,
}
output = txt2img_pipe(**args)
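
For completeness, a quick sketch (not part of the original report) of how one might quantify the mismatch between the two options, assuming the first image from each run was kept as the hypothetical image_opt1 and image_opt2:

import numpy as np

def max_pixel_diff(img_a, img_b):
    # Channel-wise absolute difference between two equally sized PIL images.
    a = np.asarray(img_a, dtype=np.int16)
    b = np.asarray(img_b, dtype=np.int16)
    return int(np.abs(a - b).max())

# image_opt1 / image_opt2 are hypothetical: output.images[0] from each option.
print(max_pixel_diff(image_opt1, image_opt2))  # 0 would mean bit-identical outputs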

[Attached images: outputs from the two loading paths ("toolkit" vs. "partial") for two prompts, same seed, visibly different.]

Logs

No response

System Info

diffusers == 0.31.0.dev0

Who can help?

No response

zetyquickly commented 2 weeks ago

I observe this as well. Different initialization paths lead to slightly different results, even with the same seed.

Could anyone suggest the proper way to instantiate the Flux model, especially when we load a LoRA?

asomoza commented 2 weeks ago

Hi, maybe it's because of the T5: it has a problem when its weights are cast, related to #8604.

Maybe you can try an experiment with an empty T5 prompt and compare the results, as in the sketch below.
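
A minimal sketch of that experiment, reusing the variables from the reproduction above and assuming FluxPipeline's prompt_2 argument feeds the T5 encoder. Note that an empty string would likely fall back to prompt inside encode_prompt, so a single space is used instead:

generator.manual_seed(seed)
output = txt2img_pipe(
    prompt=args["prompt"],         # CLIP prompt unchanged
    prompt_2=[" "] * num_outputs,  # near-empty T5 prompt (see note above)
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    generator=generator,
)
# If the two loading options now agree, the T5 casting is the likely culprit.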

The proper way to load the model is with from_pretrained. Loading the modules separately is for training or for more specific needs where you have to handle them individually.
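
For the LoRA question above, a minimal sketch of that recommended path (the checkpoint filename is a placeholder):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
).to("cuda")
# load_lora_weights is the supported entry point for Flux LoRAs;
# "my_flux_lora.safetensors" is a hypothetical path.
pipe.load_lora_weights("my_flux_lora.safetensors")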

yiyixuxu commented 1 week ago

@asomoza yes indeed! Especially if you notice that option 2 is consistently better than option 1.

emil-malina commented 1 week ago

@asomoza @yiyixuxu Will this be resolved if #8604 is resolved?

ScottishFold007 commented 1 week ago

However, when using Flux, there is always a 77-token length limit.

asomoza commented 1 week ago

> Will this be resolved if https://github.com/huggingface/diffusers/issues/8604 is resolved?

Yes, but it's not a diffusers problem; T5 is a transformers model, so the fix should come from there. The complication is that the casting issue doesn't seem to affect text inference.

So the best fix right now is to load the T5 using the officially recommended method.

> However, when using Flux, there is always a 77-token length limit.

This isn't related to this issue. The limitation comes from the CLIP model, not the T5, and all the training and the official code work like this. You can use sd_embed if you want to circumvent the limit; see the sketch below.
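
For reference, sd_embed is typically wired in roughly like this; the helper name below is taken from its README, so double-check it against the current sd_embed API before relying on it:

from sd_embed.embedding_funcs import get_weighted_text_embeddings_flux1

# The long prompt bypasses the 77-token CLIP tokenizer path via precomputed embeddings.
prompt_embeds, pooled_prompt_embeds = get_weighted_text_embeddings_flux1(
    pipe=txt2img_pipe, prompt=long_prompt  # long_prompt is a placeholder string
)
image = txt2img_pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=28,
    generator=generator,
).images[0]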

emil-malina commented 1 week ago

Got it. Let me put it here.

When loading a Flux (or SD3, PixArt) pipeline piece by piece in torch.float16, do it this way:

Set torch_dtype at loading:

base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16
text_encoder = T5EncoderModel.from_pretrained(base_model_path, ..., torch_dtype=dtype)

Instead of casting the model to a dtype afterwards:

base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16
text_encoder = T5EncoderModel.from_pretrained(base_model_path)
text_encoder.to(dtype=dtype)
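
One way to see where the two paths diverge, a sketch assuming the root cause is transformers' keep-in-fp32 handling of T5's wo projections (per #8604); the module path below is the standard T5 layout:

import torch
from transformers import T5EncoderModel

base_model_path = "black-forest-labs/FLUX.1-dev"

loaded = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2", torch_dtype=torch.float16
)
# Loads a second full copy in float32 first, so this needs plenty of RAM.
cast = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2"
).to(dtype=torch.float16)

# With torch_dtype set, transformers keeps the wo projections in float32;
# a blanket .to() casts them to float16, so the printed dtypes should differ.
wo_dtype = lambda m: m.encoder.block[0].layer[1].DenseReluDense.wo.weight.dtype
print(wo_dtype(loaded), wo_dtype(cast))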

emil-malina commented 1 week ago

I assume that will produce the same images; I need to validate that.

emil-malina commented 1 week ago

Confirming that. Partial loading and direct loading produce similar results if the pipeline is loaded as described above.

[Attached images: Partial Loading vs. Direct Loading, matching outputs.]