huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Inconsistent results from Flux Model when loaded differently #9439

Open emil-malina opened 2 weeks ago

emil-malina commented 2 weeks ago

Describe the bug

I've observed strange behavior when loading the Flux.1-dev model. There are two ways to load the model that produce different results when run with the same seed: one comes from the HF diffusers docs, the other is inspired by the ai-toolkit repo.

Reproduction

First option: use from_pretrained on FluxPipeline. Second option: load the pipeline piece by piece.

Common initialization:

import gc

import torch

seed = 139
generator = torch.Generator(device="cpu")
base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16

def flush():
    # Release Python references first, then return cached CUDA memory.
    gc.collect()
    torch.cuda.empty_cache()

Option 1:

from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline

txt2img_pipe = FluxPipeline.from_pretrained(
    base_model_path,
    torch_dtype=dtype,
).to("cuda")

# Swap in a freshly loaded scheduler (same class and config as the default one).
txt2img_pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    base_model_path, subfolder="scheduler"
)

Option 2:

from diffusers import (
    AutoencoderKL,
    FlowMatchEulerDiscreteScheduler,
    FluxPipeline,
    FluxTransformer2DModel,
)
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast

transformer = FluxTransformer2DModel.from_pretrained(
    base_model_path,
    subfolder="transformer",
    torch_dtype=dtype,
)
transformer.to("cuda", dtype=dtype)
flush()

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(base_model_path, subfolder="scheduler")

print("Loading vae")
vae = AutoencoderKL.from_pretrained(base_model_path, subfolder="vae", torch_dtype=dtype)
vae.to("cuda", dtype=dtype)
flush()

print("Loading t5")
tokenizer_2 = T5TokenizerFast.from_pretrained(base_model_path, subfolder="tokenizer_2")
text_encoder_2 = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2", torch_dtype=dtype
)
text_encoder_2.to("cuda", dtype=dtype)
flush()

print("Loading clip")
text_encoder = CLIPTextModel.from_pretrained(base_model_path, subfolder="text_encoder", torch_dtype=dtype)
tokenizer = CLIPTokenizer.from_pretrained(base_model_path, subfolder="tokenizer")
text_encoder.to("cuda", dtype=dtype)

# Build the pipeline with placeholders, then attach the large modules afterwards.
txt2img_pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=None,
).to("cuda")
txt2img_pipe.text_encoder_2 = text_encoder_2
txt2img_pipe.transformer = transformer

The inference code:

generator.manual_seed(seed)
num_inference_steps = 28
max_sequence_length = 256
num_outputs = 1
guidance_scale = 3.5

prompt = "Pink Tweed Crystal Embellished Cropped Jacket with Point Collar, Sparkling Buttoned Placket, and Chest Pockets"

args = {
  "prompt": [f"a product photo of {prompt} on even background"] * num_outputs,
  "guidance_scale": guidance_scale,
  "generator": generator,
  "num_inference_steps": num_inference_steps,
  "max_sequence_length": max_sequence_length,
  "output_type": "pil",
  "height": 1024,
  "width": 1024,
}
output = txt2img_pipe(**args)
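
For completeness, a quick sketch (not part of the original report) of how one might quantify the mismatch between the two options, assuming the first image from each run was kept as the hypothetical image_opt1 and image_opt2:

import numpy as np

def max_pixel_diff(img_a, img_b):
    # Channel-wise absolute difference between two equally sized PIL images.
    a = np.asarray(img_a, dtype=np.int16)
    b = np.asarray(img_b, dtype=np.int16)
    return int(np.abs(a - b).max())

# image_opt1 / image_opt2 are hypothetical: output.images[0] from each option.
print(max_pixel_diff(image_opt1, image_opt2))  # 0 would mean bit-identical outputs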

[Attached images: outputs from the two loading paths ("toolkit" vs. "partial") for two prompts, same seed, visibly different.]

Logs

No response

System Info

diffusers == 0.31.0.dev0

Who can help?

No response

zetyquickly commented 2 weeks ago

I observe this as well. Different initialization paths lead to slightly different results, even with the same seed.

Could anyone suggest the proper way to instantiate the Flux model, especially when we load a LoRA?

asomoza commented 2 weeks ago

Hi, maybe it's because of the T5: it has a problem when its weights are cast, related to #8604.

Maybe you can try an experiment with an empty T5 prompt and compare the results, as in the sketch below.
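
A minimal sketch of that experiment, reusing the variables from the reproduction above and assuming FluxPipeline's prompt_2 argument feeds the T5 encoder. Note that an empty string would likely fall back to prompt inside encode_prompt, so a single space is used instead:

generator.manual_seed(seed)
output = txt2img_pipe(
    prompt=args["prompt"],         # CLIP prompt unchanged
    prompt_2=[" "] * num_outputs,  # near-empty T5 prompt (see note above)
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    generator=generator,
)
# If the two loading options now agree, the T5 casting is the likely culprit.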

The proper way to load the model is with from_pretrained. Loading the modules separately is for training or for more specific needs where you have to handle them individually.
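
For the LoRA question above, a minimal sketch of that recommended path (the checkpoint filename is a placeholder):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
).to("cuda")
# load_lora_weights is the supported entry point for Flux LoRAs;
# "my_flux_lora.safetensors" is a hypothetical path.
pipe.load_lora_weights("my_flux_lora.safetensors")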

yiyixuxu commented 1 week ago

@asomoza yes indeed! Especially if you notice that option 2 is consistently better than option 1.

emil-malina commented 1 week ago

@asomoza @yiyixuxu Will this be resolved if #8604 is resolved?

ScottishFold007 commented 1 week ago

However, when using Flux, there is always a 77-token length limit.

asomoza commented 1 week ago

> Will this be resolved if https://github.com/huggingface/diffusers/issues/8604 is resolved?

Yes, but it's not a diffusers problem; T5 is a transformers model, so the fix should come from there. The complication is that the casting issue doesn't seem to affect text inference.

So the best fix right now is to load the T5 using the officially recommended method.

> However, when using Flux, there is always a 77-token length limit.

This isn't related to this issue. The limitation comes from the CLIP model, not the T5, and all the training and the official code work like this. You can use sd_embed if you want to circumvent the limit; see the sketch below.
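
For reference, sd_embed is typically wired in roughly like this; the helper name below is taken from its README, so double-check it against the current sd_embed API before relying on it:

from sd_embed.embedding_funcs import get_weighted_text_embeddings_flux1

# The long prompt bypasses the 77-token CLIP tokenizer path via precomputed embeddings.
prompt_embeds, pooled_prompt_embeds = get_weighted_text_embeddings_flux1(
    pipe=txt2img_pipe, prompt=long_prompt  # long_prompt is a placeholder string
)
image = txt2img_pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=28,
    generator=generator,
).images[0]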

emil-malina commented 1 week ago

Got it. Let me put it here.

When loading a Flux (or SD3, PixArt) pipeline piece by piece in torch.float16, do it this way:

Set torch_dtype at loading:

base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16
text_encoder = T5EncoderModel.from_pretrained(base_model_path, ..., torch_dtype=dtype)

Instead of casting the model to a dtype afterwards:

base_model_path = "black-forest-labs/FLUX.1-dev"
dtype = torch.float16
text_encoder = T5EncoderModel.from_pretrained(base_model_path)
text_encoder.to(dtype=dtype)
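
One way to see where the two paths diverge, a sketch assuming the root cause is transformers' keep-in-fp32 handling of T5's wo projections (per #8604); the module path below is the standard T5 layout:

import torch
from transformers import T5EncoderModel

base_model_path = "black-forest-labs/FLUX.1-dev"

loaded = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2", torch_dtype=torch.float16
)
# Loads a second full copy in float32 first, so this needs plenty of RAM.
cast = T5EncoderModel.from_pretrained(
    base_model_path, subfolder="text_encoder_2"
).to(dtype=torch.float16)

# With torch_dtype set, transformers keeps the wo projections in float32;
# a blanket .to() casts them to float16, so the printed dtypes should differ.
wo_dtype = lambda m: m.encoder.block[0].layer[1].DenseReluDense.wo.weight.dtype
print(wo_dtype(loaded), wo_dtype(cast))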

emil-malina commented 1 week ago

I assume that will produce the same images; I need to validate that.

emil-malina commented 1 week ago

Confirming that. Partial loading and direct loading produce similar results if the pipeline is loaded as described above.

[Attached images: Partial Loading vs. Direct Loading, matching outputs.]