huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

SD3 SD3Transformer2DModel issues with setting cross-attention #8539

Closed: vladmandic closed this issue 3 months ago

vladmandic commented 3 months ago

Describe the bug

Calling set_attn_processor(attention) with AttnProcessor or AttnProcessor2_0
on SD3Transformer2DModel executes without issues, but results in a runtime error during inference:

Reproduction

def set_diffusers_attention(pipe):
    def set_attn(pipe, attention):
        if attention is None:
            return
        if not hasattr(pipe, "_get_signature_keys"):
            return
        module_names, _ = pipe._get_signature_keys(pipe) # pylint: disable=protected-access
        modules = [getattr(pipe, n, None) for n in module_names]
        modules = [m for m in modules if isinstance(m, torch.nn.Module) and hasattr(m, "set_attn_processor")]
        for module in modules:
            # if 'SD3Transformer2DModel' in module.__class__.__name__: # TODO Skip SD3 DiT
            #    continue
            module.set_attn_processor(attention)

    if shared.opts.cross_attention_optimization == "Disabled":
        pass # do nothing
    elif shared.opts.cross_attention_optimization == "Scaled-Dot-Product": # The default set by Diffusers
        from diffusers.models.attention_processor import AttnProcessor2_0
        set_attn(pipe, AttnProcessor2_0())
    elif shared.opts.cross_attention_optimization == "xFormers" and hasattr(pipe, 'enable_xformers_memory_efficient_attention'):
        pipe.enable_xformers_memory_efficient_attention()
    elif shared.opts.cross_attention_optimization == "Split attention" and hasattr(pipe, "enable_attention_slicing"):
        pipe.enable_attention_slicing()
    elif shared.opts.cross_attention_optimization == "Batch matrix-matrix":
        from diffusers.models.attention_processor import AttnProcessor
        set_attn(pipe, AttnProcessor())
    elif shared.opts.cross_attention_optimization == "Dynamic Attention BMM":
        from modules.sd_hijack_dynamic_atten import DynamicAttnProcessorBMM
        set_attn(pipe, DynamicAttnProcessorBMM())
    elif shared.opts.cross_attention_optimization == "Dynamic Attention SDP":
        from modules.sd_hijack_dynamic_atten import DynamicAttnProcessorSDP
        set_attn(pipe, DynamicAttnProcessorSDP())

    pipe.current_attn_name = shared.opts.cross_attention_optimization

Logs

/home/vlado/dev/sdnext/venv/lib/python3.12/site-packages/diffusers/models/attention.py:196 in forward
❱ 196 encoder_hidden_states = encoder_hidden_states + context_attn_output
RuntimeError: The size of tensor a (154) must match the size of tensor b (1024) at non-singleton dimension 1

System Info

ubuntu 24.04
diffusers==0.29.0
torch==2.3.1
cuda==12.1

Who can help?

@yiyixuxu @sayakpaul @DN6

sayakpaul commented 3 months ago

I don’t think this is a fully reproducible code snippet, to be honest. We don’t know where shared comes from.

Also, I think you would want to use the JointAttnProcessor2_0 when using MMDiT.
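For reference, a minimal sketch of that suggestion (assumes diffusers 0.29 and the stabilityai/stable-diffusion-3-medium-diffusers checkpoint; the pipeline setup here is illustrative, not taken from the issue):

import torch
from diffusers import StableDiffusion3Pipeline
from diffusers.models.attention_processor import JointAttnProcessor2_0

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# SD3's MMDiT attention runs over the image and text streams jointly, so the
# transformer expects the joint processor rather than plain AttnProcessor2_0
pipe.transformer.set_attn_processor(JointAttnProcessor2_0())

image = pipe("a photo of a cat", num_inference_steps=28).images[0]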

vladmandic commented 3 months ago

Ah, I missed that it should use JointAttnProcessor2_0 - thanks - closing.

Regarding reproducibility, shared.opts.cross_attention_optimization is just a string, so it's easy to set the desired attention mechanism and test different ones.
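For anyone hitting the same error, a hypothetical adaptation of the set_attn helper above (not from the issue) that maps SD3Transformer2DModel to JointAttnProcessor2_0 instead of skipping it, while still applying the requested processor everywhere else:

import torch
from diffusers.models.attention_processor import JointAttnProcessor2_0

def set_attn(pipe, attention):
    if attention is None or not hasattr(pipe, "_get_signature_keys"):
        return
    module_names, _ = pipe._get_signature_keys(pipe)  # pylint: disable=protected-access
    modules = [getattr(pipe, n, None) for n in module_names]
    modules = [m for m in modules if isinstance(m, torch.nn.Module) and hasattr(m, "set_attn_processor")]
    for module in modules:
        if "SD3Transformer2DModel" in module.__class__.__name__:
            # SD3's MMDiT blocks attend over image and text tokens jointly,
            # so they need the joint processor rather than AttnProcessor2_0
            module.set_attn_processor(JointAttnProcessor2_0())
        else:
            module.set_attn_processor(attention)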