huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

There is no qk_norm in SD3Transformer2DModel. Is that right? #8906

Closed: heart-du closed this issue 1 week ago

heart-du commented 3 months ago

Describe the bug

There is no qk_norm in SD3Transformer2DModel. Is that right?

    self.attn = Attention(
        query_dim=dim,
        cross_attention_dim=None,
        added_kv_proj_dim=dim,
        dim_head=attention_head_dim // num_attention_heads,
        heads=num_attention_heads,
        out_dim=attention_head_dim,
        context_pre_only=context_pre_only,
        bias=True,
        processor=processor,
    )

Reproduction

1.

Logs

No response

System Info

29.2

Who can help?

dukunpeng

tolgacangoz commented 3 months ago

The paper says that an RMS-Norm on Q and K can be added to stabilize training runs. They also observed that the instability caused by not normalizing Q and K appeared not throughout the whole network but only in part of it, at the last transformer blocks. Maybe that is why it wasn't added here. AFAIU this might not be compulsory but optional; see the paper for details. Additionally, Phil Wang also chose qk_rmsnorm = False as the default. Cc: @DN6 @yiyixuxu
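
For reference, here is a minimal PyTorch sketch of what RMS-normalizing Q and K before the attention product looks like. This is only illustrative and not the diffusers implementation; the rms_norm helper and tensor shapes are my own assumptions, and the learnable scale is omitted for brevity:

    import torch
    import torch.nn.functional as F

    def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Normalize the last dimension by its root mean square (no learnable scale here).
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    def attention_with_qk_norm(q, k, v, use_qk_norm: bool = True):
        # q, k, v: (batch, heads, seq_len, dim_head)
        if use_qk_norm:
            # Normalizing Q and K bounds the attention logits, which is what the
            # paper reports as stabilizing (mixed-precision) training.
            q, k = rms_norm(q), rms_norm(k)
        return F.scaled_dot_product_attention(q, k, v)

    q = torch.randn(1, 4, 8, 16)
    k = torch.randn(1, 4, 8, 16)
    v = torch.randn(1, 4, 8, 16)
    print(attention_with_qk_norm(q, k, v).shape)  # torch.Size([1, 4, 8, 16])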

By the way, the code you shared is not from the latest version; see the current code in the repo.
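
If you want to check whether your installed diffusers version already exposes this, something like the following should work; note that the norm_q / norm_k attribute names are an assumption based on the current code on main, so adjust if your version differs:

    import inspect

    from diffusers.models.attention_processor import Attention

    # 1. Does the installed Attention class accept a qk_norm kwarg at all?
    has_qk_norm = "qk_norm" in inspect.signature(Attention.__init__).parameters
    print("Attention accepts qk_norm:", has_qk_norm)

    # 2. If so, passing qk_norm="rms_norm" should attach Q/K norm layers
    #    (attribute names norm_q / norm_k assumed from current main).
    if has_qk_norm:
        attn = Attention(query_dim=64, heads=4, dim_head=16, qk_norm="rms_norm")
        print(type(attn.norm_q).__name__, type(attn.norm_k).__name__)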

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

a-r-r-o-w commented 1 week ago

Thanks for addressing this, @tolgacangoz! I think this can now be marked as closed, but feel free to re-open, @heart-du, if something still needs addressing.