mit-han-lab / fastcomposer

[IJCV] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
https://fastcomposer.mit.edu
MIT License

Where to assign value for "cross_attention_scores" when calculating the object localization loss? #7

Closed Wuchuq closed 1 year ago

Wuchuq commented 1 year ago

Thank you for your great work! I find that cross_attention_scores is first initialized as {} in model.py but never assigned afterwards, which results in an error when calculating the object localization loss. Can you tell me how to deal with that?

Guangxuan-Xiao commented 1 year ago

Hi, thank you for your interest in our work! We add forward hooks to all cross-attention layers in the unet_store_cross_attention_scores function (https://github.com/mit-han-lab/fastcomposer/blob/main/fastcomposer/model.py#L280), so that the cross_attention_scores will be filled after each forward pass.
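For readers hitting the same confusion, here is a minimal sketch of that mechanism, assuming diffusers' Attention class and PyTorch 2.x forward hooks with with_kwargs=True; the function name and details are illustrative and differ from the repo's actual unet_store_cross_attention_scores:

```python
import torch
from diffusers.models.attention_processor import Attention

def store_cross_attention_scores(unet, attention_scores):
    # Hypothetical sketch, not the repo's exact implementation: register a
    # forward hook on every attn2 (cross-attention) module so `attention_scores`
    # is filled on each UNet forward pass.
    def make_hook(name):
        def hook(module, args, kwargs, output):
            hidden_states = args[0] if args else kwargs["hidden_states"]
            encoder_hidden_states = kwargs.get("encoder_hidden_states")
            if encoder_hidden_states is None:
                return  # called without text conditioning; nothing to record
            # Recompute the softmaxed attention probabilities from the layer's
            # own projections: shape [batch * heads, latent_len, num_tokens].
            query = module.head_to_batch_dim(module.to_q(hidden_states))
            key = module.head_to_batch_dim(module.to_k(encoder_hidden_states))
            attention_scores[name] = module.get_attention_scores(query, key).detach()
        return hook

    for name, module in unet.named_modules():
        if isinstance(module, Attention) and name.endswith("attn2"):
            module.register_forward_hook(make_hook(name), with_kwargs=True)
    return unet

# Usage: the dict starts empty and is populated after each UNet forward pass.
# cross_attention_scores = {}
# unet = store_cross_attention_scores(unet, cross_attention_scores)
# noise_pred = unet(latents, t, encoder_hidden_states=text_embeds).sample
# cross_attention_scores["up_blocks.1.attentions.0.transformer_blocks.0.attn2"]
```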

Best, Guangxuan

Wuchuq commented 1 year ago

Got it! Thank you.

Wuchuq commented 1 year ago

Sorry to bother you again. I observed that the shape of the cross-attention map in each layer is [batch * head, latent_shape, num_token]. How can we obtain the attention maps shown in Figure 4?

tianweiy commented 1 year ago

Thank you for your interest. We follow Prompt-to-Prompt (P2P) and average the attention maps across attention heads and denoising time steps, and we choose the layer up_blocks.1.attentions.0.transformer_blocks.0.attn2 because it tends to have the highest correlation with the semantics of the image.
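As an illustration of that recipe (a hedged sketch, not the repo's code): maps_per_step below is a hypothetical list holding, for each denoising step, the [batch * heads, latent_len, num_tokens] map stored for up_blocks.1.attentions.0.transformer_blocks.0.attn2.

```python
import torch

def average_attention_maps(maps_per_step, batch_size, num_heads):
    # Average over attention heads within each step, then over denoising steps,
    # and finally reshape the flattened latent axis back to a 2-D grid.
    per_step = []
    for attn in maps_per_step:  # each: [batch * heads, latent_len, num_tokens]
        attn = attn.reshape(batch_size, num_heads, *attn.shape[1:]).mean(dim=1)
        per_step.append(attn)
    attn = torch.stack(per_step).mean(dim=0)   # [batch, latent_len, num_tokens]
    h = w = int(attn.shape[1] ** 0.5)          # e.g. a 16 x 16 grid at this layer
    return attn.reshape(batch_size, h, w, -1)  # [batch, H, W, num_tokens]

# Each [:, :, :, token_idx] slice can then be upsampled and overlaid on the
# generated image to produce per-token heatmaps like those in Figure 4.
```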

Wuchuq commented 1 year ago

Understood! Thanks.

Pratyushk2003 commented 6 months ago

I am calling unet.up_blocks[1].attentions[0].transformer_blocks[0].attn2(hidden_states=text_embeds), where text_embeds are obtained from the CLIP model and have shape [batch_size, n, 768], e.g. hidden_states has shape torch.Size([1, 10, 768]).

This gives the error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x768 and 1280x1280).
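A likely explanation, offered here only as an assumption rather than a maintainer reply: in diffusers' cross-attention, to_q at up_blocks.1 is a 1280 x 1280 projection applied to hidden_states, so hidden_states must be the UNet's image features; the 768-dim CLIP text embeddings are passed as encoder_hidden_states and only go through to_k / to_v. A minimal sketch of the expected call:

```python
import torch
from diffusers import UNet2DConditionModel

# Hedged sketch with an SD v1.x UNet (the model id is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
attn2 = unet.up_blocks[1].attentions[0].transformer_blocks[0].attn2

# to_q is Linear(1280, 1280): hidden_states must be 1280-dim image features.
# The 768-dim text embeddings go in as encoder_hidden_states (to_k / to_v).
image_features = torch.randn(1, 256, 1280)  # [batch, latent_len, inner_dim]
text_embeds = torch.randn(1, 10, 768)       # [batch, num_tokens, clip_dim]

out = attn2(hidden_states=image_features, encoder_hidden_states=text_embeds)
print(out.shape)  # torch.Size([1, 256, 1280])
```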