Hi, thank you for your interest in our work! We add forward hooks to all cross-attention layers in the unet_store_cross_attention_scores function (https://github.com/mit-han-lab/fastcomposer/blob/main/fastcomposer/model.py#L280), so that cross_attention_scores is filled in after each forward pass.
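In case it helps, here is a minimal sketch of what such a hook-based mechanism can look like. It assumes a diffusers-style Attention module (with to_q, to_k, head_to_batch_dim, and scale) and PyTorch 2.x forward hooks with with_kwargs=True; the function and variable names are illustrative, not the repo's exact code.

```python
import torch

def store_cross_attention_scores(unet, cross_attention_scores):
    # Register a forward hook on every cross-attention module ("attn2") that
    # recomputes the softmax attention probabilities from the module's own
    # query/key projections and writes them into the shared dict.
    def make_hook(name):
        def hook(module, args, kwargs, output):
            hidden_states = args[0] if args else kwargs["hidden_states"]
            encoder_hidden_states = kwargs.get("encoder_hidden_states")
            if encoder_hidden_states is None:
                return  # not a cross-attention call
            q = module.head_to_batch_dim(module.to_q(hidden_states))
            k = module.head_to_batch_dim(module.to_k(encoder_hidden_states))
            # [batch * heads, latent_len, num_tokens]
            probs = (q @ k.transpose(-1, -2) * module.scale).softmax(dim=-1)
            cross_attention_scores[name] = probs
        return hook

    for name, module in unet.named_modules():
        if name.endswith("attn2"):  # cross-attention layers only
            module.register_forward_hook(make_hook(name), with_kwargs=True)
    return unet
```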
Best, Guangxuan
Got it! Thank you.
Sorry to bother you again. I observed that the shape of the cross-attention map in each layer is [batch * head, latent_shape, num_token], so how can we obtain the attention map shown in Figure 4?
Thank you for your interest. Following P2P (Prompt-to-Prompt), we average the attention maps across attention heads and denoising time steps, and we use the layer up_blocks.1.attentions.0.transformer_blocks.0.attn2, as it tends to have the highest correlation with the semantics of the image.
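For what it's worth, a minimal post-processing sketch under those assumptions: one stored map per denoising step from that layer, shaped [batch * heads, latent_len, num_tokens]. The helper name and the square-grid assumption are mine, not the repo's.

```python
import torch

def average_attention_maps(maps_per_step, batch_size, num_heads):
    # maps_per_step: list of [batch * heads, latent_len, num_tokens] tensors,
    # one per denoising step, taken from
    # up_blocks.1.attentions.0.transformer_blocks.0.attn2.
    stacked = torch.stack(maps_per_step)                   # [T, B*H, L, N]
    T, _, L, N = stacked.shape
    stacked = stacked.view(T, batch_size, num_heads, L, N)
    averaged = stacked.mean(dim=(0, 2))                    # over steps and heads -> [B, L, N]
    res = int(L ** 0.5)                                    # e.g. 16x16 at this layer for 512x512 inputs
    return averaged.permute(0, 2, 1).reshape(batch_size, N, res, res)  # one res x res map per token
```

Each res x res slice can then be upsampled and overlaid on the image to get a visualization like Figure 4.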
Understood! Thanks.
I call unet.up_blocks[1].attentions[0].transformer_blocks[0].attn2(hidden_states=text_embeds), where text_embeds are obtained from the CLIP model and have shape [batch_size, n, 768], e.g. hidden_states has shape torch.Size([1, 10, 768]).
This gives the error "RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x768 and 1280x1280)".
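In case it helps narrow this down: assuming the standard diffusers call convention (not anything FastComposer-specific), attn2 expects the UNet's spatial features as hidden_states (1280 channels at up_blocks.1, hence the 1280x1280 projection in the error) and the CLIP text embeddings as encoder_hidden_states. A hypothetical sketch; the model id and tensor shapes below are assumptions:

```python
import torch
from diffusers import UNet2DConditionModel

# Assumed SD 1.5 checkpoint; FastComposer builds on SD 1.5.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
attn2 = unet.up_blocks[1].attentions[0].transformer_blocks[0].attn2

latent_feats = torch.randn(1, 16 * 16, 1280)  # spatial features at this block: 16x16 grid, 1280 channels
text_embeds = torch.randn(1, 10, 768)         # CLIP text embeddings, as in the question

out = attn2(latent_feats, encoder_hidden_states=text_embeds)  # works, -> [1, 256, 1280]
# attn2(text_embeds) would feed a 768-dim input into to_q (a 1280 -> 1280
# projection), giving "mat1 and mat2 shapes cannot be multiplied
# (10x768 and 1280x1280)".
```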
Thank you for your great work! I find that cross_attention_scores is first initialized as {} in model.py but is never assigned afterwards, which results in an error when computing the object localization loss. Can you tell me how to deal with that?