Sireer opened this issue 10 months ago
Hi, you can look at the description of spatial attention in the paper; we just need to take the first half.
Yes, but we do not need to compute the second half. It seems that the code from Magic Animate gets the first half without ever computing the second half:
```python
hidden_states_uc = self.attn1(
    norm_hidden_states,
    encoder_hidden_states=torch.cat([norm_hidden_states] + self.bank, dim=1),
    attention_mask=attention_mask,
) + hidden_states
```
Indeed, the Magic Animate implementation is equivalent to the "concat then split" operation in AnimateAnyone. The query does not need to be concatenated: because each query row attends over the same set of keys and values independently of the other query rows, concatenating the query and then keeping only the first half of the output just adds computational overhead. (It affects neither training nor inference.)
For trained models, inference with the Magic Animate style code also works well.
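The equivalence can be checked numerically with a minimal sketch. This is not the repos' actual code: it skips the q/k/v projections and treats the tensors as already-projected single-head inputs, with `x` standing in for `norm_hidden_states` and `ref` for the tokens stored in `self.bank`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, D = 2, 4, 8           # batch, sequence length, head dim
x = torch.randn(B, L, D)    # current-frame tokens (norm_hidden_states)
ref = torch.randn(B, L, D)  # reference tokens (the "bank")

# Keys/values are the concatenation of current and reference tokens in both styles.
kv = torch.cat([x, ref], dim=1)

# "Concat then split" style: concatenate the query too, attend, keep the first half.
out_concat = F.scaled_dot_product_attention(kv, kv, kv)[:, :L]

# Magic Animate style: query comes only from x, so no split is needed afterwards.
out_direct = F.scaled_dot_product_attention(x, kv, kv)

# The first half of the concatenated output matches the direct output.
print(torch.allclose(out_concat, out_direct, atol=1e-6))
```

The second call simply skips the attention rows that the first call computes and then discards, which is where the saved computation comes from.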
Why is the code from Magic Animate commented out? Is there any problem with it?