guoqincode / Open-AnimateAnyone

Unofficial Implementation of Animate Anyone

Why are ReferenceNet features used to query in reader mode? #72

Open Sireer opened 8 months ago

Sireer commented 8 months ago
```python
hidden_states_uc = self.attn1(
    modify_norm_hidden_states,
    encoder_hidden_states=modify_norm_hidden_states,
    attention_mask=attention_mask,
)[:, :hidden_states.shape[-2], :] + hidden_states

# hidden_states_uc = self.attn1(
#     norm_hidden_states,
#     encoder_hidden_states=torch.cat([norm_hidden_states] + self.bank, dim=1),
#     attention_mask=attention_mask,
# ) + hidden_states
```

Why is the code from MagicAnimate commented out? Is there a problem with it?

guoqincode commented 8 months ago

Hi, you can look at the description of spatial attention in the paper; we just need to take the first half of the output.

Sireer commented 8 months ago

Yes, but we do not need to compute the second half at all. The MagicAnimate code gets the first half directly, without ever computing the second half:

```python
hidden_states_uc = self.attn1(
    norm_hidden_states,
    encoder_hidden_states=torch.cat([norm_hidden_states] + self.bank, dim=1),
    attention_mask=attention_mask,
) + hidden_states
```
luyvlei commented 8 months ago

Indeed, the MagicAnimate implementation is equivalent to the "concat then split" operation in AnimateAnyone. The query does not need to be concatenated: concatenating the query and then taking the first half of the output just adds computational overhead. (This affects neither training nor inference, since attention output rows depend only on the corresponding query rows.)

For already-trained models, inference with the MagicAnimate-style code also works well.
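The equivalence claimed above can be checked with a small NumPy sketch. Plain scaled dot-product attention stands in for `self.attn1`, and the toy shapes, the `attention` helper, and the variable names (`h` for the UNet hidden states, `ref` for the ReferenceNet "bank") are illustrative assumptions, not the repo's actual code:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention with a row-wise softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
L, d = 4, 8                          # toy sequence length and feature dim
h = rng.standard_normal((L, d))      # denoising-UNet hidden states
ref = rng.standard_normal((L, d))    # ReferenceNet features (the "bank")

# Keys/values are the concatenation along the sequence axis in both styles.
kv = np.concatenate([h, ref], axis=0)

# AnimateAnyone style: concatenate the query too, then keep the first half.
out_concat = attention(np.concatenate([h, ref], axis=0), kv, kv)[:L]

# MagicAnimate style: query with h only; nothing to discard afterwards.
out_query = attention(h, kv, kv)

# Each output row depends only on its own query row, so the results match.
assert np.allclose(out_concat, out_query)
```

Because softmax attention is applied independently per query row, slicing the query before attention or slicing the output after attention gives identical results; the second form simply skips the wasted rows.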