Sireer opened this issue 10 months ago
Hi, you can look at the description of spatial attention in the paper; we just need to take the first half.
Yes, but we do not need to compute the second half. It seems that the code from Magic Animate gets the first half without ever computing the second half:
```python
hidden_states_uc = self.attn1(
    norm_hidden_states,
    encoder_hidden_states=torch.cat([norm_hidden_states] + self.bank, dim=1),
    attention_mask=attention_mask,
) + hidden_states
```
Indeed, the Magic Animate implementation is equivalent to the "concat then split" operation in AnimateAnyone. The query does not need to be concatenated: because each query row attends over the same set of keys and values independently of the other query rows, concatenating the query and then keeping only the first half of the output just adds computational overhead. (It affects neither training nor inference.)
For trained models, inference with the Magic Animate style code also works well.
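The equivalence can be checked numerically with a minimal sketch. This is not the repos' actual code: it skips the q/k/v projections and treats the tensors as already-projected single-head inputs, with `x` standing in for `norm_hidden_states` and `ref` for the tokens stored in `self.bank`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, D = 2, 4, 8           # batch, sequence length, head dim
x = torch.randn(B, L, D)    # current-frame tokens (norm_hidden_states)
ref = torch.randn(B, L, D)  # reference tokens (the "bank")

# Keys/values are the concatenation of current and reference tokens in both styles.
kv = torch.cat([x, ref], dim=1)

# "Concat then split" style: concatenate the query too, attend, keep the first half.
out_concat = F.scaled_dot_product_attention(kv, kv, kv)[:, :L]

# Magic Animate style: query comes only from x, so no split is needed afterwards.
out_direct = F.scaled_dot_product_attention(x, kv, kv)

# The first half of the concatenated output matches the direct output.
print(torch.allclose(out_concat, out_direct, atol=1e-6))
```

The second call simply skips the attention rows that the first call computes and then discards, which is where the saved computation comes from.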
Why is the code from Magic Animate commented out? Is there any problem with it?