In the AnimateAnyone paper, attention1 is responsible for the spatial-attention operation, and encoder_hidden_states is fed into attention2. Is it correct to apply classifier-free guidance to attention2?
Or, for the image condition, is it only necessary to set the unconditional image_embeddings to 0 before they are input to the UNet?
https://github.com/guoqincode/AnimateAnyone-unofficial/blob/main/models/ReferenceNet_attention.py#L153
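To illustrate what I mean by the second option, here is a minimal sketch of classifier-free guidance where the unconditional branch simply zeroes the image embeddings before the UNet; the names `unet`, `latents`, `t`, `image_embeddings`, and `guidance_scale` are hypothetical placeholders, not names from the linked code:

```python
import torch

def cfg_denoise(unet, latents, t, image_embeddings, guidance_scale=3.5):
    # Unconditional branch: replace the CLIP image embedding with zeros.
    uncond_embeddings = torch.zeros_like(image_embeddings)

    # Batch the conditional and unconditional inputs together.
    latent_input = torch.cat([latents, latents], dim=0)
    embeddings = torch.cat([image_embeddings, uncond_embeddings], dim=0)

    # One UNet forward pass for both branches.
    noise_pred = unet(latent_input, t, encoder_hidden_states=embeddings).sample
    noise_cond, noise_uncond = noise_pred.chunk(2)

    # Standard classifier-free guidance combination.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

Is this the intended approach, or should the guidance instead be applied inside attention2?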