[AAAI 2024] Follow-Your-Pose: This repo is the official implementation of "Follow-Your-Pose : Pose-Guided Text-to-Video Generation using Pose-Free Videos"
As the authors mention in the Abstract: "In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks."
Am I right to understand that the cross-frame attention mentioned in your paper is the SparseCausalAttention class in your open-sourced code, which is the same as the SparseCausalAttention class written in Tune-A-Video? If so, how is the cross-frame attention reformed in your project? Which part of the code embodies it?
Yes, it is the same as the SparseCausalAttention class written in Tune-A-Video.
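For readers unfamiliar with it, the core idea of Tune-A-Video's SparseCausalAttention is that each frame's queries attend only to the keys/values of the first frame and the previous frame. The sketch below is a simplified, projection-free NumPy illustration of that attention pattern (no learned Q/K/V weights, no multi-head split), not the repo's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(x):
    """Sparse-causal attention over a video feature tensor.

    x: (frames, tokens, dim). For frame i, the key/value set is the
    concatenation of the tokens of frame 0 and frame i-1 (frame 0
    simply attends to its own tokens twice).
    """
    f, n, d = x.shape
    out = np.empty_like(x)
    for i in range(f):
        prev = max(i - 1, 0)
        kv = np.concatenate([x[0], x[prev]], axis=0)   # (2n, d)
        scores = x[i] @ kv.T / np.sqrt(d)              # (n, 2n)
        out[i] = softmax(scores, axis=-1) @ kv         # (n, d)
    return out
```

In the real attention blocks the same frame-0/frame-(i-1) key/value concatenation happens after the learned projections, which is what keeps appearance consistent across frames.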
We finetune the SCA on HD-VILA and add LoRA to keep consistency.
As for the code, you can find it here.
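As an aside, "adding LoRA" generally means wrapping a frozen linear layer with a trainable low-rank update. The following is a generic LoRA sketch under that assumption (the class name, rank, and scaling are illustrative, not taken from this repo):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Output: x @ W.T + (alpha / rank) * (x @ A.T) @ B.T
    B is zero-initialized, so training starts from the frozen layer's
    exact behavior and only gradually adds the low-rank correction.
    """
    def __init__(self, w, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                       # frozen (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, w.shape[1]))  # trainable
        self.b = np.zeros((w.shape[0], rank))               # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.w.T + (x @ self.a.T) @ self.b.T * self.scale
```

Because B starts at zero, inserting LoRA into the finetuned SCA blocks does not perturb the pretrained behavior at initialization, which is one common way it helps preserve consistency during adaptation.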