[AAAI 2024] Follow-Your-Pose: This repo is the official implementation of "Follow-Your-Pose : Pose-Guided Text-to-Video Generation using Pose-Free Videos"
As the authors mention in the Abstract: "In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks."
Am I right to understand that the cross-frame attention mentioned in your paper is the SparseCausalAttention class in your open-sourced code, which is the same as the SparseCausalAttention class written in Tune-A-Video? If so, how is the cross-frame attention reformed in your project? Which part of the code embodies it?
Yes, it is the same as the SparseCausalAttention class written in Tune-A-Video.
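For readers unfamiliar with it, the core idea of Tune-A-Video's SparseCausalAttention is that each frame's queries attend only to the keys/values of the first frame and the previous frame. The sketch below is a simplified, projection-free NumPy illustration of that attention pattern (no learned Q/K/V weights, no multi-head split), not the repo's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(x):
    """Sparse-causal attention over a video feature tensor.

    x: (frames, tokens, dim). For frame i, the key/value set is the
    concatenation of the tokens of frame 0 and frame i-1 (frame 0
    simply attends to its own tokens twice).
    """
    f, n, d = x.shape
    out = np.empty_like(x)
    for i in range(f):
        prev = max(i - 1, 0)
        kv = np.concatenate([x[0], x[prev]], axis=0)   # (2n, d)
        scores = x[i] @ kv.T / np.sqrt(d)              # (n, 2n)
        out[i] = softmax(scores, axis=-1) @ kv         # (n, d)
    return out
```

In the real attention blocks the same frame-0/frame-(i-1) key/value concatenation happens after the learned projections, which is what keeps appearance consistent across frames.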
We finetune the SCA on HD-VILA and add LoRA to keep consistency.
As for the code, you can find it here.
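As an aside, "adding LoRA" generally means wrapping a frozen linear layer with a trainable low-rank update. The following is a generic LoRA sketch under that assumption (the class name, rank, and scaling are illustrative, not taken from this repo):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Output: x @ W.T + (alpha / rank) * (x @ A.T) @ B.T
    B is zero-initialized, so training starts from the frozen layer's
    exact behavior and only gradually adds the low-rank correction.
    """
    def __init__(self, w, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                       # frozen (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, w.shape[1]))  # trainable
        self.b = np.zeros((w.shape[0], rank))               # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.w.T + (x @ self.a.T) @ self.b.T * self.scale
```

Because B starts at zero, inserting LoRA into the finetuned SCA blocks does not perturb the pretrained behavior at initialization, which is one common way it helps preserve consistency during adaptation.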