showlab / Tune-A-Video

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
https://tuneavideo.github.io
Apache License 2.0

Question about fine-tuning. #59

Closed RuoyuFeng closed 1 year ago

RuoyuFeng commented 1 year ago

Hi, thanks for your great work! I'm confused about whether the model needs to be fine-tuned for each input video. What is the purpose of fine-tuning? Is it not enough to simply conduct cross-attention between the current frame and previous frames to keep the consistency?

zhangjiewu commented 1 year ago

the purpose of fine-tuning is to improve video consistency. simply applying cross-frame attention can result in poor temporal consistency (e.g., missing frames), as shown in the comparison below.

*(Embedded comparison videos: Input Video · w/o finetuning · Tune-A-Video)*

- "A jeep car is moving on the road" → "A jeep car is moving on the beach"
- "A rabbit is eating a watermelon on the table" → "A cat with sunglasses is eating a watermelon on the beach"
RuoyuFeng commented 1 year ago

> the purpose of fine-tuning is to improve video consistency. simply applying cross-frame attention can result in poor temporal consistency (e.g., missing frames), as shown in the comparison below.


Thank you so much for your patient reply and the visualization! Another question: do we need to fine-tune for each input/reference video, or can we just fine-tune the model on some videos and then have it generate different videos?

zhangjiewu commented 1 year ago

in our setting, we fine-tune the pretrained T2I model for each video. however, the approach itself could be applied to other scenarios, for example, few-shot videos as you mentioned. feel free to explore more :)
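
to be concrete, the one-shot tuning only updates a small part of the inflated UNet: the query projections in the attention blocks and the newly added temporal attention layers, while everything else stays frozen. a minimal sketch of selecting those parameters (the module-name patterns follow diffusers-style naming and are assumptions, not a verbatim copy of our training config):

```python
import torch

def select_trainable_params(unet, trainable_keys=("attn1.to_q", "attn2.to_q", "attn_temp")):
    """Freeze the whole UNet, then unfreeze only the query projections and temporal attention."""
    unet.requires_grad_(False)
    params = []
    for name, param in unet.named_parameters():
        if any(key in name for key in trainable_keys):
            param.requires_grad = True
            params.append(param)
    return params

# hypothetical usage: tune only the selected parameters on the single input video
# params = select_trainable_params(unet)
# optimizer = torch.optim.AdamW(params, lr=3e-5)
```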

RuoyuFeng commented 1 year ago

> in our setting, we fine-tune the pretrained T2I model for each video. however, the approach itself could be applied to other scenarios, for example, few-shot videos as you mentioned. feel free to explore more :)

Thank you for taking the time to answer my question. Your response was truly insightful and provided me with valuable information. Your work is greatly appreciated!

RuoyuFeng commented 1 year ago

Hello again. I have read your paper carefully once more; it is great work and has given me lots of insights. I have another question about one detail of your framework design.

In Figure 5 of the paper, why is it necessary to conduct cross-attention on the first frame and the previous frame? Why not just use the previous two frames (i.e., v_{i-1} and v_{i-2} instead of v_{i-1} and v_1)? What is the purpose of using the first frame as a condition here? Is it to keep global consistency so that the problem of error propagation is avoided?

Thanks again for your great work and for patiently answering my questions!

zhangjiewu commented 1 year ago

thank you for acknowledging our work! i'm delighted to hear that it has inspired you.

yes, we use the first frame as an anchor for global consistency. using the former two frames without the first frame would result in error propagation (visual artifacts).
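
to make the mechanism concrete: for frame v_i, the queries come from v_i itself, while the keys and values are built from the first frame v_1 and the previous frame v_{i-1}. a simplified single-head sketch (the tensor layout and projection names are assumptions for illustration, not the exact implementation):

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(q_proj, k_proj, v_proj, hidden_states, video_length):
    # hidden_states: (batch * video_length, seq_len, dim), one block of tokens per frame
    bf, seq_len, dim = hidden_states.shape
    batch = bf // video_length
    x = hidden_states.reshape(batch, video_length, seq_len, dim)

    first = x[:, [0] * video_length]                   # v_1, repeated for every frame
    prev = x[:, [0] + list(range(video_length - 1))]   # v_{i-1} (frame 0 falls back to itself)
    kv = torch.cat([first, prev], dim=2)               # concatenate along the token axis

    # queries from the current frame, keys/values from [v_1, v_{i-1}]
    q = q_proj(x).reshape(batch * video_length, seq_len, dim)
    k = k_proj(kv).reshape(batch * video_length, 2 * seq_len, dim)
    v = v_proj(kv).reshape(batch * video_length, 2 * seq_len, dim)
    return F.scaled_dot_product_attention(q, k, v)     # (batch * video_length, seq_len, dim)
```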

RuoyuFeng commented 1 year ago

> thank you for acknowledging our work! i'm delighted to hear that it has inspired you.
>
> yes, we use the first frame as an anchor for global consistency. using the former two frames without the first frame would result in error propagation (visual artifacts).

Thank you so much for your detailed and insightful reply. Your expertise has been invaluable and I appreciate your willingness to share your knowledge with me!