hotshotco / Hotshot-XL

✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
https://hotshot.co
Apache License 2.0

Difference between AnimateDiff and Hotshot #13

Closed MaxLeung99 closed 11 months ago

MaxLeung99 commented 11 months ago

Hi, thanks for sharing this great work. I just wonder what the difference is between AnimateDiff and Hotshot. Both models train a temporal attention module and freeze the original SD. It seems that the training pipeline shares a lot of similarities.

aakashs commented 11 months ago

Hi!

In terms of open-source t2v models, Modelscope, AnimateDiff, and Hotshot all use a double self-attention architecture. Modelscope is trained as a full text-to-video model, while AnimateDiff and Hotshot are built on top of text-to-image models, since those are already trained on billions of images and a video is just a series of images. Building on top of existing text-to-image models is advantageous because good video data is super hard to find or create, whereas high-quality image data is much more abundant across all the world's different concepts and languages. You can read more about another approach to building on top of a t2i model here: https://arxiv.org/abs/2304.08818
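To make the "double self attention" idea concrete: alongside the image model's ordinary spatial attention, a second self-attention layer attends across the frame axis at each spatial position, and only this temporal layer is trained while the image model stays frozen. Here's a minimal single-head NumPy sketch of that temporal pass — all names (`temporal_self_attention`, the weight shapes, the single-head layout) are illustrative, not Hotshot's or AnimateDiff's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, wq, wk, wv):
    """Attend across frames at each spatial position independently.

    x: (frames, positions, channels) activations from the frozen image UNet.
    wq, wk, wv: (channels, channels) projections -- in this scheme these are
    the only new, trainable weights; the text-to-image model is left frozen.
    """
    # Move spatial positions to the batch axis so attention runs over frames.
    xt = x.transpose(1, 0, 2)                        # (positions, frames, channels)
    q, k, v = xt @ wq, xt @ wk, xt @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    out = softmax(scores) @ v                        # mix information across frames
    return out.transpose(1, 0, 2)                    # (frames, positions, channels)
```

A sanity check on the design: if every frame carries identical activations, attending over frames changes nothing, which is why this layer can be bolted onto a pretrained image model without disturbing its per-frame behavior at initialization.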

AnimateDiff is a model trained from scratch based on SD 1.5. Hotshot is a different model trained from scratch based on SDXL.

AnimateDiff was trained on a dataset called WebVid from Shutterstock (lots of slow-motion stock video and full Shutterstock watermarks). Hotshot was trained on a different dataset. IMO, in conjunction with SDXL, this is what makes the motion/language understanding much better.