hotshotco / Hotshot-XL

✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
https://hotshot.co
Apache License 2.0

Difference between AnimateDiff and Hotshot #13

Closed MaxLeung99 closed 11 months ago

MaxLeung99 commented 11 months ago

Hi, thanks for sharing this great work. I just wonder what the difference is between AnimateDiff and Hotshot. Both models train a temporal attention module and freeze the original SD. It seems that the training pipeline shares a lot of similarities.

aakashs commented 11 months ago

Hi!

In terms of open-source t2v models, Modelscope, AnimateDiff, and Hotshot all use a double self-attention architecture. Modelscope is trained as a full text-to-video model, while AnimateDiff and Hotshot are built on top of text-to-image models, since those are already trained on billions of images and a video is just a series of images. Building on top of existing text-to-image models is advantageous because good video data is super hard to find or create, whereas high-quality image data is much more abundant across all the world's different concepts and languages. You can read more about another approach to building on top of a t2i model here: https://arxiv.org/abs/2304.08818
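To make the "double self attention" idea concrete: alongside the image model's ordinary spatial attention, a second self-attention layer attends across the frame axis at each spatial position, and only this temporal layer is trained while the image model stays frozen. Here's a minimal single-head NumPy sketch of that temporal pass — all names (`temporal_self_attention`, the weight shapes, the single-head layout) are illustrative, not Hotshot's or AnimateDiff's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, wq, wk, wv):
    """Attend across frames at each spatial position independently.

    x: (frames, positions, channels) activations from the frozen image UNet.
    wq, wk, wv: (channels, channels) projections -- in this scheme these are
    the only new, trainable weights; the text-to-image model is left frozen.
    """
    # Move spatial positions to the batch axis so attention runs over frames.
    xt = x.transpose(1, 0, 2)                        # (positions, frames, channels)
    q, k, v = xt @ wq, xt @ wk, xt @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    out = softmax(scores) @ v                        # mix information across frames
    return out.transpose(1, 0, 2)                    # (frames, positions, channels)
```

A sanity check on the design: if every frame carries identical activations, attending over frames changes nothing, which is why this layer can be bolted onto a pretrained image model without disturbing its per-frame behavior at initialization.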

AnimateDiff is a model trained from scratch based on SD 1.5. Hotshot is a different model trained from scratch based on SDXL.

AnimateDiff was trained on a dataset called WebVid from Shutterstock (lots of slow-motion stock video and full Shutterstock watermarks). Hotshot was trained on a different dataset. IMO, in conjunction with SDXL, this is what makes the motion/language understanding much better.