mit-han-lab / vila-u

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
MIT License

Generating video. #7

Open SekeunKim opened 1 week ago

SekeunKim commented 1 week ago

From the paper, it looks like the model can generate next tokens and decode them into images for video generation. The demo, however, only generates images from text. Is that correct?

Thank you for the great paper.