Closed · @9B8DY6 closed this issue 1 year ago
@9B8DY6 they don't actually produce the entire video from one large prompt at once
they basically generate one scene at a time, with each new scene conditioned on the last few frames of the previous scene
in other words, within a scene, all of the video tokens always cross-attend to the full set of text tokens
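
a minimal sketch of that scheme in plain PyTorch, for anyone landing here — `DummyScenePrior`, its `sample` signature, `prime_frames`, and the overlap handling are all hypothetical stand-ins for illustration, not the repo's actual API:

```python
import torch

class DummyScenePrior:
    """Hypothetical stand-in for a maskgit-style scene sampler.

    Assumed interface: given the full text token ids and (optionally) a few
    prime frames from the previous scene, return token ids for one scene of
    shape (batch, num_frames, height, width). The first `overlap` frames of
    the output are assumed to correspond to the prime frames.
    """
    def sample(self, text_tokens, *, prime_frames=None, num_frames=11):
        batch = text_tokens.shape[0]
        return torch.randint(0, 8192, (batch, num_frames, 16, 16))

def generate_long_video(model, text_tokens, num_scenes=3, frames_per_scene=11, overlap=2):
    scenes, prime = [], None
    for _ in range(num_scenes):
        # within a scene, every video token cross-attends to ALL of the
        # text tokens -- the text is never cut up per frame
        scene = model.sample(text_tokens, prime_frames=prime, num_frames=frames_per_scene)
        # drop the overlapping prime frames from every scene after the first
        scenes.append(scene if prime is None else scene[:, overlap:])
        # the last few frames of this scene condition the next one
        prime = scene[:, -overlap:]
    return torch.cat(scenes, dim=1)  # (batch, total_frames, height, width)

text = torch.randint(0, 32000, (1, 12))  # fake text token ids
video_tokens = generate_long_video(DummyScenePrior(), text)
print(video_tokens.shape)  # torch.Size([1, 29, 16, 16])
```

the point of the sketch is just the conditioning pattern: the same full text token sequence is passed to every scene, while only a few trailing frames carry information from one scene to the next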
In cvivit, you train on video clips with a fixed number of frames. When training maskgit to cross-attend to the text tokens, how did you cut (?) the corresponding text tokens for the given frames? Thank you!