Closed · @9B8DY6 closed this issue 1 year ago
@9B8DY6 they don't actually produce the entire video from one large prompt at once
they basically generate one scene at a time, with each new scene conditioned on the last few frames of the previous scene
in other words, within a scene, all of the video tokens always cross-attend to the full set of text tokens
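
a minimal sketch of that scheme in plain PyTorch, for anyone landing here — `DummyScenePrior`, its `sample` signature, `prime_frames`, and the overlap handling are all hypothetical stand-ins for illustration, not the repo's actual API:

```python
import torch

class DummyScenePrior:
    """Hypothetical stand-in for a maskgit-style scene sampler.

    Assumed interface: given the full text token ids and (optionally) a few
    prime frames from the previous scene, return token ids for one scene of
    shape (batch, num_frames, height, width). The first `overlap` frames of
    the output are assumed to correspond to the prime frames.
    """
    def sample(self, text_tokens, *, prime_frames=None, num_frames=11):
        batch = text_tokens.shape[0]
        return torch.randint(0, 8192, (batch, num_frames, 16, 16))

def generate_long_video(model, text_tokens, num_scenes=3, frames_per_scene=11, overlap=2):
    scenes, prime = [], None
    for _ in range(num_scenes):
        # within a scene, every video token cross-attends to ALL of the
        # text tokens -- the text is never cut up per frame
        scene = model.sample(text_tokens, prime_frames=prime, num_frames=frames_per_scene)
        # drop the overlapping prime frames from every scene after the first
        scenes.append(scene if prime is None else scene[:, overlap:])
        # the last few frames of this scene condition the next one
        prime = scene[:, -overlap:]
    return torch.cat(scenes, dim=1)  # (batch, total_frames, height, width)

text = torch.randint(0, 32000, (1, 12))  # fake text token ids
video_tokens = generate_long_video(DummyScenePrior(), text)
print(video_tokens.shape)  # torch.Size([1, 29, 16, 16])
```

the point of the sketch is just the conditioning pattern: the same full text token sequence is passed to every scene, while only a few trailing frames carry information from one scene to the next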
In cvivit, you train on video clips with a fixed number of frames. When training maskgit to cross-attend to the text tokens, how did you cut (?) the corresponding text tokens for the given frames? Thank you!