mlfoundations / open_flamingo

An open-source framework for training large multimodal models.
MIT License
3.68k stars, 277 forks

What is the meaning of "media" in the input to PerceiverResampler? Why are time embeddings given to different media? Shouldn't they be given to different frames? #301

Open Yang-bug-star opened 3 months ago

Yang-bug-star commented 3 months ago

According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of video frames along the time axis and v is the number of visual tokens per frame. But I'm confused about the concept of media.
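To make the shapes concrete, here is a minimal NumPy sketch of the layout described above: one time embedding per frame, broadcast over that frame's v visual tokens, followed by flattening frames and tokens into a single sequence. This only illustrates my reading of the paper and is not the open_flamingo implementation; the function name `add_frame_time_embeddings` and the `time_embs` table are hypothetical names made up for this example.

```python
import numpy as np

def add_frame_time_embeddings(x, time_embs):
    """Add one time embedding per frame, then flatten frames and tokens.

    x:         (b, T, v, d) visual features, T frames with v tokens each
    time_embs: (T_max, d) embedding table with T_max >= T (hypothetical)
    returns:   (b, T * v, d)
    """
    b, T, v, d = x.shape
    # Select the first T embeddings and broadcast them over the batch
    # dimension and over the v visual tokens of each frame.
    x = x + time_embs[:T][None, :, None, :]
    # Merge the frame and token axes into one sequence axis.
    return x.reshape(b, T * v, d)

rng = np.random.default_rng(0)
b, T, v, d = 2, 4, 3, 8
x = np.zeros((b, T, v, d))            # zeros so the embeddings are easy to see
time_embs = rng.normal(size=(8, d))   # T_max = 8 in this toy example
out = add_frame_time_embeddings(x, time_embs)
print(out.shape)  # (2, 12, 8)
```

With this layout, every token of frame t carries the same embedding `time_embs[t]`, which is what I would expect "time embedding per frame" to mean.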