What does "media" mean in the input to PerceiverResampler? Why is the time embedding applied to different media? Shouldn't it be applied to different frames?

mlfoundations / open_flamingo

An open-source framework for training large multimodal models.

MIT License


What does "media" mean in the input to PerceiverResampler? Why is the time embedding applied to different media? Shouldn't it be applied to different frames? #301

Open Yang-bug-star opened 5 months ago

Yang-bug-star commented 5 months ago

According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of frames in the video and v is the number of visual tokens per frame. But I'm confused about the concept of "media" here.
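
To make the shapes concrete, here is a minimal NumPy sketch of the layout as I understand it from the paper: a per-frame time embedding is added to every visual token of that frame, and the time and token axes are then flattened. The sizes and the names `time_embs`, `x_timed`, and `tokens` are illustrative assumptions on my part, not code from open_flamingo.

```python
import numpy as np

# Illustrative sizes only (assumptions, not values from open_flamingo).
b, T, v, d = 2, 4, 16, 8  # batch, frames, visual tokens per frame, feature dim

rng = np.random.default_rng(0)
x = rng.standard_normal((b, T, v, d))    # paper-style input: (b, T, v, d)
time_embs = rng.standard_normal((T, d))  # hypothetical: one embedding per frame

# Broadcast each frame's embedding over the batch and over that frame's v tokens.
x_timed = x + time_embs[None, :, None, :]

# Flatten the time and token axes into a single sequence axis.
tokens = x_timed.reshape(b, T * v, d)
print(tokens.shape)  # (2, 64, 8)
```

In this reading, the time embedding is indexed by frame, which is why indexing it by "media" is what confuses me.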