An open-source framework for training large multimodal models.
MIT License
What is the meaning of "media" in the input to PerceiverResampler? Why is a time embedding given to different media? Shouldn't the time embedding be given to different frames? #301
According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of frames in the video along the time axis and v is the number of visual tokens in one frame. But I'm confused about the concept of "media".
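To make the shapes in the question concrete, here is a minimal NumPy sketch of the per-frame reading described above: one time embedding per frame, broadcast over that frame's v tokens, with frames and tokens then flattened into one sequence. This only illustrates the (b, T, v, d) layout as I understand it from the paper; it is not the repository's code, and the function name, the embedding values, and the sizes are all made up for the example.

```python
import numpy as np


def add_frame_time_embeddings(x, time_emb):
    """Hypothetical helper, not part of any library.

    x:        array of shape (b, T, v, d), T frames with v visual tokens each
    time_emb: array of shape (T, d), one embedding per frame (assumption)
    returns:  array of shape (b, T * v, d)
    """
    b, T, v, d = x.shape
    assert time_emb.shape == (T, d)
    # Broadcast each frame's embedding over the batch and over that frame's v tokens.
    x = x + time_emb[None, :, None, :]
    # Flatten frames and tokens into a single sequence of length T * v.
    return x.reshape(b, T * v, d)


# Illustrative sizes only.
b, T, v, d = 2, 8, 64, 16
x = np.zeros((b, T, v, d))
# Toy embedding: frame t gets the constant value t in every channel.
time_emb = np.arange(T, dtype=float)[:, None] * np.ones((T, d))

out = add_frame_time_embeddings(x, time_emb)
print(out.shape)  # (2, 512, 16)
```

With this layout, all v tokens of the same frame share one time embedding, which is what I would expect from "time embedding per frame".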