swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)
Apache License 2.0

[PARENT ISSUE] Implement the temporal changes in 4M to account for video #2

Open · kdu4108 opened 2 weeks ago

kdu4108 commented 2 weeks ago

Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.

This includes (at least) several steps, each of which will be detailed in its own GitHub issue/PR:

garjania commented 1 week ago

Considering the RGB frames: before adding anything to modality_info or modality_transform, we need to tokenize them. So I suggest also including the RGB tokenization step for the video datasets among the first steps.
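A rough sketch of what that per-frame tokenization step might look like, assuming a pretrained VQ-style image tokenizer applied frame by frame. The `tokenizer` object, its `encode` method, and the shapes are placeholders, not the repo's actual API:

```python
import torch

# Hypothetical per-frame RGB tokenization for a video clip.
# `tokenizer` stands in for a pretrained VQ-VAE/VQ-GAN-style image tokenizer;
# the exact loading call and interface in ml-4m may differ.
def tokenize_video_rgb(frames: torch.Tensor, tokenizer) -> torch.Tensor:
    """
    frames: (T, 3, H, W) RGB frames in [0, 1]
    returns: (T, N) discrete token ids, where N = patches per frame
    """
    token_ids = []
    with torch.no_grad():
        for t in range(frames.shape[0]):
            # Encode one frame into a grid of codebook indices (assumed shape (1, h, w)),
            # then flatten the grid into a 1D token sequence.
            ids = tokenizer.encode(frames[t].unsqueeze(0))
            token_ids.append(ids.flatten(start_dim=1))
    return torch.cat(token_ids, dim=0)  # (T, N)
```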

kdu4108 commented 4 days ago

(Why?) We need to tokenize RGB (and all other vision-like modalities) because the model consumes them as discrete tokens. (In fact, RGB is the only modality that can also be fed in as pixel patches, which would not require tokenization.)
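A minimal sketch of the two input paths for RGB, assuming a standard patch-embedding setup; the names, vocabulary size, and dimensions below are illustrative, not taken from the repo:

```python
import torch
import torch.nn as nn

# Illustrative only: two ways an RGB frame can enter the model.
vocab_size, dim, patch = 16384, 768, 16

# Path 1: discrete tokens from a VQ tokenizer -> embedding lookup.
token_embed = nn.Embedding(vocab_size, dim)
token_ids = torch.randint(0, vocab_size, (1, 196))             # (B, N) token ids
tokens_as_input = token_embed(token_ids)                       # (B, N, dim)

# Path 2 (RGB only): raw pixel patches -> linear projection, no tokenizer needed.
patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
frame = torch.rand(1, 3, 224, 224)                             # (B, 3, H, W)
patches_as_input = patchify(frame).flatten(2).transpose(1, 2)  # (B, 196, dim)
```

Other vision-like modalities only have the first path available, which is why their tokenization has to come before they can be registered as model inputs.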