[PARENT ISSUE] Implement the temporal changes in 4M to account for video

kdu4108 commented 2 weeks ago

Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.

This includes (at least) several steps, each of which will be detailed in its own GitHub issue/PR:

Determine the correct format for storing each video modality and implement pseudolabelers/data downloaders/etc. to get the video data stored in parallel directories as usable by 4M and video2dataset. "Definition of done" here means we have the data in the right directories and we can load them in the correct format. (https://github.com/swiss-ai/ml-4m/issues/3)
Implement modality_info and modality_transforms to map the new video modalities to their transformations which prepare them from the input filetype to be passable downstream to the model. (WIP PR: https://github.com/swiss-ai/ml-4m/pull/1)
Implement the encoder embeddings to encode frame position (in addition to the patch position and modality embeddings). (https://github.com/swiss-ai/ml-4m/issues/4)
Implement a masking strategy which masks consistently across temporal frames (i.e., if you mask out patch position 7 for one frame, do it for all frames in that clip). (https://github.com/swiss-ai/ml-4m/issues/5)
TODO? @garjania what other steps are required here? anything for decoder embeddings?
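
To make the last two steps concrete, here is a rough, dependency-free sketch (not the planned implementation). It assumes a fixed sinusoidal frame-position embedding that is simply summed with the patch-position and modality embeddings, and a mask that picks patch positions once per clip and repeats them for every frame. The names `sinusoidal_embedding`, `embed_token`, and `sample_tube_mask` are made up for illustration and are not part of 4M; the real version would work on tensors and might use learned embeddings instead.

```python
import math
import random


def sinusoidal_embedding(position, dim):
    """Fixed sin/cos embedding for an integer position (an assumed choice)."""
    emb = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb[:dim]


def embed_token(patch_pos_emb, modality_emb, frame_idx):
    """Sum the patch-position, modality, and frame-position embeddings."""
    frame_emb = sinusoidal_embedding(frame_idx, len(patch_pos_emb))
    return [p + m + f for p, m, f in zip(patch_pos_emb, modality_emb, frame_emb)]


def sample_tube_mask(num_frames, num_patches, num_masked, seed=None):
    """Return a num_frames x num_patches grid of booleans (True = masked).

    Patch positions are sampled once and reused for every frame, so a
    position masked in one frame is masked in all frames of the clip.
    """
    if not 0 <= num_masked <= num_patches:
        raise ValueError("num_masked must be between 0 and num_patches")
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    row = [p in masked_positions for p in range(num_patches)]
    return [list(row) for _ in range(num_frames)]


# Hypothetical clip: 4 frames, 16 patch positions per frame, 6 masked.
mask = sample_tube_mask(num_frames=4, num_patches=16, num_masked=6, seed=0)
token = embed_token([0.0] * 4, [0.0] * 4, frame_idx=0)
```

Every row of `mask` is identical, which is the "mask consistently across temporal frames" property from the masking step.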

garjania commented 1 week ago

Considering the RGB frames, before adding anything to modality_info or modality_transform, we need to tokenize them. So I suggest also including the RGB tokenization step for the video datasets somewhere in the first steps.

kdu4108 commented 4 days ago

(Why?) We need to tokenize RGB (and all other vision-like modalities) because they can be input to the model as tokens. (In fact, RGB is the only modality that can also be input as pixel patches, which would not require tokenization.)
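
For reference, a minimal sketch of what the pixel-patch alternative for RGB could look like: each frame is cut into non-overlapping square patches whose raw channel values are flattened, so no tokenizer is involved. The `patchify` helper and the nested-list frame layout are assumptions for illustration only; real code would do this on tensors.

```python
def patchify(frame, patch_size):
    """Split an H x W frame (rows of (r, g, b) tuples) into flattened patches.

    Patches are returned in row-major order; each is a flat list of
    patch_size * patch_size * 3 channel values.
    """
    height, width = len(frame), len(frame[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(frame[y][x])
            patches.append(patch)
    return patches


# Hypothetical 4x4 frame whose pixels store their own (row, column) index.
frame = [[(y, x, 0) for x in range(4)] for y in range(4)]
patches = patchify(frame, patch_size=2)
```

With a 4x4 frame and `patch_size=2` this gives 4 patches of 12 values each. The tokenized path would instead map each patch to a discrete token ID via a pretrained tokenizer.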
