Open kdu4108 opened 2 weeks ago
Considering the RGB frames, before adding anything to modality_info or modality_transform, we need to tokenize them. So I suggest to also include the RGB tokenization step for the video datasets somewhere along the first steps.
(why?) -- We need to tokenize RGB (and all other vision-like modalities) because they can be inputted as tokens to the model. (in fact, RGB is the only one which allows for pixel-patches which would not require tokenization)
Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.
This includes (at least) several steps, each which will be detailed in its own github issue/PR: