swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)
Apache License 2.0

Add a temporal embedding to the encoder embeddings to encode the frame index (in addition to the patch position and modality embeddings). #4

Open kdu4108 opened 2 weeks ago

kdu4108 commented 2 weeks ago

According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to image-based modality embeddings (but not to sequence-based ones).

A good starting point: look at https://github.com/swiss-ai/ml-4m/blob/4c2c9a56a9e2cd3e94316e766028c71bb6e248d8/fourm/models/encoder_embeddings.py#L206 and do roughly the same, but with an extra temporal embedding.

Things to consider: make sure the temporal (frame) embedding doesn't interfere with the positional patch embedding.
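To make the interference concern concrete, here is a minimal, framework-free sketch. It assumes a fixed sine-cosine temporal embedding that is added as one vector per frame to every patch token; that choice, and the names `sincos_1d` and `embed_frame_tokens`, are hypothetical and not taken from the repo. Because the temporal vector is constant within a frame, the offsets between patches of the same frame are still determined only by the positional embedding. This does not guarantee the model can separate the two signals, it only shows the sum keeps the within-frame structure.

```python
import math


def sincos_1d(index, dim):
    """Fixed sine-cosine embedding of a scalar index (hypothetical helper).

    Assumes `dim` is even.
    """
    half = dim // 2
    emb = []
    for i in range(half):
        freq = 1.0 / (10000 ** (i / half))
        emb.append(math.sin(index * freq))
        emb.append(math.cos(index * freq))
    return emb


def embed_frame_tokens(tokens, pos_emb, mod_emb, frame_idx):
    """Return x + pos_emb + mod_emb + temp_emb for every patch of one frame.

    tokens:  list of N patch vectors, each of length D
    pos_emb: list of N positional vectors, each of length D
    mod_emb: one modality vector of length D
    """
    dim = len(mod_emb)
    # One temporal vector per frame, shared by all patches of that frame.
    temp_emb = sincos_1d(frame_idx, dim)
    return [
        [x + p + m + t for x, p, m, t in zip(tok, pos, mod_emb, temp_emb)]
        for tok, pos in zip(tokens, pos_emb)
    ]


# Toy example: 2 patches, embedding dimension 4.
D = 4
tokens = [[0.0] * D, [0.0] * D]
pos_emb = [[0.1] * D, [0.2] * D]
mod_emb = [0.5] * D

frame0 = embed_frame_tokens(tokens, pos_emb, mod_emb, frame_idx=0)
frame1 = embed_frame_tokens(tokens, pos_emb, mod_emb, frame_idx=1)

# Offset between patch 1 and patch 0, computed separately in each frame.
offset_f0 = [b - a for a, b in zip(frame0[0], frame0[1])]
offset_f1 = [b - a for a, b in zip(frame1[0], frame1[1])]
```

In this toy setup `offset_f0` and `offset_f1` agree up to floating-point error, while `frame0` and `frame1` themselves differ, so the frame index is encoded without changing the relative patch offsets.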

Definition of Done: all image-based encoder embeddings are augmented with a temporal embedding.

@vesteinn @garjania

kdu4108 commented 5 days ago

Suggestion for aligning temporal embeddings across modalities: when you build the embedding sum for one modality, e.g., Frame 0 (RGB): x + pos_emb + temp_emb + mod_emb, and for another one, e.g., <frame0…>: x + pos_emb + mod_emb + temp_emb, make sure temp_emb is the same for the two modalities when the temporal position is the same.
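One way to get that alignment is to have every image-based modality hold a reference to the same temporal table rather than its own copy. The sketch below is plain Python and purely illustrative: `SharedTemporalEmbedding`, `ModalityEmbedding`, the random initialisation, and the second modality name `"other"` are all hypothetical and not part of the repo.

```python
import random


class SharedTemporalEmbedding:
    """One table of per-frame vectors, reused by every image-based modality (hypothetical)."""

    def __init__(self, max_frames, dim, seed=0):
        rng = random.Random(seed)
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_frames)
        ]

    def __call__(self, frame_idx):
        return self.table[frame_idx]


class ModalityEmbedding:
    """Per-modality embedding that adds a shared temporal vector (hypothetical)."""

    def __init__(self, name, dim, temporal, seed):
        rng = random.Random(seed)
        self.name = name
        self.mod_emb = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        self.temporal = temporal  # shared reference, not a copy

    def embed(self, x, pos_emb, frame_idx):
        # x + pos_emb + mod_emb + temp_emb for a single token.
        temp_emb = self.temporal(frame_idx)
        return [
            xi + p + m + t
            for xi, p, m, t in zip(x, pos_emb, self.mod_emb, temp_emb)
        ]


DIM = 4
temporal = SharedTemporalEmbedding(max_frames=8, dim=DIM)
rgb = ModalityEmbedding("rgb", DIM, temporal, seed=1)
other = ModalityEmbedding("other", DIM, temporal, seed=2)

x = [0.0] * DIM
pos = [0.1] * DIM
rgb_out = rgb.embed(x, pos, frame_idx=3)
other_out = other.embed(x, pos, frame_idx=3)

# Remove everything except the temporal term from each output.
rgb_temp = [o - xi - p - m for o, xi, p, m in zip(rgb_out, x, pos, rgb.mod_emb)]
other_temp = [o - xi - p - m for o, xi, p, m in zip(other_out, x, pos, other.mod_emb)]
```

Here `rgb_temp` and `other_temp` match up to floating-point error, since both modalities read frame 3 from the same table; the order in which the terms are summed does not matter.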