Add a temporal (frame-index) embedding to the encoder embeddings, in addition to the patch position and modality embeddings.

swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)

Apache License 2.0


According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to the image-based modality embeddings, but not to the sequence-based ones.

Things to consider: make sure the temporal frame embedding does not interfere with the positional patch embedding.
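
A rough sketch of one way this could look, to make the shapes concrete. This is not the existing ml-4m code: `add_embeddings`, the `(T, N, D)` token layout, and the random tables standing in for the real position, modality and temporal embeddings are all assumptions for illustration, and NumPy is used only to keep the example dependency-light. The temporal term is added per frame and broadcast over patches, so it leaves the per-patch positional term untouched.

```python
import numpy as np

def add_embeddings(tokens, pos_emb, mod_emb, temporal_emb=None):
    """Hypothetical helper: sum the embeddings onto image-patch tokens.

    tokens:       (T, N, D)  T frames, N patches per frame, D channels
    pos_emb:      (N, D)     one vector per patch position, shared by all frames
    mod_emb:      (D,)       one vector per modality
    temporal_emb: (T, D)     one vector per frame; None for sequence-based modalities
    """
    out = tokens + pos_emb[None, :, :] + mod_emb[None, None, :]
    if temporal_emb is not None:
        # Same offset for every patch of a frame, different offset per frame.
        out = out + temporal_emb[:, None, :]
    return out

T, N, D = 4, 16, 32
rng = np.random.default_rng(0)
tokens = rng.normal(size=(T, N, D))
pos_emb = rng.normal(scale=0.02, size=(N, D))        # stand-in for the patch position embedding
mod_emb = rng.normal(scale=0.02, size=D)             # stand-in for the modality embedding
temporal_emb = rng.normal(scale=0.02, size=(T, D))   # stand-in for a learned frame embedding

with_time = add_embeddings(tokens, pos_emb, mod_emb, temporal_emb)
without_time = add_embeddings(tokens, pos_emb, mod_emb)

# The only difference is the per-frame offset, identical across patches.
delta = with_time - without_time
print(delta.shape)
```

One thing this does not settle: if the temporal and positional tables were built from the same fixed sinusoidal function, frame `t` and patch `t` would contribute identical vectors, which may be one form of the interference mentioned above. A separate learned table, or different frequencies, would be possible ways around that.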

Definition of Done: all image-based encoder embeddings are augmented with a temporal embedding.

@vesteinn @garjania
