Open kdu4108 opened 4 months ago
Suggestion for aligning temporal embeddings across modalities: when you build the embedding sum for one modality, e.g., Frame 0 (RGB): x + pos_emb + temp_emb + mod_emb, and for another one, e.g., <frame0…>: x + pos_emb + mod_emb + temp_emb, make sure temp_emb is identical for the two modalities whenever the frame position is the same.
According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to image-based modality embeddings (but not to sequence-based ones).
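A minimal, framework-free sketch of the two points above: the temporal embedding depends only on the frame index (so it is shared across modalities), and it is added only for image-based modalities. The names `sincos_temporal_embedding`, `embed_token`, and `is_image_based` are hypothetical and not taken from ml-4m; the fixed sine-cosine form is an assumption, and a learned per-frame table would work the same way.

```python
import math

def sincos_temporal_embedding(frame_idx, dim, max_period=10000.0):
    """Fixed sine-cosine embedding of a frame index (hypothetical helper).

    Depends only on frame_idx, never on the modality, so two modalities
    at the same frame receive the same vector. Assumes dim is even.
    """
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    angles = [frame_idx * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

def embed_token(x, pos_emb, mod_emb, frame_idx, is_image_based):
    """Sum token, positional, and modality embeddings; add the temporal
    embedding only for image-based modalities (assumption from the slides)."""
    out = [a + b + c for a, b, c in zip(x, pos_emb, mod_emb)]
    if is_image_based:
        temp_emb = sincos_temporal_embedding(frame_idx, len(x))
        out = [o + t for o, t in zip(out, temp_emb)]
    return out
```

Because addition is commutative, the order of the terms (temp_emb before or after mod_emb) does not matter; what matters is that both modalities look up temp_emb by the same frame index.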
A good starting point: look at https://github.com/swiss-ai/ml-4m/blob/4c2c9a56a9e2cd3e94316e766028c71bb6e248d8/fourm/models/encoder_embeddings.py#L206 and do something similar, but with an extra temporal embedding.
Things to consider: make sure the temporal (frame) embedding doesn't interfere with the positional patch embedding.
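One possible sanity check for this, sketched under assumptions: the issue does not define "interfere", so here it is taken to mean that two different (patch, frame) pairs end up with the same summed offset pos_emb + temp_emb. The helper names and the 1D sine-cosine stand-ins below are hypothetical, not the ml-4m implementation. The sketch shows that reusing the same sine-cosine family for both embeddings collides, since pos(p) + temp(t) equals pos(t) + temp(p), while a different frequency base for the temporal embedding separates the pairs.

```python
import math
from itertools import combinations

def sincos_1d(index, dim, max_period):
    """1D sine-cosine embedding of an integer index (stand-in for both
    the positional and the temporal embedding). Assumes dim is even."""
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    angles = [index * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

def min_pairwise_distance(num_patches, num_frames, dim,
                          pos_period=10000.0, temp_period=10000.0):
    """Smallest Euclidean distance between pos_emb + temp_emb over all
    distinct (patch, frame) pairs. A value of 0 means two different
    pairs are indistinguishable after the sum."""
    combined = []
    for t in range(num_frames):
        temp = sincos_1d(t, dim, temp_period)
        for p in range(num_patches):
            pos = sincos_1d(p, dim, pos_period)
            combined.append([a + b for a, b in zip(pos, temp)])
    return min(math.dist(u, v) for u, v in combinations(combined, 2))

# Same family for both embeddings: (patch 0, frame 1) and (patch 1, frame 0) collide.
collision = min_pairwise_distance(4, 3, 8)
# Different frequency base for the temporal embedding: pairs stay separated.
separated = min_pairwise_distance(4, 3, 8, temp_period=100.0)
```

With learned embeddings the same check can be run on the trained tables; it only detects exact or near collisions and says nothing about how well the model can disentangle the two signals.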
Definition of Done: all image-based encoder embeddings are augmented with a temporal embedding.
@vesteinn @garjania