handsomelys opened 1 week ago
Exactly. They are just randomly initialized vanilla positional embeddings.
These paired embeddings share the same weights, which labels the corresponding text-paired datasets.
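A minimal PyTorch sketch of what such shared, randomly initialized learnable embeddings could look like (the class name `SharedPairEmbedding`, the shapes, and the assumption that both branches carry `num_tokens` tokens are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class SharedPairEmbedding(nn.Module):
    """Randomly initialized learnable embedding shared by both branches of a paired dataset."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # Vanilla positional-style embedding: random init, learned end to end.
        self.embed = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)

    def forward(self, text_tokens: torch.Tensor, multimodal_tokens: torch.Tensor):
        # The SAME weights are added to both branches, marking them as a pair
        # (both inputs are assumed to have num_tokens tokens here).
        return text_tokens + self.embed, multimodal_tokens + self.embed
```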
Thanks for the reply!
Thank you very much for your enthusiastic reply, but I still have some questions about the pretraining objectives: Are the two terms in Formula 4 not equivalent? Why?
What is the value of the prediction $p_v$ in Formula 5? Is $p_v$ obtained directly through an MLP layer, or does it go through an activation function such as sigmoid?
How is the conditional causal masking mentioned in Formula 6 done? Is it to mask the last 60% of all tokens and then use BERT to reconstruct the masked tokens in an autoregressive manner?
Because they are two matrices of text and multimodal features, their dot products are transposes of each other. So the summations over the columns and over the rows are different, especially when dealing with a huge batch size.
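To make this concrete, here is a minimal sketch (not the released code) of a symmetric two-term contrastive loss: the logit matrix and its transpose share the same diagonal, but one term normalizes over rows and the other over columns, so the two cross-entropy terms generally differ.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(text_feat, mm_feat, temperature=0.07):
    """Symmetric (two-term) contrastive loss over a batch of paired features."""
    text_feat = F.normalize(text_feat, dim=-1)
    mm_feat = F.normalize(mm_feat, dim=-1)
    # logits[i, j] = similarity between text i and multimodal sample j.
    logits = text_feat @ mm_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Term 1: softmax over rows (text -> multimodal direction).
    loss_t2m = F.cross_entropy(logits, targets)
    # Term 2: softmax over columns, i.e. over the transposed matrix
    # (multimodal -> text direction). Same diagonal, different denominators.
    loss_m2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2m + loss_m2t)
```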
The matching process is provided in our released code:
https://github.com/invictus717/MiCo/blob/89c91c9dac68125a18a1a966bd80f9e74e584e80/model/mico.py#L44
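As a rough illustration only (the linked mico.py is the authoritative implementation), a matching prediction like $p_v$ is commonly produced by an MLP head over the fused features followed by a normalization that turns the logits into match probabilities; the head below is an assumed example, not the repo's exact code.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Illustrative text-multimodal matching head; see the linked mico.py for the actual code."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, fused_cls: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(fused_cls)            # raw two-class logits (match / no match)
        return logits.softmax(dim=-1)[..., 1]   # probability that the pair matches
```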
The causal pretraining process is exactly as you say, which is intuitive and simple.
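A minimal sketch of that masking procedure (the 60% ratio comes from the question above; the helper name and shapes are illustrative): the leading tokens stay fully visible as the condition, and the trailing masked tokens are reconstructed left to right, each attending only to the visible prefix and to earlier masked positions.

```python
import torch

def conditional_causal_mask(seq_len: int, mask_ratio: float = 0.6) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    The first (1 - mask_ratio) tokens form a fully visible condition;
    the remaining tokens are reconstructed autoregressively (causally).
    """
    num_visible = int(seq_len * (1.0 - mask_ratio))
    num_masked = seq_len - num_visible
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Every position may attend to the visible (conditioning) prefix.
    mask[:, :num_visible] = True
    # Masked positions additionally attend causally among themselves.
    mask[num_visible:, num_visible:] = torch.tril(
        torch.ones(num_masked, num_masked, dtype=torch.bool)
    )
    return mask
```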
If you have any further questions, please feel free to reach out.
Thanks again for your reply!
How do I understand $E_{Sam}$ and the corresponding $E^{T-I}_{Sam}$ in the paper? Are they constructed like the positional embedding in the transformer, i.e. as learnable embeddings such as the $E_{Pos}$ mentioned above?