Closed: gpucce closed this issue 1 year ago
@gpucce I would be surprised if it made a difference, since the normalized embeddings are projected separately for the contrastive loss and for cross-attention anyway.
But the best way to know is to run the experiments and see! You are in a great position to do that given open_clip :smile:
Hi, if I understand correctly, a single `LayerNorm` is applied to all the queries output by the attentional pooler. In the paper, however, it seems they use one `LayerNorm` for the query used by the contrastive loss and a different one for the queries used as context by the multimodal part. Does this make a difference in practice, are the two equivalent, or am I just misreading?
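To make the two variants being compared concrete, here is a minimal sketch of an attentional pooler in PyTorch with a flag toggling between a single shared `LayerNorm` and separate norms for the contrastive query and the captioning-context queries. All names (`AttentionalPoolerSketch`, `separate_norms`, etc.) are illustrative, not the actual open_clip implementation:

```python
import torch
import torch.nn as nn


class AttentionalPoolerSketch(nn.Module):
    """Hypothetical sketch: pool a sequence with learned queries, then
    normalize. `separate_norms=False` mimics a single shared LayerNorm;
    `separate_norms=True` mimics distinct norms for the contrastive
    query vs. the multimodal-context queries (as in the paper)."""

    def __init__(self, d_model: int, n_queries: int = 8, n_heads: int = 4,
                 separate_norms: bool = False):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.separate_norms = separate_norms
        if separate_norms:
            # one norm for the contrastive token, another for context tokens
            self.ln_contrastive = nn.LayerNorm(d_model)
            self.ln_context = nn.LayerNorm(d_model)
        else:
            # a single norm over all pooled queries
            self.ln = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) -> pooled: (batch, n_queries, d_model)
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        pooled, _ = self.attn(q, x, x)
        if self.separate_norms:
            contrastive = self.ln_contrastive(pooled[:, :1])
            context = self.ln_context(pooled[:, 1:])
        else:
            normed = self.ln(pooled)
            contrastive, context = normed[:, :1], normed[:, 1:]
        return contrastive, context


# Usage: same shapes either way, only the normalization parameters differ.
x = torch.randn(2, 16, 32)
for flag in (False, True):
    pooler = AttentionalPoolerSketch(32, separate_norms=flag)
    c, ctx = pooler(x)
    print(flag, c.shape, ctx.shape)  # (2, 1, 32) and (2, 7, 32)
```

Note that since `LayerNorm` has learnable affine parameters, the two variants are not strictly equivalent: a shared norm ties the scale/shift of the contrastive token to the context tokens, though (as the reply suggests) the separate projections downstream may absorb most of that difference.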