lucidrains / CoCa-pytorch

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch
MIT License

LayerNorm after attentional_pooler #13

Closed gpucce closed 1 year ago

gpucce commented 1 year ago

Hi, if I understand correctly, a single LayerNorm is applied to all the queries output by the attentional pooler. In the paper, however, they seem to use one LayerNorm for the single query used by the contrastive loss and a separate one for the queries used as context by the multimodal part. Does this make a difference, is it effectively the same, or am I just wrong?
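For concreteness, the two placements being compared might look like the sketch below. This is a hypothetical minimal example, not the repo's actual code: the class name, dims, and query count are made up, and a plain `nn.MultiheadAttention` stands in for the attentional pooler.

```python
import torch
from torch import nn

class PoolerNormVariants(nn.Module):
    """Illustrative sketch: shared vs. separate LayerNorms after attentional pooling."""

    def __init__(self, dim=64, n_caption_queries=8):
        super().__init__()
        # n_caption_queries for the multimodal decoder, plus 1 contrastive query
        self.queries = nn.Parameter(torch.randn(n_caption_queries + 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # variant A: one LayerNorm shared by all pooled queries (current repo behavior)
        self.shared_norm = nn.LayerNorm(dim)
        # variant B: separate LayerNorms, as the paper seems to describe
        self.contrastive_norm = nn.LayerNorm(dim)
        self.caption_norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, separate_norms=False):
        b = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        if separate_norms:
            # variant B: first query normalized on its own
            contrastive = self.contrastive_norm(pooled[:, :1])
            caption = self.caption_norm(pooled[:, 1:])
        else:
            # variant A: one norm over every pooled query
            pooled = self.shared_norm(pooled)
            contrastive, caption = pooled[:, :1], pooled[:, 1:]
        return contrastive.squeeze(1), caption
```

Either way, the contrastive embedding and the caption context come out with the same shapes; the question is only whether the two streams share normalization statistics and affine parameters.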

lucidrains commented 1 year ago

@gpucce i would be surprised if it made a difference, since the normalized embeddings are projected separately for the contrastive and cross-attention branches anyways

but best way to know is just to run the experiments and see! you are in a great position to do that given open clip :smile: