mlfoundations / open_clip

An open source implementation of CLIP.

CoCa multimodal transformer layer implementation #571

ebsmothers commented 1 year ago

Hi, thanks for your CoCa implementation! I have a question about the multimodal transformer: in a decoder layer I would typically expect self-attention, then cross-attention, then an MLP. But here a single layer actually does self-attention, MLP, cross-attention, then another MLP (since both resblock and cross_attn carry their own MLP). Is there a specific reason for this ordering? Thanks in advance.
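For readers following along, here is a minimal PyTorch sketch of the two orderings being contrasted. It assumes pre-norm residual blocks and standard `nn.MultiheadAttention`; the class names and structure are illustrative, not open_clip's actual modules.

```python
import torch.nn as nn

def mlp(d_model):
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
    )

class ClassicDecoderLayer(nn.Module):
    """Standard decoder ordering: self-attn -> cross-attn -> a single MLP."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = mlp(d_model)
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, context, attn_mask=None):
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        h = self.ln2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        return x + self.mlp(self.ln3(x))

class QuestionedLayer(nn.Module):
    """The ordering the question observes: self-attn -> MLP -> cross-attn -> MLP,
    because the self-attention resblock and the cross-attention block each
    carry their own MLP."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp1 = mlp(d_model)  # MLP inside the self-attention resblock
        self.mlp2 = mlp(d_model)  # second MLP inside the cross-attention block
        self.ln = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x, context, attn_mask=None):
        h = self.ln[0](x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp1(self.ln[1](x))
        h = self.ln[2](x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        return x + self.mlp2(self.ln[3](x))
```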

gpucce commented 1 year ago

Hi @ebsmothers, the main reason is that this was mostly inspired by https://github.com/lucidrains/CoCa-pytorch/blob/main/coca_pytorch/coca_pytorch.py, which uses a parallel feedforward instead of the classic sequential one in both the self-attention and the cross-attention blocks.
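For context, "parallel feedforward" here means the MLP branch is computed from the same normalized input as the attention branch and both residuals are summed, rather than the MLP running sequentially after attention (the GPT-J/PaLM-style parallel block). A minimal sketch, assuming a pre-norm residual layout; names are illustrative:

```python
import torch.nn as nn

class ParallelFFBlock(nn.Module):
    """Parallel attention + feedforward:
    x = x + attn(norm(x)) + ff(norm(x)),
    versus the classic sequential
    x = x + attn(norm(x)); x = x + ff(norm(x))."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # one shared pre-norm feeds both branches
        attn_out = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        return x + attn_out + self.ff(h)
```

If I read lucidrains' code correctly, it goes further and fuses the attention and feedforward input projections into a single linear layer for efficiency; the sketch above keeps them separate for readability, but the residual structure is the same.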