lucidrains / CoCa-pytorch

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch
MIT License

why train VIT visual encoder first? #4

Closed Flowerfan closed 2 years ago

Flowerfan commented 2 years ago

Hi, thanks for sharing this repo. In the CoCa paper, both the visual encoder and the text encoder are trained end-to-end. But in this repo, the ViT is first pretrained and then frozen while training CoCa.
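A minimal sketch of the distinction being raised, using toy stand-in modules rather than the repo's actual API: in the paper's end-to-end setup the caption/contrastive loss backpropagates into the visual encoder, whereas a frozen (Flamingo-style) setup detaches the image features so the encoder receives no gradient.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the visual encoder and the CoCa text stack.
# Names and shapes here are illustrative, not the repo's actual API.
visual_encoder = nn.Linear(8, 4)   # pretend ViT
text_head = nn.Linear(4, 2)        # pretend captioner head

images = torch.randn(3, 8)

# End-to-end (the paper's setup): gradients flow into the visual encoder.
feats = visual_encoder(images)
loss = text_head(feats).sum()
loss.backward()
assert visual_encoder.weight.grad is not None

# Frozen-encoder variant (Flamingo-style): detach the features, so the
# visual encoder receives no gradient from the loss.
visual_encoder.zero_grad(set_to_none=True)
feats = visual_encoder(images).detach()
loss = text_head(feats).sum()
loss.backward()
assert visual_encoder.weight.grad is None
```

The same effect can also be had by setting `requires_grad = False` on the encoder's parameters; `detach()` just makes the cut explicit at the feature boundary.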

lucidrains commented 2 years ago

@Flowerfan oh yes, i do believe you are correct: https://github.com/lucidrains/CoCa-pytorch/commit/4a6dbccb9b08d49229b378c4496c514f9a6ab427 i must have been thinking of Flamingo at the time

thank you for pointing this out!