PanXiebit closed this issue 1 year ago.
Hi @PanXiebit. Regarding (1), the text unimodal tower does use a causal attention mask; you can see it here: https://github.com/mlfoundations/open_clip/blob/197cf453576534386ca32431828ed701c1e01c45/src/open_clip/transformer.py#L613. Regarding (2), that setup is fine-tuning specifically for captioning, and setting the CLIP (contrastive) loss weight to zero is how it is done in the original paper. I hope this helps :)
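For reference, here is a minimal sketch (not the open_clip code itself; the helper name `build_causal_attn_mask` is just for illustration) of how a causal attention mask like the one linked above is typically built: entries strictly above the diagonal are `-inf`, so each token can only attend to itself and earlier positions.

```python
import torch

def build_causal_attn_mask(seq_len: int) -> torch.Tensor:
    # Start with every position blocked (-inf), then zero out the diagonal
    # and below, so token i may attend only to tokens 0..i.
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask.triu_(1)  # keep -inf only strictly above the diagonal
    return mask

print(build_causal_attn_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```

On (2), if I recall correctly, open_clip exposes separate caption and contrastive loss weights (e.g. `CoCaLoss(caption_loss_weight=..., clip_loss_weight=...)` and a `--coca-contrastive-loss-weight` training flag), so captioning-only fine-tuning amounts to setting the contrastive/CLIP weight to 0 while keeping the captioning weight nonzero.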
@gpucce Thanks!
Thanks for your great work. I have several questions about the CoCa model.