mlfoundations / open_clip

An open source implementation of CLIP.

Why doesn't the text unimodal tower of CoCa use causal-masked self-attention? #564

Closed · PanXiebit closed this 1 year ago

PanXiebit commented 1 year ago

Thanks for your great work. I have a couple of questions about the CoCa model.

  1. In the original paper, both the unimodal and the multimodal decoders use causally-masked self-attention. However, the unimodal implementation in this repo reuses the CLIP text tower. If it does not use causal masking, wouldn't the caption loss be improperly computed, since earlier words could see later words? (See the sketch after this list.) "That is, the bottom n_uni unimodal decoder layers encode the input text as latent vectors with causally-masked self-attention, and the top n_multi multimodal layers further apply causally-masked self-attention together with cross-attention to the output of the visual encoder." (from the original paper)

(image attached in the original issue)

  2. In the README, the fine-tuning script for the CoCa model sets `--coca-contrastive-loss-weight 0`; why not also use the CLIP loss?
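For concreteness, here is a minimal sketch of the leakage concern in (1). This is not open_clip code; the shapes and names are made up for illustration. With an additive causal mask, position i can only attend to positions j <= i, which is what keeps the caption loss from seeing future tokens:

```python
import torch

# Toy shapes for illustration: batch 1, 4 tokens, width 8 (all made up).
q = k = v = torch.randn(1, 4, 8)

# Additive causal mask: 0 on/below the diagonal, -inf strictly above,
# so position i can only attend to positions j <= i.
seq_len = q.shape[1]
mask = torch.full((seq_len, seq_len), float("-inf")).triu_(1)

attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + mask, dim=-1)
out = attn @ v

# Everything above the diagonal is exactly zero: no token sees the future.
print(attn[0])
```

Without the `mask` term, the attention weights above the diagonal would be nonzero and the next-token captioning objective would be trivially easy.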
gpucce commented 1 year ago

Hi @PanXiebit. About (1): the text unimodal tower does have the causal attention mask; you can see it here: https://github.com/mlfoundations/open_clip/blob/197cf453576534386ca32431828ed701c1e01c45/src/open_clip/transformer.py#L613. About (2): that script fine-tunes specifically for captioning, and setting the contrastive (CLIP) loss weight to zero is how it is done in the original paper. I hope this helps :)
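On (2), the two objectives are combined as a weighted sum, so zeroing the contrastive weight leaves a pure captioning objective. A rough sketch of that combination (a hypothetical function, not the repo's actual `CoCaLoss`; the real implementation lives in `src/open_clip/loss.py`):

```python
import torch
import torch.nn.functional as F

def coca_loss_sketch(image_feats, text_feats, logits, labels, logit_scale,
                     contrastive_weight=0.0, caption_weight=1.0):
    # Contrastive (CLIP-style) term: symmetric cross-entropy over
    # image-text similarity logits.
    sims = logit_scale * image_feats @ text_feats.t()
    targets = torch.arange(sims.shape[0], device=sims.device)
    clip_loss = (F.cross_entropy(sims, targets) +
                 F.cross_entropy(sims.t(), targets)) / 2

    # Captioning term: next-token cross-entropy over decoder logits.
    # logits: (batch, seq, vocab) -> (batch, vocab, seq) for cross_entropy.
    caption_loss = F.cross_entropy(logits.permute(0, 2, 1), labels)

    # --coca-contrastive-loss-weight 0 --coca-caption-loss-weight 1
    # corresponds to contrastive_weight=0, caption_weight=1 here.
    return contrastive_weight * clip_loss + caption_weight * caption_loss
```

So the flag does not remove the contrastive path from the model; it just drops that term from the training objective during the captioning fine-tune.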

PanXiebit commented 1 year ago

@gpucce Thanks!