navervision / lincir

Official PyTorch implementation of LinCIR: Language-only Training of Zero-shot Composed Image Retrieval (CVPR 2024)

Maybe a code bug? #12

Closed Chenyan722 closed 7 months ago

Chenyan722 commented 7 months ago

Great work! In line 32 of encode_with_pseudo_tokens.py, i.e., https://github.com/navervision/lincir/blob/6ffbdebb665878285afcb8f5263a1f8a44937ad4/encode_with_pseudo_tokens.py#L32: why is the text embedding of the caption used as the input? Also, the dimension of pseudo_tokens is [bs, 768], while the dimension of x is [bs, 77, 768]. So why is the single pseudo_tokens embedding injected into every masked position, i.e., the same embedding at each one? And then x is fed into the CLIP text encoder. The logic seems hard to follow. Looking forward to your reply!

SanghyukChun commented 7 months ago

I am not sure whether I understood your question correctly, but the code replaces the original keyword tokens with the projected text embeddings.

Namely:

caption = gray cat sleeps on a pillow
keyword-masked caption = $ sleeps on $
replaced caption = [pseudo token] sleeps on [pseudo token]

259 is the token index for $ (special token)
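For concreteness, here is a minimal sketch of the replacement step being described (shapes as in the question above; variable names and token ids are illustrative, not the exact repo code):

```python
import torch

# Hedged sketch of the replacement step (not the exact repo code): `text` holds
# CLIP token ids, `x` the corresponding token embeddings, and `pseudo_tokens`
# the phi output -- one projected embedding per caption.
bs, seq_len, dim = 2, 77, 768
text = torch.randint(0, 49408, (bs, seq_len))   # placeholder token ids
text[:, [1, 4]] = 259                           # pretend "$" sits at positions 1 and 4
x = torch.randn(bs, seq_len, dim)               # caption token embeddings, [bs, 77, 768]
pseudo_tokens = torch.randn(bs, dim)            # phi output, [bs, 768]

# torch.where broadcasts: wherever the token id is 259, the entire 768-d
# embedding at that position is overwritten by that caption's pseudo token.
mask = (text == 259).unsqueeze(-1)              # [bs, 77, 1]
x_replaced = torch.where(mask, pseudo_tokens.unsqueeze(1), x)
print(x_replaced.shape)                         # torch.Size([2, 77, 768])
```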

Chenyan722 commented 7 months ago

Yes, in this example, the projected text embedding, i.e., pseudo_tokens, is the output of phi; perhaps it represents the global semantics of the caption. And in lines 32-34 of encode_with_pseudo_tokens.py, the same text embedding is injected into every position of $. I don't know if my explanation is clear.

SanghyukChun commented 7 months ago

https://github.com/navervision/lincir/blob/6ffbdebb665878285afcb8f5263a1f8a44937ad4/train_phi.py#L237-L239

pseudo_tokens is injected as the replacement token, and we use the phi output as pseudo_tokens.

Please check the torch.where documentation for details.
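For reference, a tiny standalone torch.where example (made-up values) showing the broadcasting behaviour being relied on here:

```python
import torch

# Tiny torch.where broadcasting demo (illustrative values only): a [3, 1]
# condition and a [1, 2] replacement broadcast against a [3, 2] tensor,
# analogous to the [bs, 77, 1] mask / [bs, 1, 768] pseudo token /
# [bs, 77, 768] embedding case above.
cond = torch.tensor([[True], [False], [True]])  # which rows to replace
repl = torch.tensor([[9., 9.]])                 # the single replacement row
base = torch.arange(6.).reshape(3, 2)           # [[0., 1.], [2., 3.], [4., 5.]]
print(torch.where(cond, repl, base))
# tensor([[9., 9.],
#         [2., 3.],
#         [9., 9.]])
```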

Chenyan722 commented 7 months ago

Thank you. I know how to use torch.where, and I understand that pseudo_tokens is the output of phi.

However, in lines 32-34 of encode_with_pseudo_tokens.py, the positions that satisfy the condition text.unsqueeze(-1) == 259 (the "keywords") are replaced by pseudo_tokens, right? My concern is that the same pseudo_tokens are injected into different keyword positions. In your example, keyword-masked caption = $ sleeps on $, both $ are replaced by the same pseudo_token, since the dimension of the pseudo_token output by phi is [bs, 768].

Thanks again for your patience!
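A small self-contained check of this reading, with made-up token ids (only 259 as "$" matters here):

```python
import torch

# Self-contained check (illustrative token ids, CLIP-style BOS/EOS only for
# flavor): with a [bs, 768] pseudo token, every "$" position (id 259) in a
# caption receives the *same* vector.
text = torch.tensor([[49406, 259, 2, 3, 259, 4, 49407, 0]])  # two "$" positions
x = torch.randn(1, text.shape[1], 768)
pseudo_tokens = torch.randn(1, 768)

x_replaced = torch.where((text == 259).unsqueeze(-1), pseudo_tokens.unsqueeze(1), x)
print(torch.equal(x_replaced[0, 1], x_replaced[0, 4]))  # True: same embedding at both "$" slots
print(torch.equal(x_replaced[0, 1], pseudo_tokens[0]))  # True: it is exactly the phi output
```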

SanghyukChun commented 7 months ago

Using the same pseudo token is not a bug. Please check our paper for details.

Also, we have verified that using multiple tokens severely degrades performance (e.g., 24.66 -> 22.79 CIRR dev R@1). This experiment will be included in the revision.

Pefect96 commented 7 months ago

> Thank you. I know how to use torch.where, and I understand that pseudo_tokens is the output of phi.
>
> However, in lines 32-34 of encode_with_pseudo_tokens.py, the positions that satisfy the condition text.unsqueeze(-1) == 259 (the "keywords") are replaced by pseudo_tokens, right? My concern is that the same pseudo_tokens are injected into different keyword positions. In your example, keyword-masked caption = $ sleeps on $, both $ are replaced by the same pseudo_token, since the dimension of the pseudo_token output by phi is [bs, 768].
>
> Thanks again for your patience!

I understand your concern. I ran into the same question, since the method inserts the same pseudo_token at every $ position of the caption. That is to say, the replaced caption, containing several copies of the same pseudo_token, is fed to the text encoder. Nevertheless, the results are amazing!