Closed Chenyan722 closed 7 months ago
I am not sure whether I understood your question correctly, but the code replaces the original keyword tokens with the projected text embeddings.
Namely,

caption = gray cat sleeps on a pillow
keyword masked caption = $ sleeps on $
replaced caption = [projected text embedding] sleeps on [projected text embedding]

259 is the token index for $ (a special token).
Yes, in this example the projected text embedding is the output of phi, i.e., pseudo_tokens; the projected text embedding may represent the global semantics of the caption. And in lines 32-34 of encode_with_pseudo_tokens.py, the same text embedding is injected into each position of $. I don't know if my explanation is clear.
pseudo_tokens is injected as the replacement token, and we use the phi output as the pseudo_tokens. Please check the torch.where documentation for details.
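For anyone following along, the replacement step discussed here can be sketched like this. This is a minimal illustration, not the repo's exact code: the function name and signature are mine, and I only assume the shapes mentioned in this thread (text ids [bs, 77], embeddings x [bs, 77, 768], pseudo_tokens [bs, 768], keyword token id 259).

```python
import torch

def inject_pseudo_tokens(x, text, pseudo_tokens, keyword_idx=259):
    """Copy one pseudo token embedding into every $ position of the caption.

    x:             token embeddings of the caption, shape [bs, seq_len, dim]
    text:          token ids, shape [bs, seq_len]
    pseudo_tokens: phi output, shape [bs, dim] (one vector per caption)
    """
    # Boolean mask of the positions holding the special $ token.
    mask = (text == keyword_idx).unsqueeze(-1)              # [bs, seq_len, 1]
    # Broadcast the single pseudo token over the sequence dimension,
    # so every masked position receives the same vector.
    replacement = pseudo_tokens.unsqueeze(1).expand_as(x)   # [bs, seq_len, dim]
    return torch.where(mask, replacement, x)
```

Because the mask is built from token ids, all occurrences of $ in a caption are overwritten with the identical [bs, 768] vector, which is exactly the behavior being asked about.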
Thank you. I know how to use torch.where, and I understand that pseudo_tokens is the output of phi.
However, in lines 32-34 of encode_with_pseudo_tokens.py, the positions that meet the condition text.unsqueeze(-1) == 259 ("keywords") are replaced by pseudo_tokens, right? My concern is that the same pseudo_tokens are injected into different keywords. In your example, keyword masked caption = $ sleeps on $, so the two $ tokens are replaced by the same pseudo_token, since the pseudo_token output from phi has dimension [bs, 768].
Thanks again for your patience!
Using the same pseudo token is not a bug; please check our paper for details.
Also, we found that using multiple tokens severely degrades performance (e.g., 24.66 -> 22.79 in CIRR dev R@1). This experiment will be included in the revision.
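For contrast, the "multiple tokens" variant the authors say they tested would look roughly like the sketch below: one distinct pseudo token per $ occurrence instead of a single shared one. This is purely hypothetical code to clarify the comparison; the function name, signature, and the [bs, n_keywords, dim] layout are my assumptions, not code from the repo.

```python
import torch

def inject_distinct_pseudo_tokens(x, text, pseudo_tokens, keyword_idx=259):
    """Hypothetical multi-token variant: one vector per $ occurrence.

    x:             token embeddings, shape [bs, seq_len, dim]
    text:          token ids, shape [bs, seq_len]
    pseudo_tokens: shape [bs, n_keywords, dim], one vector per $ in each caption
    """
    out = x.clone()
    for b in range(x.size(0)):
        # Positions of the $ token in this caption, in order of appearance.
        positions = (text[b] == keyword_idx).nonzero(as_tuple=True)[0]
        for k, pos in enumerate(positions):
            # The k-th $ gets the k-th pseudo token, rather than a shared one.
            out[b, pos] = pseudo_tokens[b, k]
    return out
```

Per the authors' comment above, this variant performed worse than injecting a single shared pseudo token (24.66 -> 22.79 in CIRR dev R@1).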
I understand your concern. I ran into the same question, since the method inserts the same pseudo_token at every $ in the caption. That is to say, the replaced caption, containing several copies of the same pseudo_token, is fed into the text encoder. However, the results are impressive!
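One intuition for why identical pseudo tokens are not degenerate: the text encoder adds positional embeddings before self-attention, so two $ slots holding the same vector still enter the transformer as different inputs and get contextualized differently. A toy sketch (random tensors standing in for the caption embeddings and CLIP's learned positional embeddings, both my assumptions here):

```python
import torch

torch.manual_seed(0)
seq_len, dim = 5, 8

# Caption embeddings with the SAME pseudo token at positions 0 and 3,
# mimicking "$ sleeps on $" after the torch.where replacement.
x = torch.zeros(seq_len, dim)
pseudo = torch.randn(dim)
x[0] = pseudo
x[3] = pseudo

# Stand-in for the encoder's learned positional embeddings.
pos = torch.randn(seq_len, dim)
h = x + pos  # what actually enters the first attention layer

# Before positions are added, the two $ slots are byte-identical...
assert torch.equal(x[0], x[3])
# ...but afterwards they differ, so attention can treat them distinctly.
assert not torch.allclose(h[0], h[3])
```

So even though phi outputs one [bs, 768] vector per caption, the two injected copies do not collapse into the same representation inside the encoder.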
Great work! Regarding line 32 of encode_with_pseudo_tokens.py, i.e., https://github.com/navervision/lincir/blob/6ffbdebb665878285afcb8f5263a1f8a44937ad4/encode_with_pseudo_tokens.py#L32: why input the text embedding of the caption? Also, the dimension of pseudo_tokens is [bs, 768], while the dimension of x is [bs, 77, 768]. So why inject the embedding of a single pseudo_token into every masked position, using the same embedding each time, before x is fed into the CLIP text encoder? The logic seems hard to follow. Looking forward to your reply!