mengcaopku / LocVTP

[ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization

The top-k operation in fine-grained contrastive loss. #2

Open LukeForeverYoung opened 2 years ago

LukeForeverYoung commented 2 years ago

I notice that the fine-grained contrastive loss takes the average of the embeddings of the words that are top-k related to a clip as the positive word representation, and the other in-batch words as negative samples. However, the implementation in this repository only takes the single word with maximum similarity as the positive sample, and the number of negative words is limited to four. https://github.com/mengcaopku/LocVTP/blob/229bef693e19d1771da39666e65519b20426fb4b/clip/modeling/clip/clip.py#L107-L119 Did I misunderstand the code, or do these differences between the paper and the implementation indeed exist?

mengcaopku commented 2 years ago

There are some discrepancies between this version of the code and the paper. We just released it for reference; you can easily change the maximum selection to the top-K selection. We will update the code asap.
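
As a rough illustration (not the released code), the change could look like the sketch below. `clip_feat` and `word_feat` are assumed to be L2-normalized clip and word embeddings; all names, shapes, and the default `k` are assumptions for the sketch only.

```python
import torch
import torch.nn.functional as F

def topk_positive_words(clip_feat: torch.Tensor,
                        word_feat: torch.Tensor,
                        k: int = 3) -> torch.Tensor:
    """Average the embeddings of the top-k most similar words for each clip.

    clip_feat: (num_clips, dim), L2-normalized clip embeddings.
    word_feat: (num_words, dim), L2-normalized word embeddings.
    Returns:   (num_clips, dim), one positive word representation per clip.
    """
    # Cosine similarity between every clip and every word.
    sim = clip_feat @ word_feat.t()                                # (num_clips, num_words)
    # Indices of the k most similar words per clip (k capped at the word count).
    topk_idx = sim.topk(min(k, word_feat.size(0)), dim=1).indices  # (num_clips, k)
    # Gather those word embeddings and mean-pool them into the positive.
    pos = word_feat[topk_idx].mean(dim=1)                          # (num_clips, dim)
    return F.normalize(pos, dim=-1)
```

Setting `k = 1` recovers the maximum-similarity behavior currently in the repository.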

vateye commented 2 years ago

> There are some discrepancies between this version of the code and the paper. We just released it for reference; you can easily change the maximum selection to the top-K selection. We will update the code asap.

I noticed this loss can only be applied to video clips; does it also apply to images? When training with images (i.e., CC3M), the global alignment indicates that the image should align with the whole text. However, the fine-grained loss enforces the image to align with only some of the words. Is that reasonable? How did you handle this loss when training in the "Webvid2M + CC3M" setting? Thanks.

mengcaopku commented 2 years ago

The fine-grained loss is omitted for the image dataset, and we are developing a more powerful version that may address image-level fine-grained alignment.

LukeForeverYoung commented 2 years ago

> There are some discrepancies between this version of the code and the paper. We just released it for reference; you can easily change the maximum selection to the top-K selection. We will update the code asap.

I'm still confused, since the implementation of FGLoss also differs from Eq. 3 in the paper. https://github.com/mengcaopku/LocVTP/blob/229bef693e19d1771da39666e65519b20426fb4b/clip/modeling/clip/loss.py#L59-L88 In this implementation, it looks like the negative words are sampled from the same sentence that the positive word belongs to. But in the paper, negative words are sampled from the whole batch and are shared across each clip t, which makes more sense. Could you please provide the latest code snippet of the FGLoss implementation to help understand Eq. 3?
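
For clarity, this is a minimal sketch of my reading of Eq. 3 (not the released code): each clip's positive is its top-k-averaged word embedding, and the negative pool is every word in the batch, shared across clips. All names, shapes, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def fine_grained_nce(clip_feat: torch.Tensor,
                     pos_word_feat: torch.Tensor,
                     all_word_feat: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over clips with a shared, batch-level word negative pool.

    clip_feat:     (num_clips, dim)  L2-normalized clip embeddings from the whole batch.
    pos_word_feat: (num_clips, dim)  per-clip positive word representation
                                     (e.g. the top-k mean from the matching caption).
    all_word_feat: (num_words, dim)  every word embedding in the batch, shared by all clips.
    """
    # Positive logit: similarity between each clip and its own positive representation.
    pos_logit = (clip_feat * pos_word_feat).sum(dim=-1, keepdim=True) / temperature  # (num_clips, 1)
    # Negative logits: similarities to every in-batch word, shared across clips.
    # For simplicity, the pool here also contains words from the matching caption.
    neg_logits = clip_feat @ all_word_feat.t() / temperature                          # (num_clips, num_words)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    # The positive sits at column 0 of every row.
    targets = torch.zeros(clip_feat.size(0), dtype=torch.long, device=clip_feat.device)
    return F.cross_entropy(logits, targets)
```

Is this roughly what Eq. 3 intends, or does the official version restrict the negative pool differently?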