The paper describes topk_pooling as follows: "Here, we directly select only the frames with the highest cosine similarity to a given text as a proxy for semantic similarity." I read this as meaning that the similarity scores are computed between the given text and all frames of its paired video. However, the code for topk_pooling appears to compute the similarity between the video frames and every text in the same batch before taking the top k. This looks inconsistent with the paper's description, and it seems unreasonable to me. Is there a mistake in my understanding?
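To make the two readings concrete, here is a small pure-Python sketch. It is not the repository's code: the function names, the plain-list representation, and the choice to average the selected frames are my own assumptions for illustration. `topk_pool_pair` follows the paper's sentence for one text and one video, and `topk_pool_batch` applies the same selection to every (video, text) pair in a batch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topk_frames_for_pair(text, frames, k):
    """Indices of the k frames of ONE video most similar to ONE text."""
    sims = [cosine(text, frame) for frame in frames]
    order = sorted(range(len(frames)), key=lambda i: sims[i], reverse=True)
    return order[:k]


def topk_pool_pair(text, frames, k):
    """Average the top-k frames of one video for one text.

    Averaging is an assumption made for this sketch.
    """
    idx = topk_frames_for_pair(text, frames, k)
    dim = len(frames[0])
    return [sum(frames[i][d] for i in idx) / k for d in range(dim)]


def topk_pool_batch(texts, videos, k):
    """Pool every video with respect to every text in the batch.

    pooled[v][t] is video v pooled for text t. The top k is taken over
    the frame axis separately for each (video, text) pair.
    """
    return [[topk_pool_pair(text, frames, k) for text in texts]
            for frames in videos]


# Hypothetical toy batch: 2 texts, 2 videos with 3 two-dimensional frames each.
texts = [[1.0, 0.0], [0.0, 1.0]]
videos = [
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9]],
]
pooled = topk_pool_batch(texts, videos, k=2)
```

If the top k in the repository is taken over the frame axis for each (video, text) pair, as in `topk_pool_batch`, then the paired case from the paper is just the diagonal `pooled[i][i]`, and the off-diagonal entries would be the unpaired candidates. If the top k is taken over a different axis, the two readings would genuinely differ.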