mala-lab / SIC-CADS

Code Implementation of "Simple Image-level Classification Improves Open-vocabulary Object Detection" (AAAI'24)

Where is the cos_sim between the Learned Text Embeddings and CLIP Text Embeddings used? #2

Open Key-lei opened 10 months ago

Key-lei commented 10 months ago

Thanks for this interesting work.

This paper uses cos_sim to compute the similarity between the Learned Text Embeddings and the CLIP Text Embeddings, but I can't find where this is computed in the code.

if not self.multi_scale:
    pred_ml_scores = self.logit_scale * self.text_embedding(text_features)
else:
    pred_ml_scores = self.logit_scale * self.get_multi_level_scores(text_features)

mlr_loss = self.get_rank_loss(pred_ml_scores, batched_inputs)

There doesn't seem to be a cosine similarity calculation happening here.

frh23333 commented 10 months ago

Hello, both text_features and text_embedding have been normalized before, so the dot product of two vectors is equal to cos_sim.
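A minimal PyTorch sketch of that point (illustrative tensors, not code from this repo): for L2-normalized vectors, the dot product and the cosine similarity coincide.

```python
import torch
import torch.nn.functional as F

# Two arbitrary feature vectors (illustrative only).
a = torch.randn(512)
b = torch.randn(512)

# L2-normalize both vectors.
a_n = F.normalize(a, dim=-1)
b_n = F.normalize(b, dim=-1)

# Dot product of the normalized vectors ...
dot = (a_n * b_n).sum()
# ... equals the cosine similarity of the original vectors.
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()

print(torch.allclose(dot, cos, atol=1e-6))  # True
```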

Key-lei commented 10 months ago

Thank you for your answer, very interesting work!🎉🎉🎉

Key-lei commented 10 months ago

I'm sorry to bother you again, but I still can't understand the cosine similarity calculation. logit_scale is a floating-point number:

if not self.multi_scale:
    pred_ml_scores = self.logit_scale * self.text_embedding(text_features)
else:
    pred_ml_scores = self.logit_scale * self.get_multi_level_scores(text_features)

mlr_loss = self.get_rank_loss(pred_ml_scores, batched_inputs)

The input to text_embedding comes from img_features, text_features = self.extract_global_feature(features), and here there is only a linear layer mapping. The formula in the paper is as follows: [formula image from the paper]. But I can't find clip_text_embedding anywhere. Can you help me find where clip_text_embedding is used in the code?
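For reference, open-vocabulary detectors commonly store the normalized CLIP text embeddings as the weight matrix of the classification linear layer, so the forward pass itself computes the dot product (i.e. the cosine similarity) between the image feature and every class embedding. A minimal sketch of that pattern, with hypothetical names and sizes, not claimed to match this repository line for line:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 80, 512  # hypothetical sizes

# Hypothetical stand-in for CLIP text embeddings (one L2-normalized row per class).
clip_text_embeddings = F.normalize(torch.randn(num_classes, dim), dim=-1)

# A bias-free linear layer whose weight rows are the normalized class embeddings:
# its forward pass is x @ W.T, i.e. the dot product with every class embedding.
text_embedding = nn.Linear(dim, num_classes, bias=False)
with torch.no_grad():
    text_embedding.weight.copy_(clip_text_embeddings)

# A normalized image-level feature and a scalar temperature.
image_feature = F.normalize(torch.randn(1, dim), dim=-1)
logit_scale = 100.0

# Each entry is logit_scale * cos_sim(image_feature, class embedding j).
pred_ml_scores = logit_scale * text_embedding(image_feature)
print(pred_ml_scores.shape)  # torch.Size([1, 80])
```

Under that assumption, pred_ml_scores already contains logit_scale * cos_sim for every category, which would be consistent with the earlier answer that both inputs are normalized beforehand.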