princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

search function returning 0 scores at some point #162

Closed TheoSeo93 closed 2 years ago

TheoSeo93 commented 2 years ago

Hi, I tried querying the similarity scores for 20 candidates, and the query result seems to contain 0's for some of them. The first few candidates have scores, but past a certain point no scores are generated. Is there a reason for this?

        self.sim_model = SimCSE("princeton-nlp/sup-simcse-roberta-large")
        self.sim_model.build_index(candidate_sents, use_faiss=None, faiss_fast=False, device='cpu')
        queries = self.sim_model.search(original_sent)

gaotianyu1350 commented 2 years ago

Can you provide the full script including candidate_sents and original_sent that can trigger the error?

TheoSeo93 commented 2 years ago

Hi! Thank you for the response. It was actually not returning scores of 0; rather, some candidates are missing from the query result. Is there a reason some candidates' scores are not generated, and is there a way to make the model return all the scores?



    from simcse import SimCSE

    candidate_words = ['two', 'dead', 'poor', 'three', 'other', 'unfortunate', 'same', 'drunken', 'many', 'sick', 'various', 'old', 'few', 'young', 'miserable', 'four', 'wandering', 'evil', 'strange']
    # Long sentence
    sent_masked = "My imagination worked up to such a height, and brought me into such excess of vapours, or what else I may call it, that I actually supposed myself often upon the spot, at my old castle, behind the trees; saw my old Spaniard, Friday's father, and the [TARGET] sailors I left upon the island; nay, I fancied I talked with them, and looked at them steadily, though I was broad awake, as at persons just before me"
    TARGET = '[TARGET]'

    original_sent = sent_masked.replace(TARGET, 'reprobate') # The target sentence
    candidate_sents = [sent_masked.replace(TARGET, candidate) for candidate in candidate_words] # Paraphrased sentences
    sim_model = SimCSE("princeton-nlp/sup-simcse-roberta-large")
    sim_model.build_index(candidate_sents, use_faiss=None, faiss_fast=False, device='cpu')
    queries = sim_model.search(original_sent)
    scores = {}
    for query in queries:
        scores[candidate_words[candidate_sents.index(query[0])]] = query[1]
    print(scores)
    # {'evil': 0.9965873, 'unfortunate': 0.99653846, 'miserable': 0.99558944, 'poor': 0.9953263, 'wandering': 0.9938461}

gaotianyu1350 commented 2 years ago

Hi,

The function by default will only return the top-5 results. Please refer to the manual here: https://github.com/princeton-nlp/SimCSE/wiki/Search-Sentences-from-Index
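
To make the top-5 default concrete, here is a minimal sketch of how a top-k cutoff combined with a score threshold trims a result list. It is not the library's code: `filter_results` is a hypothetical helper, the scores are made up for illustration, and the default `threshold=0.6` is an assumption (only the top-5 default is stated above).

```python
from typing import List, Tuple


def filter_results(
    scored: List[Tuple[str, float]],
    threshold: float = 0.6,  # assumed default, for illustration only
    top_k: int = 5,          # top-5 default, as stated in the reply above
) -> List[Tuple[str, float]]:
    """Keep entries scoring at least `threshold`, best first, at most `top_k`."""
    kept = [(text, score) for text, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Made-up similarity scores; real values come from the model.
scored = [
    ("evil", 0.99), ("unfortunate", 0.98), ("miserable", 0.97),
    ("poor", 0.96), ("wandering", 0.95), ("sick", 0.94),
    ("old", 0.93), ("two", 0.55),
]

# With the defaults, only five candidates survive.
top5 = filter_results(scored)

# Lowering the threshold and raising top_k returns every candidate.
everything = filter_results(scored, threshold=0.0, top_k=len(scored))

print(len(top5), len(everything))
```

With SimCSE itself, the equivalent would presumably be to pass a larger `top_k` (and a lower `threshold`) to `search`, for example `sim_model.search(original_sent, threshold=0.0, top_k=len(candidate_sents))`. Check the wiki page linked above for the exact parameter names and defaults.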