princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License
3.36k stars 507 forks source link

The SimCSE object retains a reference to external sentences #145

Closed ngiatsog closed 2 years ago

ngiatsog commented 2 years ago

First of all, congratulations on your work!

I found a subtle issue in the index building method of SimCSE, which can be potentially confusing. If the build_index method of the SimCSE object is fed with a sentence list instead of a file, the SimCSE object retains a reference to the external list. This will cause inconsistencies if the list is modified externally (I had the problem when I shuffled the sentences list externally to the code)

I think that the issue is in line 149 of tool.py https://github.com/princeton-nlp/SimCSE/blob/30b08875a39d0e89d71f17c57bd0dcc18e7c2f15/simcse/tool.py#L149 and can be resolved by storing a copy of sentences.

gaotianyu1350 commented 2 years ago

Hi,

Thanks for reporting this issue! However, storing a copy of the list will cause unnecessary memory overhead. I suggest when using it, be extra careful with the list you pass in.

Best, Tianyu