princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Is it possible to use SimCSE for Russian Language #163

Closed silkinresearch closed 2 years ago

silkinresearch commented 2 years ago

I tried to use SimCSE with a BERT model from Hugging Face that is suitable for Russian (https://huggingface.co/DeepPavlov/rubert-base-cased).

I wrote the following code, as described in your wiki page Load-SimCSE-Models:

model_SCE = SimCSE("DeepPavlov/rubert-base-cased")
embeddings = model_SCE.encode(list_of_sentences)  # 20,000 sentences

The problem is that the embeddings are computed in less than a second, which seems very strange. Using these embeddings performs only slightly better than a random baseline. Meanwhile, https://www.sbert.net/ can handle the same BERT model and gives satisfactory results.

Did I obtain the embeddings with SimCSE correctly or not?
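As a generic sanity check for embedding quality (not from this thread; a minimal numpy sketch with toy vectors standing in for `model.encode(...)` outputs): a properly trained encoder should give paraphrase pairs a clearly higher cosine similarity than unrelated pairs, while a misloaded or untrained encoder tends to produce near-uniform similarities.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for sentence embeddings; in practice these would come
# from model.encode(...) on real paraphrase / unrelated sentence pairs.
emb_paraphrase_a = np.array([0.9, 0.1, 0.0])
emb_paraphrase_b = np.array([0.8, 0.2, 0.1])
emb_unrelated = np.array([0.0, 0.1, 0.9])

print(cosine_sim(emb_paraphrase_a, emb_paraphrase_b))  # high similarity
print(cosine_sim(emb_paraphrase_a, emb_unrelated))     # low similarity
```

If the high-similarity and low-similarity scores are nearly indistinguishable across many real pairs, the encoder is likely not producing meaningful sentence embeddings.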

gaotianyu1350 commented 2 years ago

Hi,

You cannot directly load an arbitrary pre-trained model through the simcse package (it can only load our released models or models you have trained yourself). You can train rubert following our training guideline and then load the trained checkpoint through the simcse package.
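For reference, the repo's unsupervised training recipe (run_unsup_example.sh in the SimCSE repository) could be adapted along these lines; the rubert model name, the Russian corpus path, and the output directory below are assumptions for illustration, and the exact flags should be checked against the script in the repo:

```shell
# Hypothetical adaptation of SimCSE's run_unsup_example.sh to rubert.
# data/my_russian_corpus.txt is a placeholder: your own Russian corpus,
# one sentence per line.
python train.py \
    --model_name_or_path DeepPavlov/rubert-base-cased \
    --train_file data/my_russian_corpus.txt \
    --output_dir result/my-unsup-simcse-rubert-base-cased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --pooler_type cls \
    --mlp_only_train \
    --do_train \
    --fp16
```

After training finishes, the checkpoint directory can then be loaded the same way as the released models, e.g. `SimCSE("result/my-unsup-simcse-rubert-base-cased")`.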

silkinresearch commented 2 years ago

Ok. Thank you for your response. Everything is clear now 🙂