princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Number of rows in NLI dataset #238

Closed: xlpczv closed this issue 1 year ago

xlpczv commented 1 year ago

Hello, I have a question about the NLI dataset. The paper states that 314k samples from the NLI dataset are used for supervised SimCSE training. However, the dataset provided in your GitHub repository contains only 275,601 rows. What accounts for the difference between the data you provided and the figure reported in the paper?

Additionally, could you provide the other supervised datasets you used in the experiments, such as QQP? That would be very helpful for my research.

Thank you very much for the wonderful GitHub repository.

gaotianyu1350 commented 1 year ago

Hi,

The 314k figure refers to the NLI dataset without hard negatives. When hard negatives are used, some examples are filtered out because they have no corresponding hard negative. The filtered dataset (the one provided here) is the one we used for our final and strongest model.
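To illustrate why the row count drops, here is a minimal sketch of this kind of filtering on toy data. It is not the script used to build the released dataset. It assumes each NLI row is a (premise, hypothesis, label) tuple, that the positive is an entailment hypothesis and the hard negative is a contradiction hypothesis for the same premise, and that the first available contradiction is chosen. The function name `build_triplets` and the example sentences are hypothetical.

```python
from collections import defaultdict


def build_triplets(rows):
    """Turn (premise, hypothesis, label) rows into (premise, positive, hard_negative) triplets.

    Assumption: entailment hypotheses are positives and contradiction
    hypotheses are hard negatives. Premises with no contradiction
    hypothesis are dropped, which is what shrinks the dataset.
    """
    entailments = defaultdict(list)
    contradictions = defaultdict(list)
    for premise, hypothesis, label in rows:
        if label == "entailment":
            entailments[premise].append(hypothesis)
        elif label == "contradiction":
            contradictions[premise].append(hypothesis)

    triplets = []
    for premise, positives in entailments.items():
        negatives = contradictions.get(premise)
        if not negatives:
            # No hard negative available: this example is filtered out.
            continue
        for positive in positives:
            # Assumption: pair every positive with the first contradiction.
            triplets.append((premise, positive, negatives[0]))
    return triplets


# Toy rows; the middle premise has no contradiction hypothesis.
rows = [
    ("A man plays guitar.", "A person makes music.", "entailment"),
    ("A man plays guitar.", "A man is asleep.", "contradiction"),
    ("A dog runs.", "An animal moves.", "entailment"),
    ("A dog runs.", "A dog is outside.", "neutral"),
    ("Kids play soccer.", "Children are playing.", "entailment"),
    ("Kids play soccer.", "Kids are swimming.", "contradiction"),
]

pairs = [row for row in rows if row[2] == "entailment"]
triplets = build_triplets(rows)
print(f"entailment pairs: {len(pairs)}, triplets with hard negatives: {len(triplets)}")
```

On this toy input the pair count is larger than the triplet count, in the same way that the 314k pairs exceed the 275,601 rows with hard negatives.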

Sorry, we no longer have copies of the other datasets we used. However, we didn't do any special processing on those datasets, so you can download the originals from their respective sources.

xlpczv commented 1 year ago

Hi, thank you for the answer. I will follow this advice when generating my own data. Thank you.