princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Question about how the paper preprocessed the dataset #236

Closed Lim-Sung-Jun closed 1 year ago

Lim-Sung-Jun commented 1 year ago

I am writing to inquire about the preprocessing of the ParaNMT-50M dataset for use with your SimCSE model.

As you may know, the ParaNMT-50M dataset consists of source sentences, paraphrases, and similarity scores. I am particularly interested in adapting your SimCSE model for contrastive learning with the ParaNMT-50M dataset. However, I could not find explicit details in your paper about how you preprocessed the dataset for this purpose.

I would be grateful if you could provide some insights or guidelines on the following:

- How did you create positive and negative samples from the dataset? Specifically, did you use similarity-score thresholds, or any other strategies, to identify positive and negative pairs?
- Were there any specific cleaning, filtering, or augmentation steps you employed to ensure high-quality paraphrases and negative samples?
- Did you experiment with different hard-negative mining strategies for the ParaNMT-50M dataset? If so, could you share your findings or recommendations?

gaotianyu1350 commented 1 year ago

For ParaNMT, we simply used their paraphrase pairs; there was no additional filtering or processing. We also didn't use any hard negatives and simply trained with in-batch negatives.
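
For readers who want to reproduce this setup, a minimal sketch of the conversion is below. It assumes the ParaNMT-50M release is a tab-separated file with a reference sentence, a paraphrase, and a similarity score per line, and writes the pairs to the two-column `sent0,sent1` CSV format that SimCSE's supervised training script consumes; the file paths are placeholders.

```python
import csv

# Hypothetical paths -- adjust to wherever the files actually live.
PARANMT_PATH = "para-nmt-50m.txt"
OUT_PATH = "paranmt_for_simcse.csv"

# Each ParaNMT-50M line is assumed to be tab-separated:
# reference sentence, paraphrase, similarity score. Per the reply above,
# no score-based filtering is applied: every pair is kept as a positive pair.
with open(PARANMT_PATH, encoding="utf-8") as fin, \
     open(OUT_PATH, "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    writer.writerow(["sent0", "sent1"])  # two columns: positive pairs only
    for line in fin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        writer.writerow([fields[0], fields[1]])
```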

Lim-Sung-Jun commented 1 year ago

> For ParaNMT, we simply used their paraphrase pairs; there was no additional filtering or processing. We also didn't use any hard negatives and simply trained with in-batch negatives.

thanks 👍

I have a couple of questions about the other datasets you mentioned in your paper, namely QQP and Flickr30k.

In the case of QQP, I noticed that one source of negative examples was pairs of "related questions", which are not necessarily semantically equivalent despite covering similar topics. Did you use only in-batch negatives, or did you also use hard negatives?

Regarding Flickr30k, I understand that you used two captions of the same image as positive examples, randomly selected from the available five captions. I was wondering whether you relied only on in-batch negatives or generated hard negatives with any specific technique.
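
As context, here is a minimal sketch of the caption-pair construction being described: randomly sampling two of an image's five captions as a positive pair. The `captions_by_image` mapping and the output path are hypothetical; how you load the captions depends on which Flickr30k distribution you have.

```python
import csv
import random

random.seed(42)  # for reproducible pair selection

# Hypothetical structure: each image id maps to its five captions.
captions_by_image = {
    "img_001": [
        "A dog runs across the yard.",
        "A brown dog is running outside.",
        "The dog sprints over the grass.",
        "A dog running in a yard.",
        "A brown dog runs on green grass.",
    ],
    # ... remaining images ...
}

with open("flickr30k_for_simcse.csv", "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    writer.writerow(["sent0", "sent1"])
    for image_id, captions in captions_by_image.items():
        # Two distinct captions of the same image form one positive pair.
        sent0, sent1 = random.sample(captions, 2)
        writer.writerow([sent0, sent1])
```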

gaotianyu1350 commented 1 year ago

Hi,

For all datasets other than the NLI datasets, we only used the positive pairs and simple in-batch negatives.
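
A minimal sketch of the training objective this implies, following the contrastive loss in the SimCSE paper: cross-entropy over a temperature-scaled cosine-similarity matrix, with each row's matching pair on the diagonal and every other item in the batch serving as a negative. The function name is illustrative; the temperature default of 0.05 is the paper's value.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(z0: torch.Tensor, z1: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives only.

    z0, z1: (batch, dim) embeddings of the two sides of each positive pair.
    For row i, (z0[i], z1[i]) is the positive pair; every z1[j] with j != i
    acts as a negative. No explicit hard negatives are involved.
    """
    z0 = F.normalize(z0, dim=-1)
    z1 = F.normalize(z1, dim=-1)
    sim = z0 @ z1.t() / temperature          # (batch, batch) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)  # diagonal = positives
    return F.cross_entropy(sim, labels)
```

One practical consequence of this formulation is that the batch size controls the number of negatives per example, which is why batch size is an important hyperparameter for this recipe.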

Lim-Sung-Jun commented 1 year ago

> Hi,
>
> For all datasets other than the NLI datasets, we only used the positive pairs and simple in-batch negatives.

thank you!

have a good day~

Lim-Sung-Jun commented 1 year ago

I have a question regarding the QQP dataset and the supervised datasets you have created.

In the case of the QQP dataset, did you use the "is_duplicate" column with a value of 1 to indicate positive samples? If so, that gives 149,263 positive pairs, but the paper reports the dataset size as 134k. Could you please explain this discrepancy?

Additionally, if possible, could you provide the three supervised datasets you have created?

gaotianyu1350 commented 1 year ago

Yes, we used the is_duplicate=1 column. We are not sure why there is a discrepancy in the numbers (maybe due to different versions of the dataset).
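
A minimal sketch of that filtering step, assuming the commonly distributed `quora_duplicate_questions.tsv` release with `question1`, `question2`, and `is_duplicate` columns; the file names are placeholders.

```python
import pandas as pd

# Hypothetical path to the QQP release.
qqp = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

# Keep only pairs labeled as duplicates; each becomes one positive pair.
positives = qqp.loc[qqp["is_duplicate"] == 1, ["question1", "question2"]]
positives = positives.dropna()
positives.columns = ["sent0", "sent1"]
positives.to_csv("qqp_for_simcse.csv", index=False)

print(len(positives))  # ~149k with this version of the file
```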
