princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Why are so many sentences in your NLI datasets grammatically incorrect? #235

Closed leoozy closed 1 year ago

leoozy commented 1 year ago

Thank you for your excellent work. I am running the supervised setting and found that many sentences in your NLI dataset are grammatically incorrect, for example: ", heritage assets, Federal mission PP&E), uncertain historical cost basis ". SNLI and MNLI are human-labeled datasets, so I would not expect them to contain such sentences. Do you apply any post-processing to these sentences? Thank you!

gaotianyu1350 commented 1 year ago

Hi,

We take the SNLI and MNLI datasets directly, so this might be noise that is already present in the original datasets.
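
For readers who want a rough sense of how common such fragments are in their own copy of the data, here is a minimal sketch. It is not part of the SimCSE codebase: the function name `looks_like_fragment` and both heuristics (leading punctuation, unbalanced parentheses) are assumptions chosen to match the example quoted above, and they will miss other kinds of noise.

```python
def looks_like_fragment(sentence: str) -> bool:
    """Heuristically flag text that is probably a fragment, not a full sentence.

    Hypothetical helper for illustration only; the rules are assumptions.
    """
    text = sentence.strip()
    if not text:
        return True
    # Fragments cut out of a longer passage often start with punctuation.
    if text[0] in ",;:)]}":
        return True
    # A closing parenthesis without a matching opener (or the reverse)
    # suggests the text was truncated.
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return True
    return depth != 0


examples = [
    ", heritage assets, Federal mission PP&E), uncertain historical cost basis ",
    "A man is playing a guitar on stage.",
]
flags = [looks_like_fragment(s) for s in examples]
print(flags)
```

Applying the same function to each sentence column of the training file would give a count of flagged rows, which could then be inspected by hand before deciding whether to filter anything.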

leoozy commented 1 year ago

Thank you for your help!