princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Validating on other language #184

Closed Lukecn1 closed 2 years ago

Lukecn1 commented 2 years ago

Hi there, thank you for the excellent paper and repo!

If I want to train a supervised SimCSE model using a PLM for a language other than English, how can I validate model performance during training, given that the repo defaults to STS and other English-language tasks?

Regards, Lukas

gaotianyu1350 commented 2 years ago

Hi,

If an STS-style dataset exists for the language, you can validate on it. Otherwise, you can reuse the hyperparameters tuned for English as a rough estimate.

Lukecn1 commented 2 years ago

Fair game. As I don't have an STS dataset for the language I need, I'll try to implement validation functionality that doesn't rely on STS data.

I'll make a PR once I have implemented and tested it :)

yiren-jian commented 1 year ago

There are translated versions of STS-Benchmark in other languages: https://huggingface.co/datasets/stsb_multi_mt, https://github.com/PhilipMay/stsb-multi-mt/tree/main/data and https://github.com/PhilipMay/stsb-multi-mt. PS: You need to convert these CSV files to '\t'-separated format and use them in place of the STS-Benchmark files.
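A minimal sketch of that preprocessing step, under two assumptions that should be checked before use: the input CSV has a header row with columns `sentence1`, `sentence2` and `similarity_score`, and the STS-Benchmark loader reads tab-separated lines with the score in the fifth field and the two sentences in the sixth and seventh. The function name `csv_to_sts_tsv` and the placeholder values in the first four fields are hypothetical.

```python
import csv
import io


def csv_to_sts_tsv(src, dst, lang="de"):
    """Copy rows from a CSV file object to a tab-separated file object.

    Assumed input columns: sentence1, sentence2, similarity_score.
    Assumed output layout (7 tab-separated fields):
    genre, file, year, id, score, sentence1, sentence2.
    Returns the number of rows written.
    """
    reader = csv.DictReader(src)
    count = 0
    for i, row in enumerate(reader):
        # Collapse tabs/newlines inside sentences so they cannot break the TSV.
        s1 = " ".join(row["sentence1"].split())
        s2 = " ".join(row["sentence2"].split())
        score = float(row["similarity_score"])
        # The first four fields are placeholders (hypothetical values).
        fields = ["translated", f"stsb-{lang}", "none", str(i),
                  f"{score:.3f}", s1, s2]
        dst.write("\t".join(fields) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO(
        "sentence1,sentence2,similarity_score\n"
        '"Ein Mann, der isst.",Ein Mann isst.,4.2\n'
    )
    out = io.StringIO()
    csv_to_sts_tsv(demo, out)
    print(out.getvalue(), end="")
```

For real files, open the source with `open(path, newline="", encoding="utf-8")` and write the result over the corresponding train/dev/test file of your local STS-Benchmark copy (keep a backup of the English originals).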

Lukecn1 commented 1 year ago

Thanks for sharing :)

I found a way around it and have written a routine that allows validating on custom data during training. This also makes it possible to validate on data other than sentence pairs.
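For readers looking for a starting point, here is a rough sketch of what validation on custom sentence pairs could look like. It is not the routine mentioned above; it only illustrates one common choice, the Spearman correlation between cosine similarities of the embeddings and gold scores. All names (`validate`, `encode`, and so on) are hypothetical, and `encode` stands in for whatever function maps a sentence to a vector in your setup.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def validate(encode, pairs, gold):
    """Score an encoder on (sentence_a, sentence_b) pairs with gold scores."""
    sims = [cosine(encode(a), encode(b)) for a, b in pairs]
    return spearman(sims, gold)
```

Calling `validate` at each evaluation step and keeping the checkpoint with the highest score would mirror how the STS dev set is typically used for model selection.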