Closed Lukecn1 closed 2 years ago
Hi,
If there is a similar STS dataset for your language, you can use that dataset. Otherwise, you can reuse the hyperparameters tuned for English as a rough estimate.
Fair enough. As I don't have an STS dataset for the language I need, I'll try to implement a validation routine that doesn't rely on STS data.
I'll make a PR once I have implemented and tested it :)
There are translated versions of the STS-Benchmark in other languages: https://huggingface.co/datasets/stsb_multi_mt, https://github.com/PhilipMay/stsb-multi-mt/tree/main/data and https://github.com/PhilipMay/stsb-multi-mt. PS: You need to preprocess these CSV files into '\t'-separated files and use them in place of the original STS-Benchmark files.
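A minimal sketch of that preprocessing step, assuming the downloaded files are plain comma-separated CSV and the evaluation script expects tab-separated rows (the function name `csv_to_tsv` and the sample row are hypothetical, not part of the stsb-multi-mt repo):

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Re-emit comma-separated STS rows as tab-separated rows."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample row: sentence1, sentence2, similarity score.
sample = "Ein Mann spielt Gitarre.,Ein Mann spielt ein Instrument.,4.2\n"
print(csv_to_tsv(sample))  # fields now separated by tabs
```

Using the `csv` module rather than a plain `str.replace(",", "\t")` keeps quoted fields that contain commas intact.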
Thanks for sharing :)
I found a way around it and have written a routine that allows validating on custom data during training. This also enables validating on data other than sentence pairs.
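A minimal sketch of what such a language-agnostic validation routine could look like: score held-out sentence pairs with cosine similarity and report the Spearman correlation against gold labels, mirroring how STS evaluation works. Everything here (the `encode` callable, `validate`, the helper names) is a hypothetical stand-in, not the actual routine from the PR; the rank computation also ignores ties for brevity.

```python
import math
from typing import Callable, List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman correlation (simplified: ties get arbitrary ranks)."""
    def ranks(vals: List[float]) -> List[float]:
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def validate(encode: Callable[[str], List[float]],
             pairs: List[Tuple[str, str]],
             gold: List[float]) -> float:
    """Score each pair with the model's encoder and correlate with gold."""
    preds = [cosine(encode(a), encode(b)) for a, b in pairs]
    return spearman(preds, gold)
```

Called periodically during training with the current model's encoder, a higher returned correlation means embeddings that better track human similarity judgments on the custom dev set.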
Hi there, thank you for the excellent paper and repo!
If I want to train a supervised SimCSE model using a PLM for a language other than English, how can I validate the model's performance during training, given that the repo defaults to using STS and other English-language tasks?
Regards, Lukas