princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

When training SimCSE on a Chinese corpus, is preprocessing needed? #137

Closed MrRace closed 2 years ago

MrRace commented 2 years ago

When training SimCSE on a Chinese corpus, is text preprocessing needed, for example removing punctuation or emojis? Thanks a lot~

gaotianyu1350 commented 2 years ago

Hi,

Since all the tokenizations are done in the training script, no preprocessing is needed for Chinese corpora. You may want to choose a Chinese pre-trained model or a multilingual pre-trained model for initialization though.
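For reference, only the initialization checkpoint changes — a minimal sketch of the unsupervised training command, assuming the repo's `train.py` and the flag names used in the README (`data/zh_corpus.txt` is a placeholder path, not a file shipped with the repo):

```shell
# Sketch: point --model_name_or_path at a Chinese checkpoint such as
# bert-base-chinese; raw (untokenized) Chinese text goes straight into
# --train_file, one sentence per line.
python train.py \
    --model_name_or_path bert-base-chinese \
    --train_file data/zh_corpus.txt \
    --output_dir result/my-unsup-simcse-bert-base-chinese \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --pooler_type cls \
    --do_train
```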

MrRace commented 2 years ago

@gaotianyu1350 In my own corpus, about 25% of characters are OOV when using the official Chinese BERT-base. These characters are therefore treated as [UNK] — any advice?
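A character-level OOV rate like the 25% figure above can be measured before training. The sketch below is a hypothetical stand-alone version: real usage would load the vocabulary from the pretrained model (e.g. `tokenizer.vocab` in Hugging Face `transformers`), but a toy vocab is used here so the example is self-contained.

```python
def oov_rate(corpus, vocab):
    """Fraction of non-whitespace characters not covered by the vocab."""
    chars = [c for line in corpus for c in line if not c.isspace()]
    if not chars:
        return 0.0
    unk = sum(1 for c in chars if c not in vocab)
    return unk / len(chars)

# Toy stand-in for a real tokenizer vocabulary.
toy_vocab = set("你好世界")
# "火星" falls outside the toy vocab, so 2 of 8 characters are OOV.
corpus = ["你好世界", "你好火星"]
print(oov_rate(corpus, toy_vocab))  # → 0.25
```

A high rate here suggests the checkpoint's vocabulary is a poor match for the corpus, which motivates trying a different pretrained model.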

gaotianyu1350 commented 2 years ago

Maybe try another Chinese pretrained model, for example RoBERTa-based ones.