google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to deal with corpus with mixed language? #749

Closed. winston0410 closed this 2 years ago

winston0410 commented 2 years ago

I have a corpus that mixes Cantonese and English. It is the only source of data for that specific domain, so I don't have a cleaner source for training.

我去咗Central London食飯。

In the example sentence above, I am only interested in the Cantonese, not the English, and I don't want any English to end up in the trained model. Is it possible to ignore all the English here? Can I replace it with an unknown token, like this?

我去咗<unk>食飯。

taku910 commented 2 years ago

No. Please implement a preprocessor that removes the English part.
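
A minimal sketch of such a preprocessor, assuming that "English" can be approximated as runs of ASCII letters (optionally joined by spaces, apostrophes, or hyphens). The helper name `strip_english` and the regular expression are hypothetical and not part of sentencepiece. Digits and punctuation are left untouched, and Latin-script text that is not English would be removed too.

```python
import re

# Assumption: an "English part" is one or more ASCII-letter words,
# optionally joined by spaces, apostrophes, or hyphens.
_LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ '\-]+[A-Za-z]+)*")


def strip_english(line: str) -> str:
    """Remove ASCII-letter runs from a line (hypothetical helper)."""
    without_english = _LATIN_RUN.sub("", line)
    # Collapse any whitespace left behind by the removed words.
    return re.sub(r"\s+", " ", without_english).strip()


print(strip_english("我去咗Central London食飯。"))  # prints: 我去咗食飯。
```

The cleaned lines could then be written to a new file, one sentence per line, and that file used as the training input (for example via `spm_train --input=...`) in place of the raw corpus.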