google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Tokenize pretokenized text using spm model trained on raw text #362

Closed: thespectrewithin closed this issue 5 years ago

thespectrewithin commented 5 years ago

Dear authors, I have a language model pretrained on raw text tokenized directly with SPM unigram, and I would like to fine-tune it on some downstream tasks.

My problem is that some downstream tasks provide pretokenized text instead of raw text (e.g. CoNLL 2003 NER), and running SPM unigram on pretokenized text produces different output than running it on raw text, which creates a discrepancy between pretraining time and test time. I'm wondering if there's a way to mitigate this issue?
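
For concreteness, here is a minimal sketch of the mismatch I mean (the model path and sentences are just placeholders, not my actual data):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram.model")  # placeholder: a unigram model trained on raw text

raw = "John lives in New York."
pretok = "John lives in New York ."  # CoNLL-style pretokenized, space-joined

# The two piece sequences can differ (e.g. around the final period),
# because the extra whitespace changes the segmentation context.
print(sp.EncodeAsPieces(raw))
print(sp.EncodeAsPieces(pretok))
```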

Thanks in advance.

taku910 commented 5 years ago

There is no perfect solution, but you can try either of the following:

1). Train spm from the pretokenized text. We need to use the same pretokenizer that CoNLL uses.
2). Heuristically restore the original raw text. For Latin-script languages, we might be able to use the Moses detokenizer (see the sketch below): https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
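
A rough sketch of option 2) in Python, assuming sacremoses (a Python port of the Moses detokenizer linked above) as the restoration step; the model path and tokens are placeholders:

```python
import sentencepiece as spm
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")
sp = spm.SentencePieceProcessor()
sp.Load("unigram.model")  # placeholder: the model pretrained on raw text

# CoNLL-style pretokenized input
conll_tokens = ["John", "lives", "in", "New", "York", "."]

# Heuristically restore raw text, then encode with the original model
restored = md.detokenize(conll_tokens)   # "John lives in New York."
pieces = sp.EncodeAsPieces(restored)
```

This keeps the pretrained model unchanged and only tries to make test-time input look like the raw text seen during pretraining; the detokenization is heuristic, so some discrepancy may remain.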