google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Tokenize pretokenized text using spm model trained on raw text #362

Closed: thespectrewithin closed this issue 5 years ago

thespectrewithin commented 5 years ago

Dear authors, I have a language model pretrained on raw text tokenized directly with SPM unigram, and I would like to fine-tune it on some downstream tasks.

My problem is that some downstream tasks provide pretokenized text instead of raw text (e.g. CoNLL 2003 NER), and running SPM unigram on pretokenized text produces different output than running it on raw text, which creates a discrepancy between pretraining time and test time. I'm wondering if there's a way to mitigate this issue?
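
For concreteness, here is a minimal sketch of the mismatch I mean (the model path and sentences are just placeholders, not my actual data):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram.model")  # placeholder: a unigram model trained on raw text

raw = "John lives in New York."
pretok = "John lives in New York ."  # CoNLL-style pretokenized, space-joined

# The two piece sequences can differ (e.g. around the final period),
# because the extra whitespace changes the segmentation context.
print(sp.EncodeAsPieces(raw))
print(sp.EncodeAsPieces(pretok))
```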

Thanks in advance.

taku910 commented 5 years ago

There is no perfect solution, but you can try either of the following:

1). Train spm from the pretokenized text. We need to use the same pretokenizer that CoNLL uses.
2). Heuristically restore the original raw text. For Latin-script languages, we might be able to use the Moses detokenizer (see the sketch below): https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
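
A rough sketch of option 2) in Python, assuming sacremoses (a Python port of the Moses detokenizer linked above) as the restoration step; the model path and tokens are placeholders:

```python
import sentencepiece as spm
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")
sp = spm.SentencePieceProcessor()
sp.Load("unigram.model")  # placeholder: the model pretrained on raw text

# CoNLL-style pretokenized input
conll_tokens = ["John", "lives", "in", "New", "York", "."]

# Heuristically restore raw text, then encode with the original model
restored = md.detokenize(conll_tokens)   # "John lives in New York."
pieces = sp.EncodeAsPieces(restored)
```

This keeps the pretrained model unchanged and only tries to make test-time input look like the raw text seen during pretraining; the detokenization is heuristic, so some discrepancy may remain.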