There is no perfect solution, but you can try either:

1) Train SPM on the pretokenized text. We would need to use the same pretokenizer that CoNLL uses.
2) Heuristically restore the original raw text. For Latin-script languages, we might be able to use Moses's detokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
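To make option 2 concrete, here is a minimal Python sketch, assuming the `sacremoses` package (a Python port of the Moses detokenizer linked above) and a pretrained SentencePiece model at a placeholder path `spm.model`:

```python
import sentencepiece as spm
from sacremoses import MosesDetokenizer

# Placeholder model path; use your own pretrained unigram model.
sp = spm.SentencePieceProcessor(model_file="spm.model")
md = MosesDetokenizer(lang="en")

# A CoNLL-2003 style pretokenized sentence (one token per item).
conll_tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]

# Heuristically restore something close to the original raw text ...
raw_text = md.detokenize(conll_tokens)
# -> "EU rejects German call to boycott British lamb."

# ... then tokenize it the same way the pretraining data was tokenized.
pieces = sp.encode(raw_text, out_type=str)
print(pieces)
```

Note that for NER you would still need to align the resulting pieces back to the original CoNLL tokens (e.g. via character offsets) in order to project the labels.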
Dear authors, I have a language model pretrained on raw text tokenized directly with the SPM unigram model, and I would like to fine-tune it on some downstream tasks.

My problem is that some downstream tasks provide pretokenized text instead of raw text (e.g. CoNLL-2003 NER), and running SPM unigram on pretokenized text produces different output than running it on the raw text, which leads to a discrepancy between pretraining time and test time because the tokenization differs. I'm wondering if there's a way to mitigate this issue?
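For a concrete example of what I mean (the model path below is just a placeholder), the two inputs typically segment differently because the pretokenized version has the period split off as a separate token:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder path

raw_text = "EU rejects German call to boycott British lamb."        # raw text, as in pretraining
pretokenized = "EU rejects German call to boycott British lamb ."   # CoNLL tokens joined with spaces

# The extra space before "." usually changes the pieces near the sentence end.
print(sp.encode(raw_text, out_type=str))
print(sp.encode(pretokenized, out_type=str))
```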
Thanks in advance.