google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to tokenize sentence to all characters #465

Closed: ynebula closed this issue 4 years ago

ynebula commented 4 years ago

I am working on machine reading comprehension with XLM-RoBERTa. My dataset is KorQuAD.

I need to tokenize every word into individual characters. For example, in English: This is a dog -> _T h i s _i s _a _d o g

Please let me know how to do this.

taku910 commented 4 years ago

Please use --model_type=char when training the SentencePiece model (spm).
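For readers using the Python bindings, the suggestion above might look roughly like the sketch below. The names `char_model_args`, `train_and_encode`, `corpus.txt`, and `char_model` are placeholders invented for illustration and do not come from the thread. Note that a character model may emit the whitespace marker as its own piece rather than attached to the first letter of each word, so the output may not match the format in the question exactly.

```python
def char_model_args(corpus_path, model_prefix):
    """Build keyword arguments for training a character-level model.

    corpus_path and model_prefix are hypothetical placeholders; point them
    at your own plain-text corpus (one sentence per line) and output name.
    """
    return {
        "input": corpus_path,          # training text file
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "char",          # the setting recommended in the reply
    }


def train_and_encode(corpus_path, model_prefix, text):
    """Train a character-level model, then split `text` into pieces."""
    # Imported lazily so char_model_args works without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**char_model_args(corpus_path, model_prefix))
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    # out_type=str returns the pieces as strings instead of integer ids.
    return sp.encode(text, out_type=str)
```

A call such as `train_and_encode("corpus.txt", "char_model", "This is a dog")` would then return the character-level pieces. Characters that do not occur in the training corpus are mapped to the unknown token, so the corpus should cover the character set you expect at inference time.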