google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Tokenize at the word level without spacers or joiners #1001

Closed HURIMOZ closed 4 months ago

HURIMOZ commented 4 months ago

I want to tokenize at the word level without spacers or joiners. Is that possible? I want to use pretrained embeddings, but I can't when the tokens carry spacers and joiners, because my pretrained embeddings do not contain any spacers or joiners. Alternatively, is it possible to keep the spacers and joiners and still use the embeddings effectively?

taku910 commented 4 months ago

How is the "word level" defined here? The definition is not that clear when CJK is included. Specific examples would be helpful.

HURIMOZ commented 4 months ago

Don't worry, I was able to source pretrained subword embeddings. Thank you.