Better unicode support in tokenization rules

rth / vtext

Simple NLP in Rust with Python bindings

Apache License 2.0


Better unicode support in tokenization rules #31

Open rth opened 5 years ago

rth commented 5 years ago

Currently, the VTextTokenizer first computes the Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language dependent).

These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by a hyphen, but only for the ASCII hyphen-minus "-" (U+002D), not for other Unicode variants such as U+2010 (hyphen) or U+2011 (non-breaking hyphen).
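To make the hyphen case concrete, here is a minimal sketch of what a generalized merge rule could look like. It is not the VTextTokenizer code: the helper names `is_hyphen`, `is_hyphen_token` and `merge_hyphenated` are hypothetical, the input is assumed to be the token list produced by a prior Unicode segmentation step, and the set of code points treated as hyphens is an assumption that would need review.

```rust
/// Hypothetical helper (not part of vtext): true for the ASCII hyphen-minus
/// and a few Unicode hyphen variants. The exact set is an assumption.
fn is_hyphen(c: char) -> bool {
    matches!(
        c,
        '\u{002D}'   // HYPHEN-MINUS (ASCII)
        | '\u{2010}' // HYPHEN
        | '\u{2011}' // NON-BREAKING HYPHEN
        | '\u{FE63}' // SMALL HYPHEN-MINUS
        | '\u{FF0D}' // FULLWIDTH HYPHEN-MINUS
    )
}

/// True if the token consists of exactly one hyphen-like character.
fn is_hyphen_token(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!((chars.next(), chars.next()), (Some(c), None) if is_hyphen(c))
}

/// Merge `word`, `hyphen`, `word` token triples into a single token.
/// `tokens` is assumed to come from an earlier segmentation step.
fn merge_hyphenated(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        let tok = tokens[i];
        // The previous output token must end with a letter or digit.
        let prev_is_word = out
            .last()
            .and_then(|s| s.chars().last())
            .map_or(false, char::is_alphanumeric);
        // The next input token must be non-empty and fully alphanumeric.
        let next_is_word = tokens
            .get(i + 1)
            .map_or(false, |t| !t.is_empty() && t.chars().all(char::is_alphanumeric));

        if is_hyphen_token(tok) && prev_is_word && next_is_word {
            // Safe: prev_is_word implies `out` is non-empty.
            let last = out.last_mut().unwrap();
            last.push_str(tok);
            last.push_str(tokens[i + 1]);
            i += 2;
        } else {
            out.push(tok.to_string());
            i += 1;
        }
    }
    out
}
```

With this sketch, calling `merge_hyphenated(&["well", "\u{2010}", "known"])` yields a single token, exactly as the ASCII `"-"` case does, while a leading or trailing hyphen is left as its own token. Matching on an explicit list of code points keeps the check cheap; whether a broader Unicode property (for example the Dash property) would be more appropriate is an open question.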

rth commented 5 years ago

Using https://github.com/BurntSushi/utf8-ranges would probably be quite useful without sacrificing speed too much.