rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Add VTextTokenizer #28

Closed rth closed 5 years ago

rth commented 5 years ago

Add a preliminary implementation of VTextTokenizer, which applies manual rules on top of unicode-segmentation to improve tokenization accuracy on the Universal Dependencies corpus.
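To illustrate the general idea of layering manual rules over a base segmenter, here is a minimal sketch in Python. It is not the code from this PR: the regex in `base_segments` is only a stand-in for unicode-segmentation, and the hyphen-merging rule, along with the names `base_segments`, `merge_hyphenated` and `tokenize`, is hypothetical and chosen purely for illustration.

```python
import re

# Stand-in for the base segmenter: NOT unicode-segmentation, just a regex
# that yields runs of word characters and single punctuation marks.
_BASE = re.compile(r"\w+|[^\w\s]")


def base_segments(text):
    """Return (token, start, end) triples from the stand-in base segmenter."""
    return [(m.group(), m.start(), m.end()) for m in _BASE.finditer(text)]


def merge_hyphenated(segments):
    """Hypothetical rule: rejoin `word-word` when no whitespace separates the parts."""
    out = []
    i = 0
    while i < len(segments):
        tok, start, end = segments[i]
        # Keep absorbing "-word" pairs that directly touch the current token.
        while (
            tok[-1].isalnum()
            and i + 2 < len(segments)
            and segments[i + 1][0] == "-"
            and segments[i + 1][1] == end
            and segments[i + 2][1] == segments[i + 1][2]
            and segments[i + 2][0][0].isalnum()
        ):
            tok += "-" + segments[i + 2][0]
            end = segments[i + 2][2]
            i += 2
        out.append((tok, start, end))
        i += 1
    return out


def tokenize(text):
    """Base segmentation followed by the rule pass; returns token strings only."""
    return [tok for tok, _, _ in merge_hyphenated(base_segments(text))]


if __name__ == "__main__":
    print(tokenize("A well-known rule - maybe."))
```

Keeping the rule pass separate from the base segmentation means each rule can be checked on its own against reference tokenizations, such as those in a Universal Dependencies treebank.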