rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Improve french tokenizer #30

Closed rth closed 5 years ago

rth commented 5 years ago

This adds some rules specific to French tokenization and improves the language-independent tokenizer, following #28.
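To make the idea of a language-specific rule concrete, here is a minimal sketch in Python. It is not the code from this PR: the `tokenize` function, both patterns, and the `lang` parameter are hypothetical names chosen for illustration. It assumes one representative French rule, elision, where a clitic such as "l'" is kept together with its apostrophe and split from the following word, layered on top of a simple language-independent baseline.

```python
import re

# Hypothetical illustration only; these are NOT vtext's actual rules.

# Language-independent baseline: a run of word characters,
# or any single character that is neither a word character nor whitespace.
BASE_TOKEN = re.compile(r"\w+|[^\w\s]")

# Assumed French rule (elision): a word followed by an apostrophe and
# another word character keeps its apostrophe, so "l'avion" gives
# "l'" and "avion" instead of "l", "'", "avion".
FRENCH_TOKEN = re.compile(r"\w+['’](?=\w)|\w+|[^\w\s]")


def tokenize(text, lang=None):
    """Return a list of tokens, using the French pattern when lang == "fr"."""
    pattern = FRENCH_TOKEN if lang == "fr" else BASE_TOKEN
    return pattern.findall(text)


print(tokenize("l'avion décolle."))             # ['l', "'", 'avion', 'décolle', '.']
print(tokenize("l'avion décolle.", lang="fr"))  # ["l'", 'avion', 'décolle', '.']
```

A rule this simple also splits words such as "aujourd'hui", so a real tokenizer would likely need an exception list on top of it.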

The latest UD treebank evaluation results are:

tokenizer      regexp  spacy  unicode-segmentation  vtext
lang treebank                                            
de   GSD        0.818  0.934                 0.953  0.947
en   EWT        0.743  0.975                 0.927  0.968
     GUM        0.807  0.994                 0.980  0.997
fr   Sequoia    0.748  0.946                 0.861  0.943