rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Tokenization evaluation script #29

rth closed 5 years ago

rth commented 5 years ago

This adds a script to evaluate tokenization on the Universal Dependencies (UD) treebanks. Current results are as follows:

tokenizer      MosesTokenizer  regexp  spacy  unicode-segmentation  vtext
treebank                                                                 
English-EWT             0.935   0.774  0.982                 0.940  0.969
English-GUM             0.968   0.806  0.992                 0.964  0.990
UD_French-GSD           0.867   0.763  0.947                 0.865  0.866

The table compares the vtext tokenizer against the MosesTokenizer, regexp, spacy, and unicode-segmentation tokenizers on each treebank.
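For readers unfamiliar with this kind of evaluation, here is a minimal sketch of how a tokenizer could be scored against gold tokens. The metric used by the script in this PR is not stated above, so the span-level F1 below is an assumption, and `token_spans` and `tokenization_f1` are hypothetical helper names, not part of vtext. The regexp pattern is only a stand-in for a simple baseline tokenizer.

```python
import re


def token_spans(tokens, text):
    """Map tokens to (start, end) character offsets by scanning text left to right.

    Hypothetical helper for illustration; tokens that do not occur
    verbatim in the remaining text are skipped.
    """
    spans, pos = set(), 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            continue
        end = start + len(tok)
        spans.add((start, end))
        pos = end
    return spans


def tokenization_f1(gold_tokens, pred_tokens, text):
    """Span-level F1: a predicted token counts only if its offsets match a gold token exactly.

    Assumed metric, not necessarily the one used by the evaluation script.
    """
    gold = token_spans(gold_tokens, text)
    pred = token_spans(pred_tokens, text)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "Don't panic, it's fine."
# Gold tokens in the style of a UD treebank (clitics and punctuation split off).
gold = ["Do", "n't", "panic", ",", "it", "'s", "fine", "."]
# Simple regexp baseline: words of two or more word characters.
pred = re.findall(r"\b\w\w+\b", text)

print(pred)
print(tokenization_f1(gold, pred, text))
```

Matching on character offsets rather than on token strings avoids crediting a tokenizer for producing the right string at the wrong position, at the cost of requiring that every token appear verbatim in the raw text.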