Closed rth closed 5 years ago
This adds a script to evaluate tokenization on the UD treebanks. Current results are as follows,
| treebank | MosesTokenizer | regexp | spacy | unicode-segmentation | vtext |
|---|---|---|---|---|---|
| English-EWT | 0.935 | 0.774 | 0.982 | 0.940 | 0.969 |
| English-GUM | 0.968 | 0.806 | 0.992 | 0.964 | 0.990 |
| UD_French-GSD | 0.867 | 0.763 | 0.947 | 0.865 | 0.866 |
where we compare the performance of the vtext tokenizer with the spacy, regexp, MosesTokenizer and unicode-segmentation tokenizers.
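For readers who want a feel for how such a comparison could be scored, here is a minimal sketch. It is not the script added in this PR: the metric (an F1 score over token character spans), the baseline regexp pattern, and the helper names `regexp_tokenize`, `token_spans` and `tokenization_f1` are all assumptions made for illustration, and the scores in the table may have been computed differently.

```python
import re


def regexp_tokenize(text):
    # Hypothetical baseline: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def token_spans(tokens):
    """Map tokens to (start, end) offsets in the whitespace-free text."""
    spans = set()
    pos = 0
    for tok in tokens:
        length = len("".join(tok.split()))  # ignore whitespace inside tokens
        if length:
            spans.add((pos, pos + length))
        pos += length
    return spans


def tokenization_f1(gold_tokens, predicted_tokens):
    """F1 over exactly matching token spans (assumed metric, see lead-in)."""
    gold = token_spans(gold_tokens)
    pred = token_spans(predicted_tokens)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold tokens in the style of a UD treebank sentence.
gold = ["I", "do", "n't", "know", "."]
predicted = regexp_tokenize("I don't know.")
print(predicted)
print(round(tokenization_f1(gold, predicted), 3))
```

Comparing character spans rather than token strings keeps the score well defined when a tokenizer splits a word differently from the gold annotation, as with "don't" here.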