I didn't find the definition. Where is it defined?
It's defined at http://universaldependencies.org/conll17/evaluation.html. It basically aligns the words from the prediction to the ones from the known 'gold' test dataset. Once they are aligned, accuracy metrics are computed.
Sorry, I can't find this information on that page, can you be more specific? ;-)
The evaluation script was taken from https://github.com/ufal/conll2017/tree/master/evaluation_script and is also included in the evaluation_script folder of this repository (https://github.com/jwijffels/udpipe-spacy-comparison/blob/master/evaluation_script/conll17_ud_eval.py).
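In case you want to run that script yourself: it takes the gold CoNLL-U file and the system CoNLL-U file as arguments. A minimal sketch calling it from R (assuming python is on your PATH; gold.conllu and system.conllu are hypothetical file names):

## -v prints the full table with Precision / Recall / F1 / AligndAcc per metric
system("python evaluation_script/conll17_ud_eval.py -v gold.conllu system.conllu")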
It computes Precision, Recall, F1 score and Accuracy = AligndAcc for the different parts of the annotation (a toy UAS/LAS computation is sketched right after this list), namely:
- Tokens
- Sentences
- Words (syntactic words, relevant when the text contains multiword tokens)
- UPOS
- XPOS
- Feats
- AllTags
- Lemmas
- UAS (unlabeled attachment score: is the syntactic head correct)
- LAS (labeled attachment score: are the head and the dependency relation both correct)
- CLAS (LAS restricted to content words)
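To illustrate UAS/LAS, here is a toy computation on made-up, already-aligned head/label vectors (this is just the idea, not what the script literally does):

gold_head <- c(2, 4, 4, 0)
pred_head <- c(2, 4, 4, 0)
gold_rel  <- c("det", "nsubj", "cop", "root")
pred_rel  <- c("det", "nsubj", "cop", "punct")
UAS <- mean(pred_head == gold_head)                        # all 4 heads correct -> 1.0
LAS <- mean(pred_head == gold_head & pred_rel == gold_rel) # 3 of 4 head+label pairs -> 0.75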
To make this concrete, the result of an annotation is shown below (head_token_id in the R output below is the syntactic head, dep_rel is the dependency relation label).
We are basically comparing 2 files: the output of the model and the gold file (real human-annotated results). Because the sequence of tokens output by the model might differ from the sequence of tokens in the gold file, you basically need to align the tokens of the 2 files. That alignment means we look within each sentence for matching tokens. These are put next to each other along with the predicted upos/xpos/feats/dependency head and dependency relation. Based on that, the precision, recall, F1 and accuracy scores can be computed. The alignment is explained at http://universaldependencies.org/conll17/evaluation.html.
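As a toy illustration of what the aligned data looks like (here the tokenisation of both files is identical, so the alignment is simply positional; the script also handles the harder case where the segmentation differs):

gold_token <- c("the", "economy", "is", "weak")
pred_token <- c("the", "economy", "is", "weak")
gold_upos  <- c("DET", "NOUN", "AUX", "ADJ")
pred_upos  <- c("DET", "NOUN", "VERB", "ADJ")
## aligned pairs side by side; the third upos value disagrees
data.frame(gold_token, pred_token, gold_upos, pred_upos)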
AligndAcc just means: based on the aligned data, what percentage of the values output by the model is identical to the human-annotated holdout test data.
Precision P is the number of correct values divided by the number of system-produced values. Recall R is the number of correct values divided by the number of gold-standard values. F1 score = 2 * P * R / (P + R).
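In R terms, with made-up counts (say the system produced 10 values of which 8 are correct, the gold data contains 9 values, and 9 token pairs could be aligned):

correct  <- 8
produced <- 10
gold     <- 9
aligned  <- 9
P  <- correct / produced        # precision  = 0.8
R  <- correct / gold            # recall    ~= 0.889
F1 <- 2 * P * R / (P + R)       # F1        ~= 0.842
AligndAcc <- correct / aligned  # accuracy on the aligned pairs ~= 0.889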
If this is still unclear, the authors of the evaluation script can be found at https://github.com/ufal/conll2017.
library(udpipe)
## download the English model and load it from the path returned by the download
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = dl$file_model)
## annotate an example sentence ('outloook' is kept as-is to match the output below)
x <- udpipe_annotate(udmodel_en,
                     x = "the economy is weak but the outloook is bright")
as.data.frame(x)
> as.data.frame(x)
doc_id paragraph_id sentence_id sentence token_id token lemma upos xpos
1 doc1 1 1 the economy is weak but the outloook is bright 1 the the DET DT
2 doc1 1 1 the economy is weak but the outloook is bright 2 economy economy NOUN NN
3 doc1 1 1 the economy is weak but the outloook is bright 3 is be AUX VBZ
4 doc1 1 1 the economy is weak but the outloook is bright 4 weak weak ADJ JJ
5 doc1 1 1 the economy is weak but the outloook is bright 5 but but CCONJ CC
6 doc1 1 1 the economy is weak but the outloook is bright 6 the the DET DT
7 doc1 1 1 the economy is weak but the outloook is bright 7 outloook outloook NOUN NN
8 doc1 1 1 the economy is weak but the outloook is bright 8 is be AUX VBZ
9 doc1 1 1 the economy is weak but the outloook is bright 9 bright bright ADJ JJ
feats head_token_id dep_rel deps misc
1 Definite=Def|PronType=Art 2 det <NA> <NA>
2 Number=Sing 4 nsubj <NA> <NA>
3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop <NA> <NA>
4 Degree=Pos 0 root <NA> <NA>
5 <NA> 9 cc <NA> <NA>
6 Definite=Def|PronType=Art 7 det <NA> <NA>
7 Number=Sing 9 nsubj <NA> <NA>
8 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 9 cop <NA> <NA>
9 Degree=Pos 4 conj <NA> SpacesAfter=\\n
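To feed such an annotation to the evaluation script, note that the object returned by udpipe_annotate also contains the raw CoNLL-U output in x$conllu, which you can write to disk and compare against a gold file (gold.conllu is again a hypothetical file name):

## dump the annotation in CoNLL-U format and evaluate it against a gold file
cat(x$conllu, file = "system.conllu")
system("python evaluation_script/conll17_ud_eval.py -v gold.conllu system.conllu")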