jwijffels / udpipe-spacy-comparison

Compare accuracies of udpipe models and spacy models which can be used for NLP annotation
Mozilla Public License 2.0

What is the `AligndAcc` score? #2

Closed arademaker closed 6 years ago

arademaker commented 6 years ago

I didn't find the definition, where is it defined?

jwijffels commented 6 years ago

It's defined here: http://universaldependencies.org/conll17/evaluation.html It basically aligns the words from the prediction to the ones from the know 'gold' test dataset. Once they are aligned, accuracy metrics are computed.

arademaker commented 6 years ago

Sorry, I can't find this information on the page, can you be more specific? ;-)

jwijffels commented 6 years ago

The evaluation script was taken from here: https://github.com/ufal/conll2017/tree/master/evaluation_script and is put also in the evaluation_script folder in this repository (https://github.com/jwijffels/udpipe-spacy-comparison/blob/master/evaluation_script/conll17_ud_eval.py).

It computes Precision, Recall, F1 score and Accuracy = AligndAcc for the different parts of the annotation, namely tokenisation, UPOS, XPOS, feats, lemmas and the dependency head/relation (UAS/LAS).

To make this concrete, the result of an annotation is shown below (head_token_id in the R output is called the syntactic head, dep_rel is the dependency label).

We are basically comparing 2 files: the gold CoNLL-U file (the real, human-annotated results) and the CoNLL-U file produced by the model.

In order to compare these 2 files, and because the sequence of tokens output by the model might differ from the sequence of tokens in the gold file (the real, human-annotated results), you basically need to align the tokens of the 2 files. That alignment means we look within each sentence for matching tokens. These are put next to each other along with the predicted upos/xpos/feats/dependency head and dependency relation. Based on that aligned data, the precision, recall, F1 and accuracy scores can be computed. The alignment is explained at http://universaldependencies.org/conll17/evaluation.html.

AligndAcc just means: based on the aligned data, what percentage of the values output by the model are the same as in the human-annotated holdout test data. Precision P is the number of correct values divided by the number of system-produced values. Recall R is the number of correct values divided by the number of gold-standard values. F1 score = 2PR / (P+R).
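To make the metrics concrete, below is a minimal Python sketch of the idea. It is not the evaluation script from this repository (which aligns on character spans and also handles multi-word tokens); it simply aligns two token sequences with a longest-common-subsequence style matcher and then scores the UPOS tags of the aligned pairs. The token/tag data is illustrative.

```python
from difflib import SequenceMatcher

def evaluate(gold, system):
    """gold, system: lists of (token, upos) pairs.
    Align the token strings, then score UPOS on the aligned pairs.
    A simplified sketch, not the official conll17_ud_eval.py."""
    gold_tokens = [tok for tok, _ in gold]
    sys_tokens = [tok for tok, _ in system]
    matcher = SequenceMatcher(None, gold_tokens, sys_tokens, autojunk=False)
    aligned = 0   # number of aligned token pairs
    correct = 0   # aligned pairs whose UPOS also matches
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned += 1
            if gold[block.a + k][1] == system[block.b + k][1]:
                correct += 1
    precision = correct / len(system)  # correct / system-produced values
    recall = correct / len(gold)       # correct / gold-standard values
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    aligned_acc = correct / aligned if aligned else 0.0
    return precision, recall, f1, aligned_acc

gold = [("the", "DET"), ("economy", "NOUN"), ("is", "AUX"), ("weak", "ADJ")]
system = [("the", "DET"), ("economy", "NOUN"), ("is", "VERB"), ("weak", "ADJ")]
print(evaluate(gold, system))  # → (0.75, 0.75, 0.75, 0.75): 3 of 4 aligned tags correct
```

If the model had split a token differently (say "econ" + "omy" instead of "economy"), fewer tokens would align, precision and recall would diverge, and AligndAcc would only be computed over the tokens that did align.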

If this is still unclear, the authors of the evaluation script can be found at https://github.com/ufal/conll2017.

library(udpipe)
# download the pre-trained English model and load it from the downloaded file
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = dl$file_model)

# annotate an example sentence
x <- udpipe_annotate(udmodel_en,
                     x = "the economy is weak but the outloook is bright")
as.data.frame(x)

> as.data.frame(x)
  doc_id paragraph_id sentence_id                                       sentence token_id    token    lemma  upos xpos
1   doc1            1           1 the economy is weak but the outloook is bright        1      the      the   DET   DT
2   doc1            1           1 the economy is weak but the outloook is bright        2  economy  economy  NOUN   NN
3   doc1            1           1 the economy is weak but the outloook is bright        3       is       be   AUX  VBZ
4   doc1            1           1 the economy is weak but the outloook is bright        4     weak     weak   ADJ   JJ
5   doc1            1           1 the economy is weak but the outloook is bright        5      but      but CCONJ   CC
6   doc1            1           1 the economy is weak but the outloook is bright        6      the      the   DET   DT
7   doc1            1           1 the economy is weak but the outloook is bright        7 outloook outloook  NOUN   NN
8   doc1            1           1 the economy is weak but the outloook is bright        8       is       be   AUX  VBZ
9   doc1            1           1 the economy is weak but the outloook is bright        9   bright   bright   ADJ   JJ
                                                  feats head_token_id dep_rel deps            misc
1                             Definite=Def|PronType=Art             2     det <NA>            <NA>
2                                           Number=Sing             4   nsubj <NA>            <NA>
3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4     cop <NA>            <NA>
4                                            Degree=Pos             0    root <NA>            <NA>
5                                                  <NA>             9      cc <NA>            <NA>
6                             Definite=Def|PronType=Art             7     det <NA>            <NA>
7                                           Number=Sing             9   nsubj <NA>            <NA>
8 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             9     cop <NA>            <NA>
9                                            Degree=Pos             4    conj <NA> SpacesAfter=\\n