Closed akolonin closed 5 years ago
"Alternative" F1 estimation -- Alternative_F1_for_ALE_ILE%20clustering_2019-04-12.html
Writing down what I said in previous video calls: I don't think this is a good idea for two reasons:
Because everybody seemed to agree with these arguments during the call, I am closing this issue.
Here is the current definition (quote from our AGI-2019 paper): "The first perspective is extent to which reference corpus is parsed at all – it is called “parse-ability” (PA) and it computes the average percentage of words in a sentence recognized by grammar tester: PA = (Σ(ki/ni))/N, where: PA – parse ability, N – total number of sentences, ki – number of words in i-th sentence recognized by the grammar tester, ni – total number of words in i-th sentence. For the second metric we use conventionally defined F-measure or F-score (F1) metric computed on basis of recall and precision, averaged across all sentences in the corpus as Recall = (Σ(ci/ei))/N and Precision = (Σ(ci/li))/N, where ci – number of correctly identified links in i-th sentence, ei – number of expected links and li – number of identified links, including false positives. That is, for recall we take average per-sentence number of overlapping links in test and reference parses divided by the total number of links in reference parses. Respectively, for precision we take the overlapping number divided by the total number of links in test parses."
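To make the quoted definitions concrete, here is a minimal Python sketch of PA, Recall, Precision and F1 as per-sentence averages. The function names and the sample counts are hypothetical, chosen only for illustration; F1 is taken as the usual harmonic mean 2PR/(P+R), and every sentence is assumed to have at least one word, one expected link and one identified link (otherwise the ratios are undefined).

```python
from typing import Sequence


def per_sentence_average(numerators: Sequence[int], denominators: Sequence[int]) -> float:
    """Mean of per-sentence ratios: (sum_i num_i / den_i) / N."""
    ratios = [num / den for num, den in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


def f1_score(precision: float, recall: float) -> float:
    """Conventional F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a two-sentence corpus (not from any real run).
recognized_words = [4, 5]   # k_i: words recognized by the grammar tester
total_words = [5, 5]        # n_i: words in the sentence
correct_links = [3, 2]      # c_i: correctly identified links
expected_links = [4, 4]     # e_i: links in the reference parse
identified_links = [3, 4]   # l_i: links in the test parse

parse_ability = per_sentence_average(recognized_words, total_words)
recall = per_sentence_average(correct_links, expected_links)
precision = per_sentence_average(correct_links, identified_links)
f1 = f1_score(precision, recall)

print(f"PA={parse_ability:.3f} R={recall:.3f} P={precision:.3f} F1={f1:.3f}")
```

Note that each sentence contributes equally to every average here, regardless of how many words or links it contains.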
The problem is that if we have two sentences of 100 and 10 words/links with 90 and 1 matches respectively, the assessment will be the average (90/100 + 1/10)/2 = 0.5, which does not take sentence length into account. However, if we count individual words/links, or equivalently weight the average by the number of them in each sentence, the assessment would be 91/110 = 0.83, which is more "fair".
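The two-sentence example can be checked with a short Python sketch. The names `macro_average` and `micro_average` are my own labels for the per-sentence and pooled variants (borrowed from common evaluation terminology); they are not taken from the linked document.

```python
from typing import Sequence


def macro_average(matches: Sequence[int], totals: Sequence[int]) -> float:
    """Current scheme: average the per-sentence ratios, ignoring sentence length."""
    return sum(m / t for m, t in zip(matches, totals)) / len(totals)


def micro_average(matches: Sequence[int], totals: Sequence[int]) -> float:
    """Pooled scheme: total matches over total words/links across the corpus."""
    return sum(matches) / sum(totals)


# The two sentences from the example: 100 and 10 words/links, 90 and 1 matches.
matches = [90, 1]
totals = [100, 10]

macro = macro_average(matches, totals)  # (90/100 + 1/10) / 2
micro = micro_average(matches, totals)  # 91/110

print(f"per-sentence average: {macro:.2f}, pooled: {micro:.2f}")
```

The pooled value is the same as a per-sentence average weighted by sentence length, which is why the long, well-parsed sentence dominates it.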
The alternative is discussed here: https://docs.google.com/document/d/1YtN0-hvGWHJy1_KzXSfGE8w_m3kU5m0LcMmOw4KHT3Q/edit#heading=h.twoiv52o0tou (see appendices H and J at the bottom).
We should decide whether we want to move to this metric for parse evaluation and, if so, when.
This issue extends #198