More precise PA and F1?

akolonin commented 5 years ago

Here is the current definition (quote from our AGI-2019 paper): "The first perspective is extent to which reference corpus is parsed at all – it is called “parse-ability” (PA) and it computes the average percentage of words in a sentence recognized by grammar tester: PA = (Σ(k_i/n_i))/N, where: PA – parse-ability, N – total number of sentences, k_i – number of words in i-th sentence recognized by the grammar tester, n_i – total number of words in i-th sentence. For the second metric we use conventionally defined F-measure or F-score (F1) metric computed on basis of recall and precision, averaged across all sentences in the corpus as Recall = (Σ(c_i/e_i))/N and Precision = (Σ(c_i/l_i))/N, where c_i – number of correctly identified links in i-th sentence, e_i – number of expected links and l_i – number of identified links, including false positives. That is, for recall we take average per-sentence number of overlapping links in test and reference parses divided by the total number of links in reference parses. Respectively, for precision we take the overlapping number divided by the total number of links in test parses."
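To make the quoted definitions concrete, here is a minimal Python sketch. The function names and the tuple layout are illustrative assumptions, not taken from the project's code. It also assumes that F1 is the harmonic mean of the averaged precision and recall, which is one reading of the quoted text, and that every sentence has at least one word, one expected link and one identified link.

```python
from typing import Sequence, Tuple


def parse_ability(sentences: Sequence[Tuple[int, int]]) -> float:
    """PA = (sum_i k_i / n_i) / N for (k_i, n_i) pairs, one pair per sentence."""
    return sum(k / n for k, n in sentences) / len(sentences)


def averaged_f1(links: Sequence[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) for (c_i, e_i, l_i) triples.

    Recall and precision are averaged per sentence; F1 is then the
    harmonic mean of the two averages (an assumption, see the lead-in).
    """
    n = len(links)
    recall = sum(c / e for c, e, _ in links) / n
    precision = sum(c / l for c, _, l in links) / n
    if precision + recall == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)
```

For example, `parse_ability([(9, 10), (1, 2)])` averages 0.9 and 0.5, so both sentences contribute equally regardless of their length.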

The problem is that if we have two sentences of 100 and 10 words/links with 90 and 1 matches respectively, the assessment will be the average (90/100 + 1/10)/2 = 0.5, which ignores sentence length. However, if we count individual words/links, or equivalently weight each sentence by its number of words/links, the assessment would be 91/110 = 0.83, which is more "fair".
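The two ways of averaging can be compared with a short Python sketch. The function names are hypothetical; the input numbers are the ones from the example above.

```python
def per_sentence_average(pairs):
    """Average of per-sentence ratios: every sentence has equal weight."""
    return sum(matched / total for matched, total in pairs) / len(pairs)


def length_weighted_average(pairs):
    """Ratio of pooled counts: every word/link has equal weight."""
    return sum(m for m, _ in pairs) / sum(t for _, t in pairs)


# (matches, total words/links) for the 100-item and the 10-item sentence
pairs = [(90, 100), (1, 10)]
print(round(per_sentence_average(pairs), 2))     # 0.5
print(round(length_weighted_average(pairs), 2))  # 0.83
```

The pooled version is the per-sentence average with each sentence weighted by its length, so long sentences dominate the score.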

An alternative is discussed here: https://docs.google.com/document/d/1YtN0-hvGWHJy1_KzXSfGE8w_m3kU5m0LcMmOw4KHT3Q/edit#heading=h.twoiv52o0tou (see appendices H and J at the bottom).

We should decide whether to move to this metric for parse evaluation and, if so, when.

This issue extends #198

OlegBaskov commented 5 years ago

"Alternative" F1 estimation -- Alternative_F1_for_ALE_ILE%20clustering_2019-04-12.html

glicerico commented 5 years ago

Writing down what I said in previous video calls: I don't think this is a good idea for two reasons:

1. The example explained above with a 100-word sentence having 90 correct links and a 10-word sentence having only 1 correct link, although possible, is not realistic. Normally, longer sentences will be more difficult to parse correctly. The alternative method proposed gives a lot more weight to long sentences, which is detrimental to quality scores, and also unfair because longer sentences are harder to parse.
2. The community calculates F1 for each sentence and then averages among all sentences, like our current code does. E.g. see the compute_metrics() function of the AdaGram evaluation code: https://github.com/sbos/AdaGram.jl/blob/332d80e448c83734d05715040d632e51fdfc3f58/test-all.py#L61
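
As a rough illustration of the per-sentence convention mentioned in the second point, here is a minimal Python sketch. It is not a reproduction of AdaGram's compute_metrics() or of this project's code; the function names are hypothetical, and treating a sentence with no correct links as F1 = 0 is an assumption.

```python
def sentence_f1(correct: int, expected: int, identified: int) -> float:
    """F1 for one sentence from link counts; 0.0 when nothing is correct."""
    if correct == 0:
        return 0.0
    precision = correct / identified
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


def mean_sentence_f1(links) -> float:
    """Average of per-sentence F1 over (correct, expected, identified) triples."""
    return sum(sentence_f1(c, e, l) for c, e, l in links) / len(links)
```

Each sentence contributes one F1 value to the mean, so a short sentence counts as much as a long one.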

Because everybody seemed to agree with these arguments during the call, I am closing this issue.

singnet / language-learning

More precise PA and F1? #200