Better handling of Functions in evaluation

danielhers commented 4 years ago

Functions are currently moved to the root if they are common to the prediction and the gold (see https://github.com/danielhers/ucca/issues/91#issuecomment-631846390), but a better approach would be "soft" matching of yields that allows excluding Functions asymmetrically: when calculating precision, we should allow omitting Functions from the gold; when calculating recall, from the prediction.

nschneid commented 4 years ago

Note that unlike punctuation, the decision of whether a word should be Function or not is nontrivial.

So if we were to simply ignore all Function units like we do Punctuation units, each Function vs. non-Function difference between prediction and reference could result in mismatched spans at several levels.

Hence, when computing precision, it would be better to ignore only those Function units in the reference that would otherwise prevent ancestor units from matching, and vice versa for recall.
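A minimal sketch of that asymmetric check, not the scorer's actual code. It assumes units are flattened to (category, set of token positions), that Function units are leaves, and that F material can be dropped token by token (a simplification if an F unit spans several tokens). The names `Unit`, `soft_match` and `soft_scores` are made up for illustration. Under these assumptions, "may we omit some F's from the other side?" is a set-containment test, so no search over subsets of F's is needed.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Unit(NamedTuple):
    category: str
    tokens: FrozenSet[int]  # positions of the terminals in the unit's yield


def f_tokens(units: List[Unit]) -> FrozenSet[int]:
    """All token positions covered by Function units of one analysis."""
    return frozenset(t for u in units if u.category == "F" for t in u.tokens)


def soft_match(unit: Unit, others: List[Unit], other_f: FrozenSet[int]) -> bool:
    """True if some same-category unit on the other side has the same yield
    after omitting any subset of the other side's F tokens."""
    for cand in others:
        if cand.category != unit.category:
            continue
        required = cand.tokens - other_f  # tokens that may not be omitted
        if required <= unit.tokens <= cand.tokens:
            return True
    return False


def soft_scores(pred: List[Unit], gold: List[Unit]) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (matched, total) for precision and for recall.
    Function units themselves are never scored (assumption)."""
    pred_scored = [u for u in pred if u.category != "F"]
    gold_scored = [u for u in gold if u.category != "F"]
    # Precision: F's may be omitted from the gold.
    p_hits = sum(soft_match(u, gold_scored, f_tokens(gold)) for u in pred_scored)
    # Recall: F's may be omitted from the prediction.
    r_hits = sum(soft_match(u, pred_scored, f_tokens(pred)) for u in gold_scored)
    return (p_hits, len(pred_scored)), (r_hits, len(gold_scored))


# Toy data: token 0 = "the", token 1 = "party".
pred = [Unit("C", frozenset({0, 1})), Unit("E", frozenset({0})), Unit("P", frozenset({1}))]
gold = [Unit("P", frozenset({0, 1})), Unit("F", frozenset({0})), Unit("C", frozenset({1}))]
print(soft_scores(pred, gold))  # ((1, 3), (0, 2))
```

Here the predicted [P party] matches the gold [P the party] once the gold F is omitted, while nothing in the prediction can be omitted to help recall.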

(The algorithm would need to be worked out: e.g. suppose we have

[A [F x] [D [F y] [E [F z] [C c]] ] ]

A and D have identical spans modulo F's, as do E and C. So this could effectively mean there are more unary edges if the other analysis chooses to put x, y, and z elsewhere. Would that make the scorer too lenient under the current policy that any category match is sufficient to count the span as correct? Also, when trying to match large units with many F descendants, do we have a combinatorial search space to decide whether to include each F? Hopefully not a problem in practice as units tend to be nested, so a failure of a unit to match will imply that its parent units with more non-F descendants will not match, assuming proper nesting on both sides.)
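To make the bracketing above concrete, here is a rough sketch that parses it and computes each non-F unit's yield with all F material removed, in a single bottom-up pass. The helper names `parse` and `collect` are hypothetical, and stripping every F unconditionally is a simplifying assumption rather than the proposed policy.

```python
import re


def parse(bracketing):
    """Parse '[A [F x] [C c]]' into nested (category, children) tuples;
    terminals are kept as plain strings."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", bracketing)
    pos = 0

    def unit():
        nonlocal pos
        assert tokens[pos] == "["
        category = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(unit())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (category, children)

    return unit()


def collect(tree, table):
    """Post-order walk: record (category, non-F yield) for every non-F unit
    and return the terminals that stay visible to the parent."""
    category, children = tree
    terminals = []
    for child in children:
        if isinstance(child, str):
            terminals.append(child)
        else:
            terminals.extend(collect(child, table))
    if category == "F":
        return []  # F material is invisible to ancestors
    table.append((category, tuple(terminals)))
    return terminals


table = []
collect(parse("[A [F x] [D [F y] [E [F z] [C c]] ] ]"), table)
print(table)
# [('C', ('c',)), ('E', ('c',)), ('D', ('c',)), ('A', ('c',))]
```

With every F stripped, all four of A, D, E and C reduce to the single terminal c, which is the chain of unary edges the leniency question is about.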

Function units themselves would not count toward the score (i.e., they are excluded from the list of matches even if present in both analyses).

nschneid commented 4 years ago

> Also, when trying to match large units with many F descendants, do we have a combinatorial search space to decide whether to include each F? Hopefully not a problem in practice as units tend to be nested, so a failure of a unit to match will imply that its parent units with more non-F descendants will not match, assuming proper nesting on both sides.

I take that back.

On further reflection, if PRIMARY edges strictly form a tree (and no token can belong to multiple units), then it doesn't matter whether the tree is projective: the matching can be done bottom-up, once for precision and again for recall. I suppose a chart can be formed by reordering the tokens so that the graph being scored is projective. If a unit matches, that can be taken into account when checking whether its parent unit matches (so the work of determining that certain F's SHOULD be included doesn't have to be repeated).

nschneid commented 4 years ago

Suppose we had

SYS: [C [F the] [P party]]
REF: [P [F the] [C party]]

1) If our policy is just to ignore the F's, that amounts to [P [C party]] vs. [C [P party]] (which would be scored fully correct?).

2) Or, the policy could be that F's count as part of the span, we are just flexible about matching that span to another span where the F's are missing or extra. And we only score spans where at least one terminal category is non-F. In which case there are two span matches, "the party" and "party", but neither has the correct category, so the labeled score is 0.

3) Or, sort of a combination between (1) and (2): when considering the predicted unit [C the party], we count a match if EITHER the full span matches with the right category OR the F-omitted span [C party] matches. This would be not quite the same as ignoring all F's, because under (1) both C and P units would be counted correct for both precision and recall, while under (3), only C would count as correct for precision and only P for recall.

Now consider:

SYS: [C [E the] [P party]]
REF: [P [F the] [C party]]

Under policy (2):
- precision would be out of 3 units: neither [C the party] nor [E the] matches with the correct category, but [P party] does if you ignore the F in REF, so 1/3
- recall would be out of 2 units, [P the party] and [C party], neither of which matches with the correct category, so 0.
Under (3), precision would be 0/3, and recall would be 1/2.
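
The policy (3) figures can be checked with a small sketch. It reflects one reading of (3): a unit counts as matched if its full yield, or its yield with its own side's F tokens removed, equals the full yield of a same-category unit in the other analysis, and F units are left out of both numerator and denominator. `Unit` and `hits` are hypothetical names, not part of the ucca package.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Unit(NamedTuple):
    category: str
    tokens: FrozenSet[int]  # token 0 = "the", token 1 = "party"


def hits(units: List[Unit], others: List[Unit]) -> Tuple[int, int]:
    """(matched, total) for `units` scored against `others` under policy (3)
    as read above. Call with (sys, ref) for precision, (ref, sys) for recall."""
    own_f = frozenset(t for u in units if u.category == "F" for t in u.tokens)
    targets = {(o.category, o.tokens) for o in others if o.category != "F"}
    scored = [u for u in units if u.category != "F"]
    matched = sum(
        (u.category, u.tokens) in targets            # full span matches
        or (u.category, u.tokens - own_f) in targets  # F-omitted span matches
        for u in scored
    )
    return matched, len(scored)


ref = [Unit("P", frozenset({0, 1})), Unit("F", frozenset({0})), Unit("C", frozenset({1}))]
# First example: SYS: [C [F the] [P party]]
sys1 = [Unit("C", frozenset({0, 1})), Unit("F", frozenset({0})), Unit("P", frozenset({1}))]
# Second example: SYS: [C [E the] [P party]]
sys2 = [Unit("C", frozenset({0, 1})), Unit("E", frozenset({0})), Unit("P", frozenset({1}))]

print(hits(sys1, ref), hits(ref, sys1))  # (1, 2) (1, 2)
print(hits(sys2, ref), hits(ref, sys2))  # (0, 3) (1, 2)
```

For the first example this gives one correct unit each way (C for precision, P for recall); for the second it reproduces 0/3 and 1/2.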

huji-nlp / ucca

Better handling of Functions in evaluation #94