mozilla / fathom

A framework for extracting meaning from web pages
http://mozilla.github.io/fathom/
Mozilla Public License 2.0

Switch to more durable accuracy metric #101

Open erikrose opened 5 years ago

erikrose commented 5 years ago

The current accuracy-per-tag metric can have its denominator change if the set of candidate tags (the incoming tags selected by dom() calls) changes. This makes for an apples-to-oranges comparison with previous revisions of a ruleset. Let's switch the trainer to emit these metrics instead:

Accuracy: # tags right ÷ # tags in corpus
False positives: # tags falsely positive ÷ # tags in corpus
False negatives: # tags falsely negative ÷ # tags in corpus

I expect reported accuracy to go up since we're no longer assuming we got the pruning-via-dom() part right, so we may have to start showing more decimal places.
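To make the arithmetic concrete, here's a rough sketch in Python (illustrative only, not the trainer's actual code; the counts are made up):

```python
# Illustrative only: the proposed corpus-wide metrics, computed from raw counts.
# `total_tags` is every tag in the corpus, not just the candidates dom() selected.
def corpus_metrics(true_pos, false_pos, false_neg, total_tags):
    true_neg = total_tags - true_pos - false_pos - false_neg
    accuracy = (true_pos + true_neg) / total_tags  # "# tags right ÷ # tags in corpus"
    fp_rate = false_pos / total_tags               # "# tags falsely positive ÷ # tags in corpus"
    fn_rate = false_neg / total_tags               # "# tags falsely negative ÷ # tags in corpus"
    return accuracy, fp_rate, fn_rate

# Made-up counts: 40 targets found, 5 false positives, 3 misses, 10,000 tags total.
print(corpus_metrics(40, 5, 3, 10_000))  # accuracy comes out ~0.9992, hence the extra decimal places
```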

erikrose commented 5 years ago

I can compute precision and recall without knowing the total number of tags (which is the only way to get the true negative count). But if I want the full confusion matrix and the conventional formula for accuracy, I need that total. Right now I'm thinking of outputting a confusion matrix plus accuracy, FN, and FP.
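Roughly, the dependency looks like this (hypothetical helper names, just to show which quantities need the total):

```python
# Precision and recall need only TP, FP, and FN; the full confusion matrix (and
# therefore the conventional accuracy formula) needs the total tag count to
# recover the true-negative cell.
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

def confusion_matrix(true_pos, false_pos, false_neg, total_tags):
    true_neg = total_tags - true_pos - false_pos - false_neg  # only possible with the total
    # Rows: actual positive / actual negative. Columns: predicted positive / predicted negative.
    return [[true_pos, false_neg],
            [false_pos, true_neg]]
```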

erikrose commented 5 years ago

Maybe display precision and recall as well.

erikrose commented 5 years ago

Recall would serve a lot of the same purpose as the current per-page success/failure metric and, further, give us finer resolution when there are multiple positive tags per page. But we still need to figure out what to do about multiple positive tags, any of which must be recognized.
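A toy example of that resolution difference, with made-up numbers:

```python
# Hypothetical page with 3 target tags, of which the ruleset recognizes 2.
targets_on_page = 3
recognized = 2

page_recall = recognized / targets_on_page                    # 0.67: partial credit
page_success_all = 1 if recognized == targets_on_page else 0  # 0: all-or-nothing, no partial credit
# If the intent is that recognizing any one target counts as success,
# the miss disappears entirely -- this is the case that's still murky:
page_success_any = 1 if recognized > 0 else 0                 # 1
```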

danielhertenstein commented 5 years ago

Is one of the goals of this issue to identify a more durable accuracy metric for the trainer to use for optimization?

erikrose commented 5 years ago

No, I'm trying to make it more durable across time, as a ruleset develops: specifically, across changes to the implicit pruning done by the dom() calls. Currently, you might widen your dom() selector to sweep in a target tag you'd mistakenly left out, but this also changes the denominator in your accuracy calculation. Thus, your new accuracy may be rewarded for the new target but, at the same time, penalized for some new non-targets you false-positive on.

That's my reasoning as it was, but I had trouble last week constructing a concrete example scenario. Can you see the hole in my logic?

Don't bother reading this; it's just the partially constructed scenario I haven't finished messing with yet:

<input type=text>
dom('input[type=text]')
Accuracy: 1

<input type=text>
<input type="">
dom('input[type=text]')
Accuracy: 1

dom() selects 3 nodes, 2 are targets
We get 1/2 right

50% accuracy

then bring in 10 more, 8 negative, 2 positive
dom() selects 13, 4 are targets
We get 2/4 right
50% accuracy
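For what it's worth, here's a made-up numeric sketch of the two denominators (not the concrete scenario above, just an illustration of why a corpus-wide denominator stays comparable across dom() changes while the per-candidate one doesn't):

```python
# All numbers are invented; they only illustrate the two denominators.
CORPUS_TAGS = 1_000  # every tag in the corpus, fixed across dom() changes

def per_candidate_accuracy(right_among_candidates, candidates):
    return right_among_candidates / candidates  # denominator moves with dom()

def per_corpus_accuracy(right_among_candidates, candidates, targets_outside_dom):
    # Tags dom() never selects are implicitly called negative: the ones that are
    # really targets count against us; the rest count as true negatives.
    true_neg_outside = CORPUS_TAGS - candidates - targets_outside_dom
    return (right_among_candidates + true_neg_outside) / CORPUS_TAGS

# Before widening dom(): 10 candidates, 9 labeled correctly, 1 target left outside dom().
print(per_candidate_accuracy(9, 10))   # 0.90
print(per_corpus_accuracy(9, 10, 1))   # 0.998

# After widening dom(): 20 candidates (the missed target now included), 17 labeled correctly.
print(per_candidate_accuracy(17, 20))  # 0.85 -- denominator changed, apples to oranges
print(per_corpus_accuracy(17, 20, 0))  # 0.997 -- same denominator, directly comparable
```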

danielhertenstein commented 5 years ago

I think your logic is sound. A confusion matrix along with the associated ratios (FN, FP) and accuracy sounds good. The percent-based metrics are good when iterating on the same ruleset or on changes that won't mess with the denominator, and the raw counts will be good for understanding the changes in those metrics when the denominator does change.

biancadanforth commented 4 years ago

I second the idea of outputting a confusion matrix in fathom-train.

For the Fathom/Smoot articles ruleset, I am using a "paragraph" subtype (i.e. an article is composed of a number of paragraphs). The current output shows me a per-tag accuracy and a per-page accuracy for "paragraph", but it doesn't show me which "paragraph" elements I'm getting wrong (false positive or false negative), so it's hard to know how to improve. While a confusion matrix wouldn't fully answer that question, it'd allow me to see if I am leaning one way or the other.
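For illustration only (hypothetical code, not fathom-train's internals), even simple bookkeeping like this, kept alongside the counts, would answer the "which elements" question directly:

```python
# Hypothetical bookkeeping: keep the misclassified elements themselves so a
# report can list them, not just count them.
false_positives = []  # elements the ruleset called "paragraph" but the sample didn't
false_negatives = []  # labeled "paragraph" elements the ruleset missed

def tally(element, predicted_positive, labeled_positive):
    if predicted_positive and not labeled_positive:
        false_positives.append(element)
    elif labeled_positive and not predicted_positive:
        false_negatives.append(element)
```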

erikrose commented 4 years ago

I'm mildly concerned that introducing the total element count as the denominator for the various metrics may add noise: adding a page with a very large number of elements to the corpus could crowd out the influence of smaller pages on the metrics.
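A back-of-the-envelope illustration of that worry (numbers invented):

```python
# Invented numbers: two pages pooled into one corpus-wide denominator.
small_page_tags = 10
large_page_tags = 1_000
total = small_page_tags + large_page_tags

print(small_page_tags / total)  # ~0.0099: the small page barely moves the metric
print(large_page_tags / total)  # ~0.9901: the large page dominates it
```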

On the other hand, a large number of non-target elements shouldn't really be considered noise, because Fathom is, after all, concerned with every single tag, declaring each one a target or not. We should celebrate true negatives just as much as true positives (in general, from the app-agnostic position of our tools).

Either way, I'm recording this so…

  1. I don't forget to check it out after implementing and
  2. My reasoning is here if I get concerned again