erikrose opened 5 years ago
I can compute precision and recall without knowing the total number of tags (which is the only way to get the true negative count). But if I want the full confusion matrix and the conventional formula for accuracy, I need that total. Right now I'm thinking of outputting a confusion matrix plus accuracy, FN, and FP.
Maybe display precision and recall as well.
Recall would serve a lot of the same purpose as the current per-page success/failure metric and, further, give us finer resolution when there are multiple positive tags per page. But we'd still need to figure out what to do about pages with multiple positive tags, any one of which must be recognized.
Is one of the goals of this issue to identify a more durable accuracy metric for the trainer to use for optimization?
No, I'm trying to make it more durable across time, as a ruleset develops: specifically, across changes to the implicit pruning done by the dom() calls. Currently, you might widen your dom() parameter to sweep in a target tag you'd mistakenly left out, but this also changes the denominator in your accuracy calculation. Thus, your new accuracy may be rewarded for the new target but simultaneously penalized for new non-targets you false-positive on.
That's my reasoning as it was, but I had trouble last week constructing a concrete example scenario. Can you see the hole in my logic?
Don't bother reading this; it's just the partially constructed scenario I haven't finished messing with yet:
```
<input type=text>
dom('input[type=text]')
Accuracy: 1

<input type=text>
<input type="">
dom('input[type=text]')
Accuracy: 1

dom() selects 3 nodes, 2 are targets
We get 1/2 right
50% accuracy

then bring in 10 more, 8 negative, 2 positive
dom() selects 13, 4 are targets
We get 2/4 right
50% accuracy
```
I think your logic is sound. A confusion matrix along with the associated ratios (FN, FP) and accuracy sounds good. The percent-based metrics are good when iterating on the same ruleset or on changes that won't mess with the denominator, and the raw counts will be good for understanding changes in those metrics when the denominator does change.
I second the idea of outputting a confusion matrix in fathom-train.
For the Fathom/Smoot articles ruleset, I am using a "paragraph" subtype (i.e. an article is composed of a number of paragraphs). The current output shows me a per-tag accuracy and a per-page accuracy for "paragraph", but it doesn't show me which "paragraph" elements I'm getting wrong (false positive or false negative), so it's hard to know how to improve. While a confusion matrix wouldn't fully answer that question, it'd allow me to see if I am leaning one way or the other.
I'm mildly concerned that introducing the total element count as a denominator (for the various metrics) may introduce noise: adding a page with a large number of elements to the corpus could crowd out the influence of smaller pages on the metrics.
On the other hand, (a large number of) non-target elements shouldn't be considered noise, because Fathom is, after all, concerned with every single tag: declaring that it is or is not a target. We should celebrate true negatives just as much as true positives (in general, from the app-agnostic position of our tools).
Either way, I'm recording this so…
The current accuracy-per-tag metric can have its denominator change if the set of candidate tags (the incoming tags selected by dom() calls) changes. This makes for an apples-to-oranges comparison with previous revisions of a ruleset. Let's switch the trainer to emit these metrics instead:

- Accuracy: # tags right ÷ # tags in corpus
- False positives: # tags falsely positive ÷ # tags in corpus
- False negatives: # tags falsely negative ÷ # tags in corpus
I expect reported accuracy to go up, since we're no longer assuming we got the pruning-via-dom() part right, so we may have to start showing more decimal places.