I originally wrote it that way to imitate some code from covid_symptom/paper.py that was doing three-way comparisons.
But we often need simple two-way comparisons. And even my three-way comparison may not have been the best way to compare (paper.py might have been doing something a little unique, where it was actually doing two different two-way comparisons and I misunderstood the intent).
So I intend to bring back multi-way comparison at some point, but with more design input from a clinical person. For now, two-way comparison is actively useful, so let's start with that.
In addition, there were some other bug fixes:
Start migrating to a unified vocabulary for "annotations that are being compared" vs the "ground truth to compare against":
- "annotator" is an in-evaluation set of annotations
- "truth" (or "ground truth") is the gold standard comparison
- Other terms like "gold" or "reviewer" will be phased out to reduce confusion
When calculating a confusion matrix, we used to throw away any labels that weren't in the truth set. But that ignored false positives in the annotator set for labels that truth never used. So now we only throw away labels that appear in neither set.
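To make that concrete, here's a minimal sketch of the new filtering, assuming a hypothetical data shape where each side is a dict of note_id -> set of labels (not the tool's real data model):

```python
from typing import Dict, Set

def labels_to_score(annotator: Dict[int, Set[str]], truth: Dict[int, Set[str]]) -> Set[str]:
    """Keep any label that appears in either source.

    The old behavior kept only truth's labels, which silently dropped
    annotator false positives for labels that truth never used.
    """
    found: Set[str] = set()
    for source in (annotator, truth):
        for note_labels in source.values():
            found |= note_labels
    return found
```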
When calculating accuracy, we previously swapped truth and annotator. This PR fixes that (so specificity and sensitivity will be reversed from what earlier versions reported -- to their correct values).
Also when calculating accuracy, we now only deal with the intersection of note ranges between annotator and truth. (Before, we always used the annotator's note range, which might have notes not in truth.)
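A minimal sketch of both accuracy fixes together, using the same hypothetical note-to-labels dicts as above:

```python
from typing import Dict, Set, Tuple

def confusion_counts(
    annotator: Dict[int, Set[str]],
    truth: Dict[int, Set[str]],
    label: str,
) -> Tuple[int, int, int, int]:
    """Count TP/FN/FP/TN for one label, with truth as the reference.

    Swapping the annotator/truth roles swaps FP with FN, which is what
    threw off the sensitivity and specificity numbers before this fix.
    """
    tp = fn = fp = tn = 0
    # Only score notes that both sides actually annotated.
    for note_id in set(annotator) & set(truth):
        found = label in annotator[note_id]
        expected = label in truth[note_id]
        if expected and found:
            tp += 1
        elif expected:
            fn += 1  # truth has it, annotator missed it
        elif found:
            fp += 1  # annotator flagged it, truth disagrees
        else:
            tn += 1
    return tp, fn, fp, tn
```

From those counts, sensitivity is `tp / (tp + fn)` and specificity is `tn / (tn + fp)`.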
When merging two sets of annotations (like from Label Studio and an external set of labels), we no longer error out if Label Studio has notes not in the external set, and we no longer ignore annotations that have more than one label in them (which happens with our generated annotations for external label sets, whoops).
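And a sketch of the more forgiving merge, again with hypothetical shapes (each source is a dict of note_id -> list of annotations, each annotation a dict carrying a "labels" list):

```python
from typing import Dict, List, Set

def merge_sources(
    label_studio: Dict[int, List[dict]],
    external: Dict[int, List[dict]],
) -> Dict[int, Set[str]]:
    """Merge two annotation sources into one note -> labels mapping.

    Notes that only one source knows about are kept rather than raising
    an error, and every label inside a multi-label annotation is counted
    rather than skipping the whole annotation.
    """
    merged: Dict[int, Set[str]] = {}
    for note_id in set(label_studio) | set(external):
        labels: Set[str] = set()
        for annotation in label_studio.get(note_id, []) + external.get(note_id, []):
            labels.update(annotation.get("labels", []))  # multi-label safe
        merged[note_id] = labels
    return merged
```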
Updated unit tests and manually confirmed TN/FP/etc stats are right.