Double checking accuracy of predictions -- statistics window false positive counts

maxieds commented 5 years ago

@ceheitsch has suggested that we double check the false positive counts in the stats window. The two screenshots below correspond to the relevant windows for C. elegans:

Screenshot at 2019-03-10 23-42-11

annakirkpatrick commented 5 years ago

@maxieds What exactly do you need help with here? (Or, alternatively, what don't you understand?)

As I have explained, I have many other responsibilities and am currently dealing with a chronic pain flare. I do not have the bandwidth (or, at the moment, physical capacity) to write some scripts to parse the structures and recalculate these values.

maxieds commented 5 years ago

@annakirkpatrick This is one of the checking tasks that @ceheitsch flagged for me to enter two Wednesdays ago. It's not urgent. What am I looking to check here?

annakirkpatrick commented 5 years ago

@maxieds I think what you're missing here are just some definitions. Let me know if I'm wrong.

You should think of all of this as set theory. Base pairs are the elements of our sets.

True positives = base pairs in both the reference structure and the predicted structure
False positives = base pairs in the predicted structure but not in the reference structure
False negatives = base pairs in the reference structure but not in the predicted structure

I think Christine wanted you to check that those values are indeed being calculated according to the above definitions. You probably want to select a simpler (shorter) sequence if you want to check these values by hand. You could also just do a little scripting to extract these sets of base pairs from the various CT files, take the appropriate intersections and set differences yourself, and compare to the results that RNAStructViz returns. And if anything doesn't match, of course you then need to look back at the code that generates these bar graphs.
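For reference, the scripting approach described above could look roughly like the sketch below. It is not taken from the RNAStructViz source. The helper names `ct_base_pairs` and `compare` are made up for illustration, and it assumes a single-structure CT file with one header line, where each following row carries the base index in column 1 and its pairing partner in column 5 (0 meaning unpaired).

```python
def ct_base_pairs(ct_text):
    """Return the set of base pairs (i, j), i < j, found in CT-format text.

    Assumption: one structure per file, one header line, and the pairing
    partner in the fifth whitespace-separated column (0 = unpaired).
    """
    lines = [ln for ln in ct_text.splitlines() if ln.strip()]
    pairs = set()
    for line in lines[1:]:  # skip the header line (length and title)
        fields = line.split()
        i, j = int(fields[0]), int(fields[4])
        if j > i:  # j == 0 is unpaired; j > i records each pair only once
            pairs.add((i, j))
    return pairs


def compare(reference, predicted):
    """Apply the set definitions of TP / FP / FN to two sets of base pairs."""
    return {
        "TP": reference & predicted,  # in both structures
        "FP": predicted - reference,  # predicted only
        "FN": reference - predicted,  # reference only
    }
```

The sizes of the three sets returned by `compare` are the numbers to check against the bar graphs in the stats window.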

maxieds commented 5 years ago

@ceheitsch This issue should now be resolved. See the screenshot below for the small mock example I created to test these calculations:

Screenshot at 2019-07-20 02-45-55

These correspond to the two following DB (DOT) files:

> Another C. Elegans sample
UAAAGUUUUCUUUCAGGGAAUUAAAAUUUGAUCAUGGUUU
().((((..(.....)..)))).......(((((.)))))

> Start of C. Elegans
UAAAGUUUUCUUUCAGGGAAUUAAAAUUUGAUCAUGGUUU
().....................(((((.......)))))

I believe that the discrepancy Christine noticed is not an actual computational error (that is, the formulas are not coded incorrectly in the source), but rather a misleading initial count for TP/FP/FN corresponding to the reference structure. Observe that in the True Positives graph above, even though there is only one TP to count in the predicted structure, the total possible count of 11 is still shown in black. This is not the best way to do things, but for the sake of historical users, let's leave it as is. I added a clarifying legend image on the RHS of the StatsWindow to explain to users what these bars and counts actually reflect, following the definitions @annakirkpatrick gave above.
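The counts for this mock example can be re-derived independently with a short script. This is a sketch only, not the code RNAStructViz runs. The function name `dot_bracket_pairs` is hypothetical, it handles only `(`, `)` and `.` (no pseudoknot brackets), and it assumes the first DOT structure above is the reference, since that is the one with 11 base pairs.

```python
def dot_bracket_pairs(structure):
    """Return the set of 1-indexed base pairs (i, j) in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)  # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))  # close the most recent opener
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


# Assumption: the first structure is the reference, the second the prediction.
reference = dot_bracket_pairs("().((((..(.....)..)))).......(((((.)))))")
predicted = dot_bracket_pairs("().....................(((((.......)))))")

tp = reference & predicted  # true positives
fp = predicted - reference  # false positives
fn = reference - predicted  # false negatives

print(len(reference), len(tp), len(fp), len(fn))  # prints: 11 1 5 10
```

Under that assumption, the only shared pair is (1, 2), which agrees with the single TP and the reference total of 11 described above.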

Now closing this long-standing issue since it has been tested and deemed resolved.

gtDMMB / RNAStructViz

Double checking accuracy of predictions -- statistics window false positive counts #32