Closed: ninsch3000 closed this issue 2 years ago.
The issue is still relevant. Predictions that do not overlap with known sites should be kept, perhaps using NA for the fields that are not available.
I think we have several options (see attached picture):
As for benchmarking, in options 2, 3 and 4:
In option 5:
I had a look at the script and I think unmatched sites are actually recorded, so the "issue" mentioned in the README is NOT relevant. However, I think @daneckaw's thoughts are still very valuable for correct handling of the site matching. Within the code itself there are also two passages marked as "not implemented yet" (concerning non-unique matches, if I understand correctly), so it would be great if someone could revisit this script, implement the passages that are needed (or discard them if they are not), and update the README so that naive readers are not misled.
Thanks for the follow up @ninsch3000, apologies for the long delay on mine! Agree on the unmatched sites point, they are definitely reported in the output of `match_with_gt`. I think we could be more specific in the README though so use-cases are clearer for people writing summary workflows, e.g.: `bedtools_window` only returns matched sites by default; to return unmatched sites only, call `bedtools_window` with `reverse=True`.
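For illustration, a minimal sketch of what such a wrapper could look like, assuming a plain subprocess call to bedtools (the `-w` window and `-v` no-overlap flags are real bedtools options; the function name and arguments here are hypothetical):

```python
import subprocess

def bedtools_window_sketch(pred_bed, gt_bed, window=100, reverse=False):
    # Matched site pairs by default; with reverse=True only the
    # predicted sites with no ground truth match within `window` bp.
    cmd = ["bedtools", "window", "-a", pred_bed, "-b", gt_bed, "-w", str(window)]
    if reverse:
        cmd.append("-v")  # report A entries with no overlap in B
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```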
I guess another clarification to make here is that unmatched sites are discarded for the correlation calculation (see line 120 here). That is probably the behaviour we want, otherwise tools with more unmatched sites would have their correlation value strongly down-weighted (as unmatched site expression is set to 0 in `match_wtih_gt`). As we discussed in the last meeting, I still think the additional information of the number/fraction of unmatched sites would be useful to provide alongside the correlation value, if OpenEBench/the summary workflows can accommodate that (my main argument in favour of this is that not all quantification tools are assessed in the identification challenge).
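Something along these lines would give both the correlation over matched sites and the unmatched fraction to report alongside it; just a sketch, and the column names are assumptions rather than the ones in the actual script:

```python
import pandas as pd
from scipy.stats import pearsonr

def corr_and_unmatched_fraction(df: pd.DataFrame):
    # df: one row per predicted site, with hypothetical columns
    # 'expression' (predicted) and 'gt_expression' (NaN where the
    # site had no ground truth match).
    matched = df.dropna(subset=["gt_expression"])
    r, _ = pearsonr(matched["expression"], matched["gt_expression"])
    # fraction of predicted sites without a ground truth match,
    # reported alongside r instead of distorting it
    frac_unmatched = 1.0 - len(matched) / len(df)
    return r, frac_unmatched
```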
What seems to be missing is the following:
- Splitting the expression of a predicted site if it matches multiple ground truth sites (weighted by proximity) - `match_wtih_gt` (typo there in the function name we can fix)
- Summing the expression of all predicted sites that match a single ground truth site - `corr_with_gt`
Also agreeing with @ninsch3000, I think we should indicate that this code can and should be used for both the identification & quantification benchmarking events/summary workflows so the underlying functions used are identical. We could modify `match_wtih_gt` slightly so it can optionally just return a dataframe of matched/non-matched sites without calculating expression weights (although this isn't done currently, as mentioned above). That way the same function could be used for both identification & quantification summary workflows (with just a change in flag).
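A rough sketch of what that flag could look like; the simple pandas window matching stands in for the real bedtools-based logic, and the column names are assumptions:

```python
import pandas as pd

def match_with_gt(pred, gt, window=100, compute_weights=True):
    # pred/gt: DataFrames with 'chrom', 'pos' and a unique 'name';
    # pred additionally has 'expression'.
    cand = pred.merge(gt, on="chrom", suffixes=("", "_gt"))
    cand = cand[(cand["pos"] - cand["pos_gt"]).abs() <= window]
    if not compute_weights:
        # identification workflow: matched/unmatched labels are enough
        labels = pred.copy()
        labels["matched"] = labels["name"].isin(cand["name"])
        return labels
    # quantification workflow: additionally compute expression weights
    # (placeholder here; see the proximity-weighting sketch below)
    return cand.assign(weight=1.0)
```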
Unmatched sites in prediction are reported, but if I see correctly, unmatched sites in ground truth (no predicted sites found for a given ground truth site) are ignored, and I think that's something we'd like to include in the correlation calculation. So, for correlation we'd use the matched sites plus the ground truth sites that have no matching prediction. And sites in predicted that don't have a match in ground truth are reported as unmatched and plotted as a separate result.
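A minimal sketch of that, assuming predicted expression has already been summed per matched ground truth site, and that ground truth sites without a prediction enter the correlation with predicted expression 0 (column names are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr

def corr_including_unmatched_gt(matched, gt):
    # matched: one row per matched ground truth site, with columns
    # 'gt_name' and 'expression' (summed predicted expression);
    # gt: all ground truth sites, with 'gt_name' and 'gt_expression'.
    merged = gt.merge(matched[["gt_name", "expression"]],
                      on="gt_name", how="left")
    # ground truth sites with no prediction get predicted expression 0
    merged["expression"] = merged["expression"].fillna(0)
    r, _ = pearsonr(merged["expression"], merged["gt_expression"])
    return r
```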
What seems to be missing is the following:
- Splitting the expression of a predicted site if it matches multiple ground truth sites (weighted by proximity - I'm not exactly sure how we'd do this) - `match_wtih_gt` (typo there in the function name we can fix)
- Summing the expression of all predicted sites that match a single ground truth site (this should be trivial) - `corr_with_gt`
For 1, I think the weights might be proportional to the distance from the GT site. I wrote a draft for both - I just need to test it a bit more and I can add a commit to the summary workflow PR or a separate PR.
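For reference, a pandas sketch of both items, assuming the weights should decrease with distance (closer ground truth sites get a larger share of the expression); all column names are hypothetical:

```python
import pandas as pd

def split_and_aggregate(cand: pd.DataFrame) -> pd.DataFrame:
    # cand: one row per (predicted site, ground truth site) match,
    # with hypothetical columns 'pred_name', 'gt_name', 'expression'
    # (predicted) and 'dist' (bp between the two sites).
    # 1) split each predicted site's expression across its ground
    #    truth matches, closer matches getting a larger share
    w = 1.0 / (1.0 + cand["dist"])
    cand = cand.assign(weight=w / w.groupby(cand["pred_name"]).transform("sum"))
    cand = cand.assign(split_expression=cand["expression"] * cand["weight"])
    # 2) sum the (split) expression of all predicted sites matching
    #    a single ground truth site
    return cand.groupby("gt_name", as_index=False)["split_expression"].sum()
```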
I think this code will work for identification too, but it might be a bit too complicated? In identification we don't need to deal with multiple matched sites, but we need to handle unmatched sites. I wrote a draft PR for a simplified version ages ago here. It assumes that PAS names are unique, and I need to change that, but other than that it seemed to work fine as far as I remember. Of course the entire OEB part needs to be updated.
Hi @daneckaw, sorry for the late reply. I think creating a new PR (from a new branch) would do the trick. Thanks a lot for dealing with this!
The README for this script mentions an issue. Check whether this is still open/relevant and either solve it, remove the comment, or update the comment (e.g. "will be deliberately ignored for APAeval", etc.)
This issue here is really about making sure that our core module for matching predicted and ground truth sites is implemented correctly. Furthermore, we have to indicate whether this can be used in the quantification AND identification benchmarking events/summary workflows, or whether we need to implement additional functionality for either of them.