iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License
13 stars 14 forks source link

Re-visit `Leo_matchPAsites_GT.py` #214

Closed ninsch3000 closed 2 years ago

ninsch3000 commented 2 years ago

The README for this script mentions an issue. Check whether this is still open/relevant and either solve it, remove the comment, or update the comment (e.g. "will be deliberately ignored for APAeval", etc.)

UPDATE:

I had a look at the script and I think unmatched sites are actually recorded, so the "issue" mentioned in the README is NOT relevant. However, I think @daneckaw 's thoughts are still very valuable for correct handling of the site matching. Within the code itself there are also two passages marked as "not implemented yet" (things about non-unique matches, if I understand correctly), so it would be great if someone can re-visit this script, implement passages that are necessary (if applicable, else discard) and update the README so that naive readers are not misled.

This issue here is really about making sure that our core module for matching predicted and ground truth sites is implemented correctly. Furthermore we have to indicate whether this can be used in the quantification AND identification benchmarking events/summary workflows, or whether we need to implement additional functionality for either of them.

mzavolan commented 2 years ago

The issue is still relevant. Predictions that do not overlap with known sites should be kept, using perhaps NA for the fields that are not available.

daneckaw commented 2 years ago

I think we have several options (see attached picture):

  1. Ignore all non-matching sites because correct PAS identification is benchmarked in identification challenges
  2. Group all sites for one gene as non-matching sites for this gene (here Pred_2+Pred_3+Pred_5 would be grouped together)
  3. Group all sites between correctly identified sites (Pred_2+Pred_3 would be grouped together; Pred_5 remains separate)
  4. Consider every non-matched site in Pred separately
  5. Group all non-matched sites globally as one

non_matched_PAS

As for benchmarking, in options 2, 3 and 4:

In option 5:

ninsch3000 commented 2 years ago

UPDATE:

I had a look at the script and I think unmatched sites are actually recorded, so the "issue" mentioned in the README is NOT relevant. However, I think @daneckaw 's thoughts are still very valuable for correct handling of the site matching. Within the code itself there are also two passages marked as "not implemented yet" (things about non-unique matches, if I understand correctly), so it would be great if someone can re-visit this script, implement passages that are necessary (if applicable, else discard) and update the README so that naive readers are not misled.

This issue here is really about making sure that our core module for matching predicted and ground truth sites is implemented correctly. Furthermore we have to indicate whether this can be used in the quantification AND identification benchmarking events/summary workflows, or whether we need to implement additional functionality for either of them.

SamBryce-Smith commented 2 years ago

Thanks for the follow up @ninsch3000, apologies for the long delay on mine! Agree on the unmatched sites point, they are definitely reported in the output of match_with_gt. I think we could be more specific in the README though so use-cases are clearer for people writing summary workflows e.g.

bedtools_window only returns matched sites by default. To return unmatched sites only, call bedtools_window with reverse=True

I guess another clarification to make here is that unmatched sites are discarded for the correlation calculation (See line 120 here). That is probably the behaviour we want, otherwise tools with more unmatched sites will have their correlation value strongly down-weighted (as unmatched site expression is set to 0 in match_wtih_gt). As we discussed in last meeting I still think the additional information of the number/fraction of unmatched sites will be useful to provide alongside the correlation value if OpenEBench/the summary workflows can accommodate that (my main argument in favour of this is that not all quantification tools are assessed in the identification challenge).

What seems to be missing is the following:

  1. Splitting the expression of predicted site if it matches multiple ground truth sites (weighted by proximity - I'm not exactly sure how exactly we'd do this) - match_wtih_gt (typo there in the function name we can fix)
  2. Summing the expression of all predicted sites that match a single ground truth site (this should be trivial) - corr_with_gt

Also agreeing with @ninsch3000, I think we should indicate that this code can and should be used for both the identification & quantification benchmarking events/summary workflows so underlying functions used are identical. We could modify match_wtih_gt slightly so it can optionally just return a dataframe of matched / non-matched without calculating expression weights (although this isn't done currently as mentioned above). That way the same function could be used for both identification & quantification summary workflows (with just a change in flag).

daneckaw commented 2 years ago

Unmatched sites in prediction are reported, but if I can see correctly, unmatched sites in ground truth (no predicted sites found for a given ground truth site) are ignored, and I think that's something we'd like to include in correlation calculation. So, for correlation we'd use:

And sites in predicted that don't have a match in ground truth are reported as unmatched and plotted as a separate result.

What seems to be missing is the following:

  1. Splitting the expression of predicted site if it matches multiple ground truth sites (weighted by proximity - I'm not exactly sure how exactly we'd do this) - match_wtih_gt (typo there in the function name we can fix)
  2. Summing the expression of all predicted sites that match a single ground truth site (this should be trivial) - corr_with_gt

For 1, I think the weights might be proportional to the distance from GT site. I wrote a draft for both - I just need to test it a bit more and I can add a commit to the summary workflow PR or a separate PR.

I think this code will work for identification too, but it might be a bit too complicated? In identification we don't need to deal with multiple matched sites, but we need to handle unmatched sites. I wrote a draft PR for a simplified version ages ago here. It has the assumption that PAS names are unique, and I need to change that, but other than that it seemed to work fine as far as I remember. Of course the entire OEB part needs to be updated.

dominikburri commented 2 years ago

Hi @daneckaw , sorry for the late reply. I think creating a new PR (from a new branch) would do the trick. Thanks a lot for dealing with this!