iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License
13 stars 14 forks source link

POC of relative usage quantification summary workflow #399

Closed faricazjj closed 2 years ago

faricazjj commented 2 years ago

@SamBryce-Smith's POC and comments: https://docs.google.com/document/d/1t30SEHZ6RSFYXG7o1ylWCfgYWVbZCA1jj-SkES1M2rw/edit?usp=sharing

Sam edit: previous issue/discussion #278

mrgazzara commented 2 years ago

Also see Yoseph's summary document giving a road map for filtering / metrics here: https://docs.google.com/document/d/1u18BSWB2776d-EXAoft7Je5KmhC1aGMFZJRpJ1rueNI/edit

How do we accomplish this? The first task (will open a separate issue) is to create a filtering util script to define terminal exons and filter the ground truth files down. The output of this filtering will leave the ground truth files with a binary distal and proximal PAS for every terminal exon and some terminal exon (TE) DB/summary file of those terminal exons. Let's leave the logistics of that filtering script to the separate issue. Here let's just discuss what to do with the actual metrics to be calculated with the relative usage quantification summary workflow.

I will, however, assume there is some new information in the resulting ground truth filtered BED files to help in the calculations of the metrics, mainly in the name field. This is open for discussion (@SamBryce-Smith any additional requirements / suggestions would be great), but I think the most useful information here will be a unique terminal exon ID (some combination of Gene ID, Transcript ID(s), and maybe event the final TE coordinates after potentially merging overlapping TEs together) combined with either a dpas or ppas suffix (for distal or proximal polyA sites, respectively)). An example terminal exon ID could be something like: ENSG00000091157|ENST00000254442.8;ENST00000357574.7;ENST00000593058.1|57027015-57029811_dpas

Finally, to accomplish calculation of the metrics discussed in the above document, which I also outline below, it would be very helpful to have similar terminal exon ID matches for the EWF output. I purpose we also match EWF relative usage output to the TE summary file to simply assign terminal exon IDs to them.

So let's assume that util script is done and we've filtered the ground truth and applied similar TE IDs to the EWF output predicted PAS. The quantification summary workflow would then take as input these new, filtered BED files from the ground truth and the EWF output, perform the matching over our defined windows, and calculate/count the following:

Given the above matching we can then simply reuse the same correlations done with absolute TPM quantifications, just with a filtering to correlate relative usage between only one of the two ground truth pas values. Perhaps always dpas (or 1 - ppas if it was not matched)? We also want to further subset these correlations based on the degree of matching:

  1. Where both sites matched correctly
  2. Where at least the proximal site matched correctly
  3. Where at least the distal site matched correctly.

A final metric (similar to tau used in the GB benchmarking paper) would be to sum over the absolute deviation in relative usage (ground truth versus EWF output) for each of the above 3 listed sets.

Thoughts and feedback appreciated, particularly on the proposed ID scheme which I think will facilitate grabbing a lot of these metrics. With this at least we can start thinking about how to implement these metrics as I finish writing up the util scripts.

SamBryce-Smith commented 2 years ago

Practical comment not related to @mrgazzara's points, I've had a go at implementing some of these things myself, mainly because I really wanted to explore the impacts of our filtering criteria with some real data. Was hoping it could inform selecting metrics/filtering criteria. It's all in one notebook so not necessarily flexible/translatable to utils scripts like you suggest, but wanted to note this somewhere before going ahead. I think I'm roughly following your suggested workflow, but I'm keeping everything in memory (using PyRanges package) so haven't had thoughts about how to output in BED/conventional formats yet. Again this should really be copied into a separate issue as you suggest.

I still need to tidy this up so it's defo WIP: https://github.com/iRNA-COSI/APAeval/tree/poc_relative_quant/summary_workflows/relative_quantification/poc

SamBryce-Smith commented 2 years ago

See #413 for further discussion/a formal definition of my proposed 'terminal exon ID'

SamBryce-Smith commented 2 years ago

I've reached a stopping point with the proof-of-concept notebook. It doesn't implement absolutely everything but covers the general approach, so take the conclusions with a pinch of salt.

Can view here - tested the implementation on a single Mayr sample with outputs from DaPars & DaPars2. Plot below is an example of one of Joseph's suggested metrics, a cumulative distribution of the absolute differences of distal site relative usage for each 'match group'. image

It's a bit of a behemoth at this point as I also took it as an opportunity to explore some different decisions etc. so I wouldn't be surprised if you find it tricky to decipher. Any questions just write to me on Slack.

An inexhaustive list of things I missed (but will be implemented in summary workflow):

Key points/take homes

Suggested prioritisation scheme for 'multiple prediction group matches'

These events are quite widespread. Screenshot below is computing the combination of distinct 'match groups' for terminal exons with multiple 'prediction group' matches with DaPars output. image

We have 4 possible match groups, 'both' (both sites match within window), 'distal' (only distal GT site matches within window), 'proximal' (only proximal GT site matches within window) & 'neither' (neither GT site matches within window).

Given the following assumptions:

I implemented the following prioritisation scheme (1 = 'most important'): Both > Distal > Proximal > Neither

  1. If a ground truth TE has matches to both sites, keep all prediction groups that have matches to both sites and discard the rest.
  2. ground truth TE has matches to 'distal' site & groups that match 'neither' site, keep all 'distal' site match prediction groups and discard the rest
  3. prediction group matches classified as 'distal' & 'proximal', arbitrarily prefer the 'distal' matches and discard the rest
    • We are evaluating the deviation from ground truth distal site usage so prefer to compare to predicted sites that match
  4. Ground truth TE has matches classified as 'proximal' at best - keep all prediction groups classed as proximal and discard the rest. Calculate 1 - proximal usage as the estimated distal site usage for prediction and ground truth
  5. Classify as 'neither'

The above prioritisation gives a single 'match group' classification to each ground truth terminal exon, but still leaves many TEs with multiple 'prediction group' matches. In selecting a single 'match' for each terminal exon I decided to be as nice as possible to the tools:

  1. Calculate the absolute difference between predicted and ground truth relative usage for all matches.
  2. For each terminal exon, choose the prediction group with the smallest absolute difference to ground truth as the 'representative match' for that ground truth TE.

Don't have to follow above suggestion, but I think it's fairly simple, 'makes sense' and is as nice to tools as we can be. At very least wanted to document the new things I did to save having to sift through the notebook!

Any questions/comments just ping me on Slack :) good luck with the final push everyone!