POC of relative usage quantification summary workflow

faricazjj commented 2 years ago

@SamBryce-Smith's POC and comments: https://docs.google.com/document/d/1t30SEHZ6RSFYXG7o1ylWCfgYWVbZCA1jj-SkES1M2rw/edit?usp=sharing

Sam edit: previous issue/discussion #278

mrgazzara commented 2 years ago

Also see Yoseph's summary document giving a road map for filtering / metrics here: https://docs.google.com/document/d/1u18BSWB2776d-EXAoft7Je5KmhC1aGMFZJRpJ1rueNI/edit

How do we accomplish this? The first task (will open a separate issue) is to create a filtering util script to define terminal exons and filter the ground truth files down. The output of this filtering will leave the ground truth files with a binary distal and proximal PAS for every terminal exon and some terminal exon (TE) DB/summary file of those terminal exons. Let's leave the logistics of that filtering script to the separate issue. Here let's just discuss what to do with the actual metrics to be calculated with the relative usage quantification summary workflow.

I will, however, assume there is some new information in the resulting ground truth filtered BED files to help in the calculations of the metrics, mainly in the name field. This is open for discussion (@SamBryce-Smith any additional requirements / suggestions would be great), but I think the most useful information here will be a unique terminal exon ID (some combination of Gene ID, Transcript ID(s), and maybe event the final TE coordinates after potentially merging overlapping TEs together) combined with either a dpas or ppas suffix (for distal or proximal polyA sites, respectively)). An example terminal exon ID could be something like: ENSG00000091157|ENST00000254442.8;ENST00000357574.7;ENST00000593058.1|57027015-57029811_dpas

Finally, to accomplish calculation of the metrics discussed in the above document, which I also outline below, it would be very helpful to have similar terminal exon ID matches for the EWF output. I purpose we also match EWF relative usage output to the TE summary file to simply assign terminal exon IDs to them.

So let's assume that util script is done and we've filtered the ground truth and applied similar TE IDs to the EWF output predicted PAS. The quantification summary workflow would then take as input these new, filtered BED files from the ground truth and the EWF output, perform the matching over our defined windows, and calculate/count the following:

a) The number of unique TE in the filtered ground truth (already filtered for APA, and expression by util script)
b) The number of unique TE in the altered EWF output (matched to TE database by util script to update name field)
c) The number/fraction of unique TE in the altered EWF output not present in the filtered ground truth file (putative False Positives)
d) The number/fraction of filtered ground truth TEs from (a) that had no matches in the EWF output (False Negatives)
e) The number/fraction of filtered ground truth TEs from (a) that were matched in the altered EWF output in (b) for the given window.
- i) neither dpas nor ppas match (i.e. APA was detected, just PAS locations are off).
- ii) either dpas or ppas match
- iiI) both dpas and ppas match
f) The Jaccard index at the TE level for the ground truth and EWF outputs ((e) over (c) + (d) + (e))
g) Fraction of total ground truth ppas and dpas sites that had multiple predicted sites matched to it from the EWF output.

Given the above matching we can then simply reuse the same correlations done with absolute TPM quantifications, just with a filtering to correlate relative usage between only one of the two ground truth pas values. Perhaps always dpas (or 1 - ppas if it was not matched)? We also want to further subset these correlations based on the degree of matching:

Where both sites matched correctly
Where at least the proximal site matched correctly
Where at least the distal site matched correctly.

A final metric (similar to tau used in the GB benchmarking paper) would be to sum over the absolute deviation in relative usage (ground truth versus EWF output) for each of the above 3 listed sets.

Thoughts and feedback appreciated, particularly on the proposed ID scheme which I think will facilitate grabbing a lot of these metrics. With this at least we can start thinking about how to implement these metrics as I finish writing up the util scripts.

SamBryce-Smith commented 2 years ago

Practical comment not related to @mrgazzara's points, I've had a go at implementing some of these things myself, mainly because I really wanted to explore the impacts of our filtering criteria with some real data. Was hoping it could inform selecting metrics/filtering criteria. It's all in one notebook so not necessarily flexible/translatable to utils scripts like you suggest, but wanted to note this somewhere before going ahead. I think I'm roughly following your suggested workflow, but I'm keeping everything in memory (using PyRanges package) so haven't had thoughts about how to output in BED/conventional formats yet. Again this should really be copied into a separate issue as you suggest.

I still need to tidy this up so it's defo WIP: https://github.com/iRNA-COSI/APAeval/tree/poc_relative_quant/summary_workflows/relative_quantification/poc

SamBryce-Smith commented 2 years ago

See #413 for further discussion/a formal definition of my proposed 'terminal exon ID'

SamBryce-Smith commented 2 years ago

I've reached a stopping point with the proof-of-concept notebook. It doesn't implement absolutely everything but covers the general approach, so take the conclusions with a pinch of salt.

Can view here - tested the implementation on a single Mayr sample with outputs from DaPars & DaPars2. Plot below is an example of one of Joseph's suggested metrics, a cumulative distribution of the absolute differences of distal site relative usage for each 'match group'.

It's a bit of a behemoth at this point as I also took it as an opportunity to explore some different decisions etc. so I wouldn't be surprised if you find it tricky to decipher. Any questions just write to me on Slack.

An inexhaustive list of things I missed (but will be implemented in summary workflow):

Checking that selected ground truth sites are not within window size distance of one another (implemented in filterPAS module)
Checking that minor site of selected pair has at least x % of total expression (e.g. 5 %) (implemented in filterPAS module)
Jaccard index metrics
Correlation of relative usage

Key points/take homes

A 'prediction group' identifier is critical for tools like DaPars that output predictions transcript by transcript, and as such can have multiple predicted matches (and different predicted relative usages) to the same site.
- Typically the transcript ID is sufficient. I have proposed to include this in the 'terminal exon ID' (#413). We need to double check that execution workflow outputs include enough information to group predictions (DaPars & DaPars2 contain transcript ID, QAPA is not really affected by this issue, PAQR and APAtrap to check?)
Need a prioritisation scheme to select a 'representative match' for ground truth terminal exons that have matches to multiple 'prediction groups'.

Suggested prioritisation scheme for 'multiple prediction group matches'

These events are quite widespread. Screenshot below is computing the combination of distinct 'match groups' for terminal exons with multiple 'prediction group' matches with DaPars output.

We have 4 possible match groups, 'both' (both sites match within window), 'distal' (only distal GT site matches within window), 'proximal' (only proximal GT site matches within window) & 'neither' (neither GT site matches within window).

Given the following assumptions:

We evaluate against distal site relative usage (as filtered ground truth only contains 2 sites we can choose one/other).
We prefer more matches to ground truth

I implemented the following prioritisation scheme (1 = 'most important'): Both > Distal > Proximal > Neither

If a ground truth TE has matches to both sites, keep all prediction groups that have matches to both sites and discard the rest.
ground truth TE has matches to 'distal' site & groups that match 'neither' site, keep all 'distal' site match prediction groups and discard the rest
prediction group matches classified as 'distal' & 'proximal', arbitrarily prefer the 'distal' matches and discard the rest
- We are evaluating the deviation from ground truth distal site usage so prefer to compare to predicted sites that match
Ground truth TE has matches classified as 'proximal' at best - keep all prediction groups classed as proximal and discard the rest. Calculate 1 - proximal usage as the estimated distal site usage for prediction and ground truth
Classify as 'neither'

The above prioritisation gives a single 'match group' classification to each ground truth terminal exon, but still leaves many TEs with multiple 'prediction group' matches. In selecting a single 'match' for each terminal exon I decided to be as nice as possible to the tools:

Calculate the absolute difference between predicted and ground truth relative usage for all matches.
For each terminal exon, choose the prediction group with the smallest absolute difference to ground truth as the 'representative match' for that ground truth TE.

Don't have to follow above suggestion, but I think it's fairly simple, 'makes sense' and is as nice to tools as we can be. At very least wanted to document the new things I did to save having to sift through the notebook!

Any questions/comments just ping me on Slack :) good luck with the final push everyone!

iRNA-COSI / APAeval

POC of relative usage quantification summary workflow #399

Key points/take homes

Suggested prioritisation scheme for 'multiple prediction group matches'