Also see Yoseph's summary document giving a road map for filtering / metrics here: https://docs.google.com/document/d/1u18BSWB2776d-EXAoft7Je5KmhC1aGMFZJRpJ1rueNI/edit
How do we accomplish this? The first task (I'll open a separate issue for it) is to create a filtering util script to define terminal exons and filter the ground truth files down. The output of this filtering will leave the ground truth files with a binary distal and proximal PAS for every terminal exon, plus some terminal exon (TE) DB/summary file of those terminal exons. Let's leave the logistics of that filtering script to the separate issue; here let's just discuss what to do with the actual metrics to be calculated in the relative usage quantification summary workflow.
I will, however, assume there is some new information in the resulting ground truth filtered BED files to help in the calculation of the metrics, mainly in the `name` field. This is open for discussion (@SamBryce-Smith any additional requirements / suggestions would be great), but I think the most useful information here will be a unique terminal exon ID (some combination of Gene ID, Transcript ID(s), and maybe even the final TE coordinates after potentially merging overlapping TEs together) combined with either a `dpas` or `ppas` suffix (for distal or proximal polyA sites, respectively). An example terminal exon ID could be something like: `ENSG00000091157|ENST00000254442.8;ENST00000357574.7;ENST00000593058.1|57027015-57029811_dpas`
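To make the scheme concrete, here's a rough sketch of how such an ID could be parsed back into its parts. The helper name, fields, and `_dpas`/`_ppas` suffix handling are assumptions based on the example above, not a finalised spec:

```python
# Hypothetical helper illustrating the proposed gene|transcripts|coords_suffix
# ID layout; nothing here is a finalised format.
from typing import List, NamedTuple


class TerminalExonID(NamedTuple):
    gene_id: str
    transcript_ids: List[str]
    te_coords: str  # e.g. "57027015-57029811" (merged TE start-end)
    pas_type: str   # "dpas" (distal) or "ppas" (proximal)


def parse_te_id(te_id: str) -> TerminalExonID:
    """Split an ID like
    ENSG00000091157|ENST00000254442.8;ENST00000357574.7;ENST00000593058.1|57027015-57029811_dpas
    into gene ID, transcript IDs, TE coordinates, and PAS type."""
    gene_id, tx_field, coord_field = te_id.split("|")
    te_coords, pas_type = coord_field.rsplit("_", 1)
    return TerminalExonID(gene_id, tx_field.split(";"), te_coords, pas_type)
```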
Finally, to accomplish the calculation of the metrics discussed in the above document, which I also outline below, it would be very helpful to have matching terminal exon IDs for the EWF output. I propose we also match the EWF relative usage output to the TE summary file to simply assign terminal exon IDs to them.
So let's assume that util script is done and we've filtered the ground truth and applied similar TE IDs to the EWF output predicted PAS. The quantification summary workflow would then take as input these new, filtered BED
files from the ground truth and the EWF output, perform the matching over our defined windows, and calculate/count the following:
- Number of ground truth terminal exons (identified via the `name` field) where neither the `dpas` nor the `ppas` has a match (i.e. APA was detected, just the PAS locations are off)
- Number of terminal exons where only the `dpas` or only the `ppas` has a match
- Number of terminal exons where both the `dpas` and the `ppas` have a match
- Number of `ppas` and `dpas` sites that had multiple predicted sites matched to them from the EWF output

Given the above matching we can then simply reuse the same correlations done with the absolute TPM quantifications, just with a filter so that relative usage is correlated for only one of the two ground truth PAS values per terminal exon. Perhaps always the `dpas` (or 1 - `ppas` if it was not matched)? We also want to further subset these correlations based on the degree of matching: terminal exons where both sites matched, where only the `dpas` matched, and where only the `ppas` matched.
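To make this concrete, here's a minimal sketch of the counting and per-group correlation, assuming the window matching has already produced one row per ground truth terminal exon with boolean `dpas_matched` / `ppas_matched` flags plus ground truth and predicted distal relative usage columns. All column names are placeholders, not a finalised format:

```python
# Minimal sketch: classify each ground truth terminal exon into a match group,
# count the groups, and correlate distal relative usage within each group.
import numpy as np
import pandas as pd


def classify_match_group(row) -> str:
    if row["dpas_matched"] and row["ppas_matched"]:
        return "both"
    if row["dpas_matched"]:
        return "distal"
    if row["ppas_matched"]:
        return "proximal"
    return "neither"


def summarise_matches(df: pd.DataFrame) -> pd.DataFrame:
    """One row per ground truth terminal exon in; counts and Pearson r per
    match group out."""
    df = df.copy()
    df["match_group"] = df.apply(classify_match_group, axis=1)
    out = []
    for group, sub in df.groupby("match_group"):
        # Pearson correlation of distal relative usage; NaN if too few points
        r = sub["gt_dpas_usage"].corr(sub["pred_dpas_usage"]) if len(sub) > 1 else np.nan
        out.append({"match_group": group, "n_terminal_exons": len(sub), "pearson_r": r})
    return pd.DataFrame(out)
```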
A final metric (similar to the tau metric used in the GB benchmarking paper) would be to sum the absolute deviations in relative usage (ground truth versus EWF output) within each of the above 3 sets.
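Continuing the same sketch (same hypothetical columns, with `match_group` already assigned), that summed deviation could be as simple as:

```python
# Summed absolute deviation in distal relative usage per match group; dividing
# by the group size instead would give a mean absolute deviation.
def abs_deviation_by_group(df):
    deviations = (df["gt_dpas_usage"] - df["pred_dpas_usage"]).abs()
    return deviations.groupby(df["match_group"]).sum()
```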
Thoughts and feedback appreciated, particularly on the proposed ID scheme which I think will facilitate grabbing a lot of these metrics. With this at least we can start thinking about how to implement these metrics as I finish writing up the util scripts.
A practical comment not directly related to @mrgazzara's points: I've had a go at implementing some of these things myself, mainly because I really wanted to explore the impact of our filtering criteria with some real data. I was hoping it could inform selecting metrics/filtering criteria. It's all in one notebook, so it's not necessarily flexible/translatable to util scripts like you suggest, but I wanted to note this somewhere before going ahead. I think I'm roughly following your suggested workflow, but I'm keeping everything in memory (using the PyRanges package) so I haven't thought about how to output in BED/conventional formats yet. Again, this should really be copied into a separate issue as you suggest.
I still need to tidy this up so it's defo WIP: https://github.com/iRNA-COSI/APAeval/tree/poc_relative_quant/summary_workflows/relative_quantification/poc
See #413 for further discussion/a formal definition of my proposed 'terminal exon ID'
I've reached a stopping point with the proof-of-concept notebook. It doesn't implement absolutely everything but covers the general approach, so take the conclusions with a pinch of salt.
You can view it here - I tested the implementation on a single Mayr sample with outputs from DaPars & DaPars2. The plot below is an example of one of Joseph's suggested metrics: a cumulative distribution of the absolute differences in distal site relative usage for each 'match group'.
It's a bit of a behemoth at this point as I also took it as an opportunity to explore some different decisions etc. so I wouldn't be surprised if you find it tricky to decipher. Any questions just write to me on Slack.
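For reference, a cumulative-distribution plot like the one described above could be produced with something along these lines (column names are placeholders, not the notebook's actual variables):

```python
# Rough sketch: empirical CDF of |GT - predicted| distal relative usage,
# one curve per match group.
import numpy as np
import matplotlib.pyplot as plt


def plot_abs_diff_ecdf(df):
    for group, sub in df.groupby("match_group"):
        diffs = np.sort((sub["gt_dpas_usage"] - sub["pred_dpas_usage"]).abs().to_numpy())
        ecdf = np.arange(1, len(diffs) + 1) / len(diffs)
        plt.step(diffs, ecdf, where="post", label=f"{group} (n={len(diffs)})")
    plt.xlabel("absolute difference in distal relative usage")
    plt.ylabel("cumulative fraction of terminal exons")
    plt.legend()
    plt.show()
```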
A non-exhaustive list of things I missed (but which will be implemented in the summary workflow):
A 'prediction group' identifier is critical for tools like DaPars that output predictions transcript by transcript, and as such can have multiple predicted matches (and different predicted relative usages) to the same site.
Need a prioritisation scheme to select a 'representative match' for ground truth terminal exons that have matches to multiple 'prediction groups'.
These events are quite widespread. The screenshot below shows the combinations of distinct 'match groups' for terminal exons with multiple 'prediction group' matches in the DaPars output.
We have 4 possible match groups, 'both' (both sites match within window), 'distal' (only distal GT site matches within window), 'proximal' (only proximal GT site matches within window) & 'neither' (neither GT site matches within window).
Given the following assumptions:
I implemented the following prioritisation scheme (1 = 'most important'): Both > Distal > Proximal > Neither
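A minimal sketch of how that priority order could be applied to collapse multiple 'prediction group' matches into one classification per ground truth terminal exon; the column names (`te_id`, `match_group`) are placeholders, not the notebook's actual variables:

```python
# Apply the Both > Distal > Proximal > Neither priority per terminal exon.
import pandas as pd

PRIORITY = {"both": 1, "distal": 2, "proximal": 3, "neither": 4}  # 1 = most important
RANK_TO_GROUP = {rank: group for group, rank in PRIORITY.items()}


def classify_terminal_exons(matches: pd.DataFrame) -> pd.Series:
    """One row per (terminal exon, prediction group) pair in; one match-group
    label per terminal exon out, keeping the highest-priority group seen."""
    ranks = matches["match_group"].map(PRIORITY)
    best = ranks.groupby(matches["te_id"]).min()
    return best.map(RANK_TO_GROUP)
```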
The above prioritisation gives a single 'match group' classification to each ground truth terminal exon, but still leaves many TEs with multiple 'prediction group' matches. In selecting a single 'match' for each terminal exon I decided to be as nice as possible to the tools:
You don't have to follow the above suggestion, but I think it's fairly simple, 'makes sense', and is as nice to the tools as we can be. At the very least I wanted to document the new things I did, to save having to sift through the notebook!
Any questions/comments just ping me on Slack :) good luck with the final push everyone!
@SamBryce-Smith's POC and comments: https://docs.google.com/document/d/1t30SEHZ6RSFYXG7o1ylWCfgYWVbZCA1jj-SkES1M2rw/edit?usp=sharing
Sam edit: previous issue/discussion #278