POC of relative expression calculation in execution workflow

faricazjj commented 2 years ago

Proof of concept for relative expression calculation in execution workflows

POC content In this issue, we discussed the following:

Decided which potential tools to calculate relative expression for and agreed on the following: APAlyzer QAPA APAtrap DaPars DaPars2 LABRAT PAQR IsoSCM
A summary of how are the tools outputting relative expression
How to approach relative expression calculation and what the discrepancies are between tools to consider when matching ground truth data with relative usage output from tools later on in summary workflow. Are the outputs from these tools feasible to be used in summary workflow?
Testing the tools to see if the outputs are able to qualify for relative quantification (i.e. which results to extract and how to calculate relative quantification from the results). Some tools in the list above are more straightforward when outputting relative quantification. However, other tools are more involved and require some discussion. The tools discussed in this issue are: APAtrap TAPAS APAlyzer

Outcome To keep track of the execution workflows to be updated and implemented to include relative quantification, we have created a child issue a checklist of tools to be implemented.

Child issue: https://github.com/iRNA-COSI/APAeval/issues/382

Estimate: 5d

SamBryce-Smith commented 2 years ago

Following up on this (rather late I'm afraid so sorry for that), feeling an inevitable essay coming up. I'm starting with the assumption that we're leaning towards matching our ground truth data to the relative expression outputs of each tool. This is so we can include more tools in the quantification challenge that don't quite fit our per-PAS TPM model due to not reporting per-PAS quant (DaPars(2)) or having a different metric to pas vs all other pas (e.g. LABRAT's gene-level psi).

NB: Have discussed a lot of these points with @mrgazzara but these are pretty much entirely my opinions/thoughts. I can break down more specifically if people have any queries/like this approach.

Tool summary:

2 types of metrics reported by tools - region level (1 value per terminal exon/gene) or site level
Region level tools - APAlyzer, APAtrap (?), DaPars2, DaPars, LABRAT, PAQR (not yet implemented in Ewf), Roar (not merged)
Site level tools - QAPA, PAQR, Dapars & DaPars2 (with some simple post-processing, see notes)
TBD - CSI-UTR, IsoSCM, diffUTR, TAPAS (not merged), GETUTR (not merged)

Summary of my road-map:

Site-level (1 site vs others) fractional relative usage calculation is already implemented, just need to adapt so we look for PAS in the right region (e.g. only terminal exon)
Propose new BED specifications for relative usage execution workflow outputs (branch 'update_ewf_specs' 88b512030e8cca0aaa942637a731f17f67bcc940)
- 1 - polyA coordinate BED where score field is relative usage value
- 2 - region (e.g. terminal exon or gene) coordinate BED if per-PAS BED (1) generated
- 3 - region coordinate bed where score field is relative usage value
Adapt existing code to calculate fractional relative usage for per PAS-reporting tools
Tool-specific functions for those reporting a 'region-level'/non-standard metric (e.g. APAlyzer, LABRAT)
Output results for visualisation are correlation value (so common across tools)

1. New output files for execution workflows to capture the relative expression values (branch 'update_ewf_specs' https://github.com/iRNA-COSI/APAeval/commit/88b512030e8cca0aaa942637a731f17f67bcc940)

I suggest separate BED files for region & per-PAS relative expression metrics - have updated specs on branch update_ewf_specs with what I have in mind.
- I also add a 'region BED file', which just reports the region containing the identified polyA site. This can be the terminal exon or the gene coordinates. I think this will make downstream calculations a little easier (don't need to fish through GTF for terminal exons).
  - Would only need this file if per PAS relative expression is reported

2. Calculate relative usage of ground truth in summary workflows

Per-PAS fractional relative usage can use more or less the same code @dominikburri has written, perhaps with some slight adaptations to e.g. only calculate for ground truth sites, use the 'region BED' instead of the GTF for extracting polyA sites (this is maybe not so necessary).

For 'region-level' relative usage tools:

Only need new specific implementations for region-level tools (APAlyzer & LABRAT for definite as of now).
Matching ground truth PAS to the region is the same for all tools (A simple overlap really). Just need to make sure GT sites are labelled somehow with the overlapping predicted region.
Implement specific functions to calculate region-level relative usage for each tool. Use the participant_id in the nextflow.config to pick the right function to calculate relative expression
Correlation calculation as before

Notes:

For the two DaPars tools, we could get a site level relative usage with a bit of post-processing. They report a fraction distal site used. Since both tools use a 2 polyA site model, we could get the proximal PAS usage by 1 - distal usage.

dominikburri commented 2 years ago

Hi @SamBryce-Smith, many thanks for the detailed descriptions! I like the approach to split relative usage into site-based and region-based. Maybe I missed it, but how would the prediction be compared against the ground truth? Similar to the TPM-based quantification with correlation? That is, what would be the metrics? And when we need to compute relative usage for the ground truth, wouldn't some tools benefit more than others? Because of e.g. how the relative usage is computed? or which information is incorporated? Maybe my comments are out of hands, so please let me know!

faricazjj commented 2 years ago

Hi @SamBryce-Smith @mrgazzara! I looked into APAtrap on how to get relative usage output here are the explanations I came up with.

APAtrap APAtrap's original absolute TPM quantification output for control_r1 looks like the following:

For each transcript id, we have a list of all identified PAS and their expression levels from the most proximal to most distal if the strand is +, and the opposite for a - strand (source: APAtrap wiki).

The raw output above is then transformed into the following quantification output file:

To get relative usage output, we would get each PAS in the transcript divided by the total sum of expression levels of all the PAS in the same transcript. For example, the screenshot above of the original output would be converted to the following:

Taking the example of ENSMUST00000023572.14, from the quantification output file, we get the following:

ENSMUST00000023572.14|16|78336188-78340752|+    0.0
ENSMUST00000023572.14|16|78336188-78340752|+    50.15
ENSMUST00000023572.14|16|78336188-78340752|+    35.57

In the relative quantification output file, the score column of the corresponding rows become:

ENSMUST00000023572.14|16|78336188-78340752|+    0.0
ENSMUST00000023572.14|16|78336188-78340752|+    0.5850443303779748 
ENSMUST00000023572.14|16|78336188-78340752|+    0.4149556696220252

Does this look right and should we go ahead with this implementation?

SamBryce-Smith commented 2 years ago

Thanks @faricazjj, that also looks correct to me! Just one comment; in the past we have 'disqualified' tools from challenges because they don't 'directly' provide the required output. This was also our initial starting point for the relative expression challenge, with some leeway for the DaPars tools which report a '% distal used' (but as they only predict two sites we can also calculate the proximal usage very easily).

I don't think we need to dogmatically stick to this strict requirement, especially if it allows us to include more tools. However, if we open up we will have to be careful to ensure we are not unfairly excluding other tools where it is possible to adapt their quantifications to a fractional usage. A quick skim of the methods summary table suggests that this would just be APAlyzer out of currently merged tools. Perhaps also LABRAT if we commit to doing a fair bit of our own parsing (going back to GTF to get PAS coordinates, Salmon quant.sf outputs to get transcript TPMs so can sum to PAS TPMs etc.).

I guess as long as group agrees with this I would press ahead! I do think we can fairly justify relaxing requirements for the relative quantification workflow, given how most tools do not provide a TPM per-PAS. "We tried first to assess quantification using TPMs but found that most tools do not provide quantification in TPM format. However, many provide some form of relative quantification. Therefore we also adopted an alternative quantification format, normalising provided expression values to the total expression on that transcript/terminal exon/gene/'region' ('fractional expression) or taking provided outputs if matching this format.".

mrgazzara commented 2 years ago

@faricazjj I agree with Sam's assessment above. Your calculation of relative usage from the APAtrap output looks great.

I will look into APAlyzer and also agree with Sam that we can make LABRAT work as well in a similar way.

faricazjj commented 2 years ago

@mrgazzara

TAPAS Relative Quantification POC TAPAS's raw output from their first command APA_sites_detection based on their readme instruction on their github page seems like a relative quantification, but the denominator is the length of the 3'utr instead of total counts in the 3'utr. Screen Shot 2022-09-14 at 11 32 05 AM

Could this raw output be taken for relative quantification or would it be more accurate to recalculate relative quantification of each PAS but taking the last column of this raw output and do read count of each APA site/total read count in each row? Here is a screenshot of the raw output:
Also does TAPAS qualify for absolute TPM quantification by taking the last column?

Update and POC conclusion Discussed with @mrgazzara and decided that

We should be using relative abundance to calculate relative usage quantification
The last column is not equivalent to TPM, so TAPAS doesn't qualify for absolute quantification

This POC has been used to implement relative quant in TAPAS

faricazjj commented 2 years ago

@mrgazzara

APAlyzer POC

Background Currently, APAlyzer only qualifies for differential challenge. According to their readme, APAlyzer is divided into 2 parts: analysis of APA in 3’UTRs and analysis of APA in introns. Each of the two sections has a calculation of relative expression that is later used for differential calculation.

Tool output In our script to run APAlyzer (APAlyzer/workflow/scripts/APAlyzer_main.R), the two relative expression outputs are represented by the variables UTR_APA_OUT and IPA_OUT respectively. Printing these two variables out in the log (to look at the log: vim results/logs/main.log), we get the following dataframes:

[1] "Relative expression of 3'UTR APA" gene_symbol control_replicate1_areads control_replicate1_aRPKM 1 Wnk1 0 0 control_replicate1_creads control_replicate1_cRPKM control_replicate1_3UTR_RE 1 0 0 NaN srsf3_replicate2_areads srsf3_replicate2_aRPKM srsf3_replicate2_creads 1 0 0 0 srsf3_replicate2_cRPKM srsf3_replicate2_3UTR_RE 1 0 NaN

In the readme under section Analysis of APA in 3’UTRs's Calculation of relative expression, they describe this dataframe as: "The output data frame covers reads count (in aUTR or cUTR), RPKM (in aUTR or cUTR) and RE (log2(aUTR/cUTR)) for each gene"

[1] "Relative expression of IPA" gene_symbol IPAID PASid control_replicate1_IPA_UPreads 1 Cxadr 16:78329968:+Cxadr 16:78329968:+ 7 2 Cxadr 16:78333717:+Cxadr 16:78333717:+ 3 3 Cxadr 16:78340760:+Cxadr 16:78340760:+ 2003 4 Wnk1 6:119929833:-Wnk1 6:119929833:- 81 5 Wnk1 6:119930772:-Wnk1 6:119930772:- 56 6 Wnk1 6:119944330:-Wnk1 6:119944330:- 7 7 Wnk1 6:119950323:-Wnk1 6:119950323:- 0 8 Wnk1 6:119952869:-Wnk1 6:119952869:- 122 9 Wnk1 6:119953650:-Wnk1 6:119953650:- 16 10 Wnk1 6:119962308:-Wnk1 6:119962308:- 16 11 Wnk1 6:119970747:-Wnk1 6:119970747:- 0 control_replicate1_IPA_UPRPK control_replicate1_IPA_DNreads 1 8.652658 19 2 15.075377 3 3 456.472197 49 4 28.571429 59 5 29.535865 23 6 18.918919 5 7 0.000000 195 8 1207.920792 180 9 133.333333 59 10 12.075472 6 11 0.000000 0 control_replicate1_IPA_DNRPK control_replicate1_LEreads 1 6.425431 19 2 6.651885 19 3 2.655827 19 4 35.056447 2619 5 25.442478 2619 6 10.799136 2619 7 1015.625000 2619 8 206.185567 2619 9 86.764706 2619 10 3.790272 2619 11 0.000000 2619 control_replicate1_LERPK control_replicate1_IPA_RE 1 33.74778 -3.92147245 2 33.74778 -2.00230222 3 33.74778 3.74924416 4 963.57616 0.00000000 5 963.57616 -7.87895978 6 963.57616 -6.89081374 7 963.57616 0.00000000 8 963.57616 0.05603063 9 963.57616 -4.37096843 10 963.57616 -6.86171835 11 963.57616 0.00000000 srsf3_replicate2_IPA_UPreads srsf3_replicate2_IPA_UPRPK 1 15 18.541409 2 3 15.075377 3 1627 370.783956 4 80 28.218695 5 58 30.590717 6 4 10.810811 7 1 5.714286 8 48 475.247525 9 5 41.666667 10 10 7.547170 11 3 27.272727 srsf3_replicate2_IPA_DNreads srsf3_replicate2_IPA_DNRPK 1 28 9.469056 2 9 19.955654 3 133 7.208672 4 54 32.085561 5 22 24.336283 6 7 15.118790 7 162 843.750000 8 65 74.455899 9 14 20.588235 10 6 3.790272 11 2 1.700680 srsf3_replicate2_LEreads srsf3_replicate2_LERPK srsf3_replicate2_IPA_RE 1 79 140.3197 -3.9510972 2 79 140.3197 0.0000000 3 79 140.3197 1.3735364 4 2040 750.5519 0.0000000 5 2040 750.5519 -6.9069286 6 2040 750.5519 0.0000000 7 2040 750.5519 0.0000000 8 2040 750.5519 -0.9050994 9 2040 750.5519 -5.1541124 10 2040 750.5519 -7.6422660 11 2040 750.5519 -4.8753122

In the readme under section Analysis of APA in introns's Calculation of relative expression, they describe this dataframe as: "The output data frame contains read count and read density IPA upstream (a), IPA downstream (b) and 3’-most exon region (c). The RE of IPA is calculated as log2((a - b)/c)."

Conclusion For the workflow's RE output, should we extract from the second dataframe the gene and _IPA_RE column(s) but getting rid of the log?

Update and POC conclusion @mrgazzara and I discussed that we should be extracting information from the dataframe above for "Relative expression of 3'UTR APA". However, there are a couple issues:

The table above has NaN values for the RE columns and the other read columns are 0. To look into this issue, we went back and tried to run APAlyzer following exactly the instructions on APAlyzer's readme. The script to obtain non 0 read column is contained in this file. It runs one chromosome at a time, in this script, it runs for chr19 only. This has to be specified for every chromosome in a loop.
The table doesn't provide coordinate and strand information that we need for the relative quantification challenge output bed file.

In concolusion, APAlyzer won't be included in relative usage quantification challenge since it doesn't output coordinate and strand information needed in relative usage quantification challenge bed file.

iRNA-COSI / APAeval

POC of relative expression calculation in execution workflow #278

1. New output files for execution workflows to capture the relative expression values (branch 'update_ewf_specs' https://github.com/iRNA-COSI/APAeval/commit/88b512030e8cca0aaa942637a731f17f67bcc940)

2. Calculate relative usage of ground truth in summary workflows