iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License
13 stars 14 forks source link

POC of relative expression calculation in execution workflow #278

Closed faricazjj closed 2 years ago

faricazjj commented 2 years ago

Proof of concept for relative expression calculation in execution workflows

POC content In this issue, we discussed the following:

Outcome To keep track of the execution workflows to be updated and implemented to include relative quantification, we have created a child issue a checklist of tools to be implemented.

Child issue: https://github.com/iRNA-COSI/APAeval/issues/382

Estimate: 5d

SamBryce-Smith commented 2 years ago

Following up on this (rather late I'm afraid so sorry for that), feeling an inevitable essay coming up. I'm starting with the assumption that we're leaning towards matching our ground truth data to the relative expression outputs of each tool. This is so we can include more tools in the quantification challenge that don't quite fit our per-PAS TPM model due to not reporting per-PAS quant (DaPars(2)) or having a different metric to pas vs all other pas (e.g. LABRAT's gene-level psi).

NB: Have discussed a lot of these points with @mrgazzara but these are pretty much entirely my opinions/thoughts. I can break down more specifically if people have any queries/like this approach.

Tool summary:

Summary of my road-map:

1. New output files for execution workflows to capture the relative expression values (branch 'update_ewf_specs' https://github.com/iRNA-COSI/APAeval/commit/88b512030e8cca0aaa942637a731f17f67bcc940)

2. Calculate relative usage of ground truth in summary workflows

For 'region-level' relative usage tools:

Notes:

dominikburri commented 2 years ago

Hi @SamBryce-Smith, many thanks for the detailed descriptions! I like the approach to split relative usage into site-based and region-based. Maybe I missed it, but how would the prediction be compared against the ground truth? Similar to the TPM-based quantification with correlation? That is, what would be the metrics? And when we need to compute relative usage for the ground truth, wouldn't some tools benefit more than others? Because of e.g. how the relative usage is computed? or which information is incorporated? Maybe my comments are out of hands, so please let me know!

faricazjj commented 2 years ago

Hi @SamBryce-Smith @mrgazzara! I looked into APAtrap on how to get relative usage output here are the explanations I came up with.

APAtrap APAtrap's original absolute TPM quantification output for control_r1 looks like the following:

Screen Shot 2022-08-24 at 5 53 32 AM

For each transcript id, we have a list of all identified PAS and their expression levels from the most proximal to most distal if the strand is +, and the opposite for a - strand (source: APAtrap wiki).

The raw output above is then transformed into the following quantification output file:

Screen Shot 2022-08-30 at 8 28 09 PM

To get relative usage output, we would get each PAS in the transcript divided by the total sum of expression levels of all the PAS in the same transcript. For example, the screenshot above of the original output would be converted to the following:

Screen Shot 2022-08-30 at 8 27 34 PM

Taking the example of ENSMUST00000023572.14, from the quantification output file, we get the following:

ENSMUST00000023572.14|16|78336188-78340752|+    0.0
ENSMUST00000023572.14|16|78336188-78340752|+    50.15
ENSMUST00000023572.14|16|78336188-78340752|+    35.57   

In the relative quantification output file, the score column of the corresponding rows become:

ENSMUST00000023572.14|16|78336188-78340752|+    0.0
ENSMUST00000023572.14|16|78336188-78340752|+    0.5850443303779748 
ENSMUST00000023572.14|16|78336188-78340752|+    0.4149556696220252

Does this look right and should we go ahead with this implementation?

SamBryce-Smith commented 2 years ago

Thanks @faricazjj, that also looks correct to me! Just one comment; in the past we have 'disqualified' tools from challenges because they don't 'directly' provide the required output. This was also our initial starting point for the relative expression challenge, with some leeway for the DaPars tools which report a '% distal used' (but as they only predict two sites we can also calculate the proximal usage very easily).

I don't think we need to dogmatically stick to this strict requirement, especially if it allows us to include more tools. However, if we open up we will have to be careful to ensure we are not unfairly excluding other tools where it is possible to adapt their quantifications to a fractional usage. A quick skim of the methods summary table suggests that this would just be APAlyzer out of currently merged tools. Perhaps also LABRAT if we commit to doing a fair bit of our own parsing (going back to GTF to get PAS coordinates, Salmon quant.sf outputs to get transcript TPMs so can sum to PAS TPMs etc.).

I guess as long as group agrees with this I would press ahead! I do think we can fairly justify relaxing requirements for the relative quantification workflow, given how most tools do not provide a TPM per-PAS. "We tried first to assess quantification using TPMs but found that most tools do not provide quantification in TPM format. However, many provide some form of relative quantification. Therefore we also adopted an alternative quantification format, normalising provided expression values to the total expression on that transcript/terminal exon/gene/'region' ('fractional expression) or taking provided outputs if matching this format.".

mrgazzara commented 2 years ago

@faricazjj I agree with Sam's assessment above. Your calculation of relative usage from the APAtrap output looks great.

I will look into APAlyzer and also agree with Sam that we can make LABRAT work as well in a similar way.

faricazjj commented 2 years ago

@mrgazzara

TAPAS Relative Quantification POC TAPAS's raw output from their first command APA_sites_detection based on their readme instruction on their github page seems like a relative quantification, but the denominator is the length of the 3'utr instead of total counts in the 3'utr. Screen Shot 2022-09-14 at 11 32 05 AM

  1. Could this raw output be taken for relative quantification or would it be more accurate to recalculate relative quantification of each PAS but taking the last column of this raw output and do read count of each APA site/total read count in each row? Here is a screenshot of the raw output:

    Screen Shot 2022-09-14 at 8 25 21 AM
  2. Also does TAPAS qualify for absolute TPM quantification by taking the last column?

Update and POC conclusion Discussed with @mrgazzara and decided that

  1. We should be using relative abundance to calculate relative usage quantification
  2. The last column is not equivalent to TPM, so TAPAS doesn't qualify for absolute quantification

This POC has been used to implement relative quant in TAPAS

faricazjj commented 2 years ago

@mrgazzara

APAlyzer POC

Background Currently, APAlyzer only qualifies for differential challenge. According to their readme, APAlyzer is divided into 2 parts: analysis of APA in 3’UTRs and analysis of APA in introns. Each of the two sections has a calculation of relative expression that is later used for differential calculation.

Tool output In our script to run APAlyzer (APAlyzer/workflow/scripts/APAlyzer_main.R), the two relative expression outputs are represented by the variables UTR_APA_OUT and IPA_OUT respectively. Printing these two variables out in the log (to look at the log: vim results/logs/main.log), we get the following dataframes:

[1] "Relative expression of 3'UTR APA" gene_symbol control_replicate1_areads control_replicate1_aRPKM 1 Wnk1 0 0 control_replicate1_creads control_replicate1_cRPKM control_replicate1_3UTR_RE 1 0 0 NaN srsf3_replicate2_areads srsf3_replicate2_aRPKM srsf3_replicate2_creads 1 0 0 0 srsf3_replicate2_cRPKM srsf3_replicate2_3UTR_RE 1 0 NaN

In the readme under section Analysis of APA in 3’UTRs's Calculation of relative expression, they describe this dataframe as: "The output data frame covers reads count (in aUTR or cUTR), RPKM (in aUTR or cUTR) and RE (log2(aUTR/cUTR)) for each gene"

[1] "Relative expression of IPA" gene_symbol IPAID PASid control_replicate1_IPA_UPreads 1 Cxadr 16:78329968:+Cxadr 16:78329968:+ 7 2 Cxadr 16:78333717:+Cxadr 16:78333717:+ 3 3 Cxadr 16:78340760:+Cxadr 16:78340760:+ 2003 4 Wnk1 6:119929833:-Wnk1 6:119929833:- 81 5 Wnk1 6:119930772:-Wnk1 6:119930772:- 56 6 Wnk1 6:119944330:-Wnk1 6:119944330:- 7 7 Wnk1 6:119950323:-Wnk1 6:119950323:- 0 8 Wnk1 6:119952869:-Wnk1 6:119952869:- 122 9 Wnk1 6:119953650:-Wnk1 6:119953650:- 16 10 Wnk1 6:119962308:-Wnk1 6:119962308:- 16 11 Wnk1 6:119970747:-Wnk1 6:119970747:- 0 control_replicate1_IPA_UPRPK control_replicate1_IPA_DNreads 1 8.652658 19 2 15.075377 3 3 456.472197 49 4 28.571429 59 5 29.535865 23 6 18.918919 5 7 0.000000 195 8 1207.920792 180 9 133.333333 59 10 12.075472 6 11 0.000000 0 control_replicate1_IPA_DNRPK control_replicate1_LEreads 1 6.425431 19 2 6.651885 19 3 2.655827 19 4 35.056447 2619 5 25.442478 2619 6 10.799136 2619 7 1015.625000 2619 8 206.185567 2619 9 86.764706 2619 10 3.790272 2619 11 0.000000 2619 control_replicate1_LERPK control_replicate1_IPA_RE 1 33.74778 -3.92147245 2 33.74778 -2.00230222 3 33.74778 3.74924416 4 963.57616 0.00000000 5 963.57616 -7.87895978 6 963.57616 -6.89081374 7 963.57616 0.00000000 8 963.57616 0.05603063 9 963.57616 -4.37096843 10 963.57616 -6.86171835 11 963.57616 0.00000000 srsf3_replicate2_IPA_UPreads srsf3_replicate2_IPA_UPRPK 1 15 18.541409 2 3 15.075377 3 1627 370.783956 4 80 28.218695 5 58 30.590717 6 4 10.810811 7 1 5.714286 8 48 475.247525 9 5 41.666667 10 10 7.547170 11 3 27.272727 srsf3_replicate2_IPA_DNreads srsf3_replicate2_IPA_DNRPK 1 28 9.469056 2 9 19.955654 3 133 7.208672 4 54 32.085561 5 22 24.336283 6 7 15.118790 7 162 843.750000 8 65 74.455899 9 14 20.588235 10 6 3.790272 11 2 1.700680 srsf3_replicate2_LEreads srsf3_replicate2_LERPK srsf3_replicate2_IPA_RE 1 79 140.3197 -3.9510972 2 79 140.3197 0.0000000 3 79 140.3197 1.3735364 4 2040 750.5519 0.0000000 5 2040 750.5519 -6.9069286 6 2040 750.5519 0.0000000 7 2040 750.5519 0.0000000 8 2040 750.5519 -0.9050994 9 2040 750.5519 -5.1541124 10 2040 750.5519 -7.6422660 11 2040 750.5519 -4.8753122

In the readme under section Analysis of APA in introns's Calculation of relative expression, they describe this dataframe as: "The output data frame contains read count and read density IPA upstream (a), IPA downstream (b) and 3’-most exon region (c). The RE of IPA is calculated as log2((a - b)/c)."

Conclusion For the workflow's RE output, should we extract from the second dataframe the gene and _IPA_RE column(s) but getting rid of the log?

Update and POC conclusion @mrgazzara and I discussed that we should be extracting information from the dataframe above for "Relative expression of 3'UTR APA". However, there are a couple issues:

  1. The table above has NaN values for the RE columns and the other read columns are 0. To look into this issue, we went back and tried to run APAlyzer following exactly the instructions on APAlyzer's readme. The script to obtain non 0 read column is contained in this file. It runs one chromosome at a time, in this script, it runs for chr19 only. This has to be specified for every chromosome in a loop.

  2. The table doesn't provide coordinate and strand information that we need for the relative quantification challenge output bed file.

In concolusion, APAlyzer won't be included in relative usage quantification challenge since it doesn't output coordinate and strand information needed in relative usage quantification challenge bed file.