iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples

feat(Execution workflow): CSI-UTR - report Format 04 BED file in workflow (per-PAS fractional relative usage) #388

Open SamBryce-Smith opened 2 years ago

SamBryce-Smith commented 2 years ago

Parent issue - https://github.com/iRNA-COSI/APAeval/issues/382

Per the updated execution workflow output specifications, we need CSI-UTR to report the per-PAS fractional relative usage in a Format 04 BED file.

CSI-UTR calculates a number of relative usage metrics, but the one that fits the Format 04 convention is the 'PSI' metric, which is equivalent to the percentage a poly(A) site is used relative to the total expression of PAS in that gene/terminal exon. If it is reported on the % scale (0-100), it needs to be converted to a fraction by dividing by 100.
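The conversion described above could look like the following minimal sketch. The function name `psi_to_fraction` and its `scale` argument are hypothetical and not part of CSI-UTR or the APAeval workflow; the caller is assumed to know which scale the tool reported.

```python
def psi_to_fraction(psi, scale="percent"):
    """Return a PSI value as a 0-1 fraction.

    `scale` is a hypothetical switch: "percent" for values on 0-100,
    "fraction" for values already on 0-1.
    """
    if scale == "percent":
        value = psi / 100.0
    elif scale == "fraction":
        value = psi
    else:
        raise ValueError(f"unknown scale: {scale!r}")
    # Guard against applying the wrong scale by accident.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"PSI {psi} is outside the expected range for scale {scale!r}")
    return value
```

Passing the scale explicitly, rather than guessing it from the data, avoids misreading a small percentage such as 0.5% as the fraction 0.5.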

yuukiiwa commented 2 years ago

Hi @SamBryce-Smith and @faricazjj, CSI-UTR's differential analysis output reports the PSI values for both of the samples on a 0-1 scale, so they do not need to be divided by 100. Should I split the differential analysis output into two quantification BED files, one for each of the samples? Here is the example differential analysis output file from CSI-UTR/TestCases.md:

CSI     ENSGENE GENE_SYM        PSI1 (LOAD)     PSI2 (Control)  deltaPSI (LOAD-Control) P-value FDR
ENSG00000189241:116278517_116277027-116276921   ENSG00000189241 TSPYL1  0.068122        0.089263        -0.021141       5e-05   0.00679553001277139
ENSG00000189241:116278517_116277126-116277027   ENSG00000189241 TSPYL1  0.086873        0.109091        -0.022218       0.000114        0.012597769470405
ENSG00000189241:116278517_116278517-116277246   ENSG00000189241 TSPYL1  0.505286        0.456247        0.049039        0       0
ENSG00000100796:91458759_91458759-91458147      ENSG00000100796 PPP4R3A 0.580057        0.443966        0.136091        0.000605        0.0425812764550264
ENSG00000174684:66346049_66345714-66345577      ENSG00000174684 B4GAT1  0.314256        0.342451        -0.028195       0.000374        0.0302434133738602
ENSG00000174684:66346049_66346049-66345844      ENSG00000174684 B4GAT1  0.215083        0.177637        0.037446        0       0
ENSG00000119314:112223851_112219214-112219045   ENSG00000119314 PTBP3   0.134671        0.074944        0.059727        3e-06   0.000676385593220339
ENSG00000196652:99532247_99532615-99532684      ENSG00000196652 ZKSCAN5 0.036234        0.11213 -0.075896       6.2e-05 0.00795888540410133
ENSG00000126785:63291182_63291740-63291852      ENSG00000126785 RHOJ    0.047283        0.178141        -0.130858       0.000583        0.0415550529135968
ENSG00000115310:54973156_54972313-54972195      ENSG00000115310 RTN4    0.08397 0.101558        -0.017588       0       0
ENSG00000115310:54973156_54972352-54972313      ENSG00000115310 RTN4    0.144597        0.174865        -0.030268       0       0
ENSG00000115310:54973156_54972948-54972890      ENSG00000115310 RTN4    0.089743        0.076191        0.013552        6.6e-05 0.00830211347517731

Thanks!
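For illustration, the split asked about above might be sketched as follows. This assumes the file is tab-delimited with the column order shown in the example header; the function name `split_psi_by_condition` is hypothetical. The rows carry no chromosome or strand, so this only produces per-condition (CSI ID, PSI) pairs; the remaining BED columns would have to be joined in from another source.

```python
import csv
import io


def split_psi_by_condition(handle):
    """Split a CSI-UTR differential table into per-condition PSI records.

    Assumes tab-delimited input with columns in the order:
    CSI, ENSGENE, GENE_SYM, PSI1, PSI2, deltaPSI, P-value, FDR.
    Returns a dict mapping each PSI column header to a list of
    (csi_id, psi) tuples.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    first, second = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        csi_id = row[0]
        first.append((csi_id, float(row[3])))
        second.append((csi_id, float(row[4])))
    return {header[3]: first, header[4]: second}


# Usage with the first row of the example output above.
example = (
    "CSI\tENSGENE\tGENE_SYM\tPSI1 (LOAD)\tPSI2 (Control)\t"
    "deltaPSI (LOAD-Control)\tP-value\tFDR\n"
    "ENSG00000189241:116278517_116277027-116276921\tENSG00000189241\t"
    "TSPYL1\t0.068122\t0.089263\t-0.021141\t5e-05\t0.00679553001277139\n"
)
per_condition = split_psi_by_condition(io.StringIO(example))
```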

faricazjj commented 2 years ago

@yuukiiwa Thanks for looking into this! :D From what I understand from the output, we could split the differential analysis output into two quantification BED files, one for each condition. But I'm going to tag @mrgazzara here for extra input :p I have 2 questions!

  1. So this is the tool that needs two conditions and two replicates per condition and the output will be relative usage per condition. Is this still something we want to implement now considering the other tools calculate relative usage per sample only and not per condition?
  2. Where do we extract PAS from the output file?
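As a starting point for question 2, here is a rough sketch that only pulls apart the CSI identifier from the first column. The field names are hypothetical, and which coordinate (if any) corresponds to the PAS is an assumption that would need checking against the CSI-UTR documentation; the sketch only relies on the `GENE:coord_coord-coord` pattern visible in the example rows.

```python
from typing import NamedTuple


class CsiId(NamedTuple):
    """Pieces of a CSI identifier; field meanings are assumptions."""
    gene_id: str
    anchor: int          # coordinate before the underscore
    segment_first: int   # first coordinate after the underscore
    segment_second: int  # second coordinate after the underscore


def parse_csi_id(csi_id):
    """Parse an ID such as 'ENSG00000189241:116278517_116277027-116276921'."""
    gene_id, coords = csi_id.split(":", 1)
    anchor, segment = coords.split("_", 1)
    first, second = segment.split("-", 1)
    return CsiId(gene_id, int(anchor), int(first), int(second))
```

Note that the identifier carries no chromosome or strand, so these would still have to come from elsewhere.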

mrgazzara commented 2 years ago

I will have to look into this a little bit further. The usual way to get individual sample quantification with tools like this that require multiple conditions (because they're more focused on differential) is to run it with the same sample against itself. The requirement to also have a replicate might be a dealbreaker. I need to read the paper to see.

faricazjj commented 2 years ago

@mrgazzara When I implemented it, I think I tried running it with the same sample against itself but with different condition names. The replicates were also the same sample, named differently as separate "replicates", but it errored out. The only way I could run it was if the replicates were distinct.