iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License
13 stars 14 forks source link

[Bug Fix] IsoSCM identification output should be from assemble step #431

Closed faricazjj closed 1 year ago

faricazjj commented 1 year ago

Fixes #429

Background Currently, we get results for identification challenge from one of the output files from the assemble step. Initially, this file seems to contain the identified changepoints just by looking at the content of the file. Screen Shot 2022-07-26 at 11 32 38 AM

But as I read further beyond the assemble step, the second step they describe is the compare command. The compare command "reports the differential usage of each identified change-point", which I expected to show the site usages of the same sites as the sites from the previous assemble step shown in the screenshot above. But I see less and also different sites in the compare step output file Screen Shot 2022-07-26 at 11 35 09 AM

This led me to look more into whether we should be extracting sites for identification challenge from the assemble or compare step.

As additional context, the steps to run isoscm are:

  1. run assemble step which creates a tmp folder: isoscm/tmp/{sample}.cp.filtered.gtf
  2. run compare step, this requires the xml file output from assemble step for two samples, but since we're getting site usage per sample, I put the same sample twice as input to obtain the following output: Screen Shot 2022-07-26 at 12 18 54 PM

Discussion Even though the first isoscm/tmp/{sample}.cp.filtered.gtf file contains changepoint locations, I'm not entirely sure we should obtain identification output sites from there since it's in a tmp folder and the github readme doesn't explain what the files in the tmp folders are--they only explained the files outside of the tmp folder from assemble step i.e they explained the files from step 2 above but not files from step 1. I think the sites for identification might have to be obtained from the compare step output (i.e. isoscm/compare/{sample}.txt). After reading their readme, I checked their paper and saw that in their paper, they didn't mention assemble step to be where we get identified changpoints. They mentioned from "...Using the “assemble” keyword IsoSCM will assemble the mapped reads in a BAM file into a splice graph, identify nested terminal exons boundaries using the constrained segmentation procedure, and report the resulting models in GTF format....Pairwise comparison of tandem isoform usage can be performed using the “compare” keyword, which reports the relative usage of change points in each sample in a tabular format."

Outcome Hence, it sounds like the compare step outputs the identified change points (or PAS) that we want to report.

Testing The changes in this PR works locally with our test data

Screen Shot 2022-09-27 at 5 57 07 PM

Checklist

mrgazzara commented 1 year ago

@faricazjj I've looked into this a bit for the test data and the other mouse samples you ran locally. I agree with you that the output of the assemble step is not suitable. Spot checking through some of the coordinates, many seem to correspond to transcription start sites as well. This leads me to think it's some more general gene/transcript annotation derivative file and we cannot really tell what is what easily.

Using the compare step and parsing from the relative_usage_quantification_out is the way to go, BUT currently we are only grabbing half of the sites from this output (the proximal PAS only). The output of relative_usage_quantification_out looks like this: Screen Shot 2022-09-28 at 8 50 51 AM

We are grabbing, I think, the coordinate from changepoint column. This is great and corresponds to the changepoint within the broader region considered, meaning this is a proximal PAS only. The distal PAS can also be obtained from each row. The locus column represents the entire region examined by the tool (upstream_segment + downstream_segment) so to get the dPAS coordinate we just need to take the end coordinate from locus if strand == + and take the start coordinate from locus if strand == -.

Maybe there's another issue on how to make this work for relative inclusion, but just continuing on from the above to get it written down, because each row represents a pPAS + dPAS combo we can grab the relative quantification information from here as well from the site_usage column (second to last). So site_usage should be the percent pPAS used (the changepoint coordinate above) and 1 - site_usage would be the percent dPAS used (from the locus coordinate described above).

I am not sure how / if the tool will report genes with single PAS only (no changepoint).

faricazjj commented 1 year ago

Thank you so much for the review and the fix @mrgazzara !! :D I added the extraction of distal PAS in the new commit so we're now getting two sites from each row, one for proximal and the other for distal. Here's an example for control_r1 of the test data: Isoscm compare step raw txt output: Screen Shot 2022-09-28 at 11 30 04 AM

Our workflow's identification output bed: Screen Shot 2022-09-28 at 11 30 39 AM

Does this look right?