Closed faricazjj closed 1 year ago
@faricazjj I've looked into this a bit for the test data and the other mouse samples you ran locally. I agree with you that the output of the assemble step is not suitable. Spot checking through some of the coordinates, many seem to correspond to transcription start sites as well. This leads me to think it's some more general gene/transcript annotation derivative file and we cannot really tell what is what easily.
Using the compare step and parsing from the relative_usage_quantification_out
is the way to go, BUT currently we are only grabbing half of the sites from this output (the proximal PAS only). The output of relative_usage_quantification_out
looks like this:
We are grabbing, I think, the coordinate from changepoint
column. This is great and corresponds to the changepoint within the broader region considered, meaning this is a proximal PAS only. The distal PAS can also be obtained from each row. The locus
column represents the entire region examined by the tool (upstream_segment
+ downstream_segment
) so to get the dPAS coordinate we just need to take the end
coordinate from locus
if strand == +
and take the start
coordinate from locus
if strand == -
.
Maybe there's another issue on how to make this work for relative inclusion, but just continuing on from the above to get it written down, because each row represents a pPAS + dPAS combo we can grab the relative quantification information from here as well from the site_usage
column (second to last). So site_usage
should be the percent pPAS used (the changepoint
coordinate above) and 1 - site_usage
would be the percent dPAS used (from the locus
coordinate described above).
I am not sure how / if the tool will report genes with single PAS only (no changepoint).
Thank you so much for the review and the fix @mrgazzara !! :D I added the extraction of distal PAS in the new commit so we're now getting two sites from each row, one for proximal and the other for distal. Here's an example for control_r1 of the test data: Isoscm compare step raw txt output:
Our workflow's identification output bed:
Does this look right?
Fixes #429
Background Currently, we get results for identification challenge from one of the output files from the assemble step. Initially, this file seems to contain the identified changepoints just by looking at the content of the file.
But as I read further beyond the assemble step, the second step they describe is the compare command. The compare command "reports the differential usage of each identified change-point", which I expected to show the site usages of the same sites as the sites from the previous assemble step shown in the screenshot above. But I see less and also different sites in the compare step output file
This led me to look more into whether we should be extracting sites for identification challenge from the assemble or compare step.
As additional context, the steps to run isoscm are:
Discussion Even though the first isoscm/tmp/{sample}.cp.filtered.gtf file contains changepoint locations, I'm not entirely sure we should obtain identification output sites from there since it's in a tmp folder and the github readme doesn't explain what the files in the tmp folders are--they only explained the files outside of the tmp folder from assemble step i.e they explained the files from step 2 above but not files from step 1. I think the sites for identification might have to be obtained from the compare step output (i.e. isoscm/compare/{sample}.txt). After reading their readme, I checked their paper and saw that in their paper, they didn't mention assemble step to be where we get identified changpoints. They mentioned from "...Using the “assemble” keyword IsoSCM will assemble the mapped reads in a BAM file into a splice graph, identify nested terminal exons boundaries using the constrained segmentation procedure, and report the resulting models in GTF format....Pairwise comparison of tandem isoform usage can be performed using the “compare” keyword, which reports the relative usage of change points in each sample in a tabular format."
Outcome Hence, it sounds like the compare step outputs the identified change points (or PAS) that we want to report.
Testing The changes in this PR works locally with our test data
Checklist