ribosomeprofiling / riboflow

Pipeline for Ribosome Profiling Data
MIT License

Running Riboflow on a transcriptome without CDS, 3' or 5' UTR defined? #7

Closed bioinfonerd closed 4 years ago

bioinfonerd commented 4 years ago

My purpose for using Ribo-Seq is to help define the CDS, 3' UTR, and 5' UTR for the RNA-Seq and Ribo-Seq data I have. I like the ribo data structure and the implementation of RiboFlow, but the requirement that the UTRs and CDS be defined may be a deal-breaker for me.

I have to ask: is there a way to deactivate that requirement?

hakanozadam commented 4 years ago

The short answer is no, not directly, BUT there is a workaround.

If a transcript is missing the 5' UTR or 3' UTR, you can simply extend the CDS on either side by, say, 100 nucleotides (or even more) to artificially introduce 5' UTR or 3' UTR sequences for that transcript. This has been done successfully in one case before.

So, for example, if a transcript T is missing 3' UTR,

1) Go to the genomic reference, find the stop site of T, and grab the 100 nts immediately downstream of it; call them T_UTR3.

2) Append T_UTR3 to the transcriptomic reference sequence of T.

3) Update the annotation of the transcript so that T has a 3' UTR of length 100, by giving the corresponding boundary values in the annotation BED file.

4) Compile the reference and run RiboFlow with the new reference. You can take a look at this repo, and the Python notebooks therein, for making new references.
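Steps 1 to 3 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from RiboFlow or the reference repo: the function names `add_utr3` and `bed_lines` are hypothetical, genomic coordinates are assumed to be 0-based and half-open, and the annotation is assumed to be a six-column BED in transcript coordinates with the region name (UTR5, CDS, UTR3) in the fourth column. Check the layout of your own annotation file before adapting it.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def add_utr3(transcript_seq, cds_start, chrom_seq, tx_start, tx_end, strand, flank=100):
    """Append an artificial 3' UTR to a transcript whose CDS runs to its last base.

    tx_start and tx_end are the transcript's genomic boundaries on chrom_seq
    (assumed 0-based, half-open). Returns the extended sequence and a list of
    (region, start, end) tuples in transcript coordinates.
    """
    if strand == "+":
        # Downstream of the stop site is to the right on the genome.
        extra = chrom_seq[tx_end:tx_end + flank]
    elif strand == "-":
        # On the minus strand, downstream is to the left; flip it to transcript orientation.
        extra = reverse_complement(chrom_seq[max(0, tx_start - flank):tx_start])
    else:
        raise ValueError("strand must be '+' or '-'")

    cds_end = len(transcript_seq)  # the transcript has no 3' UTR, so the CDS ends here
    new_seq = transcript_seq + extra

    regions = []
    if cds_start > 0:
        regions.append(("UTR5", 0, cds_start))
    regions.append(("CDS", cds_start, cds_end))
    regions.append(("UTR3", cds_end, len(new_seq)))
    return new_seq, regions


def bed_lines(name, regions):
    """Format regions as BED lines (assumed layout: name, start, end, region, score, strand)."""
    return ["\t".join([name, str(start), str(end), region, "0", "+"])
            for region, start, end in regions]


# Toy example: a 9-nt transcript on the plus strand, extended by 4 nt.
toy_chrom = "AAAA" + "ATGCCCTAA" + "GGGTTT"
toy_seq, toy_regions = add_utr3("ATGCCCTAA", 0, toy_chrom, 4, 13, "+", flank=4)
for line in bed_lines("T", toy_regions):
    print(line)
```

Near the end of a chromosome the slice may return fewer than `flank` bases; the UTR3 boundary is taken from the actual extended length, so the annotation stays consistent with the sequence.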

bioinfonerd commented 4 years ago

And what if the CDS are not known, or possibly altered? I could 'guess' what the CDS would be based on different ORF metrics, but it's likely not 100% correct. It sounds like the defined regions are how the summary statistics and the ribo file are made.

But how are reads that align outside of the defined CDS handled? I was thinking of using your workflow in a 2-pass setup, where the CDS regions would be updated after the first round.

hakanozadam commented 4 years ago

You are right that RiboFlow relies heavily on annotation. RiboFlow maps the reads to the entire transcript (5' UTR, CDS, and 3' UTR), so sequences outside of the CDS are still aligned and quantified. The resulting quantified values, such as metagene and region counts, are based on the distribution of the reads over the CDS and UTR regions. Hence it is not possible to run RiboFlow without annotation.
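To make the dependence on annotation concrete, here is a deliberately simplified sketch of how read positions could be binned into regions once the CDS boundaries are known. It is not RiboFlow's actual counting logic; `region_of` and `count_regions` are hypothetical helpers, and positions are assumed to be 0-based transcript coordinates with a half-open CDS interval.

```python
from collections import Counter


def region_of(position, cds_start, cds_end):
    """Classify a transcript position as UTR5, CDS, or UTR3."""
    if position < cds_start:
        return "UTR5"
    if position < cds_end:
        return "CDS"
    return "UTR3"


def count_regions(read_positions, cds_start, cds_end):
    """Count reads per region for one transcript."""
    counts = Counter({"UTR5": 0, "CDS": 0, "UTR3": 0})
    for position in read_positions:
        counts[region_of(position, cds_start, cds_end)] += 1
    return dict(counts)


# Toy example: five read positions on a transcript with CDS at [10, 50).
print(count_regions([0, 5, 10, 49, 50], cds_start=10, cds_end=50))
```

Without `cds_start` and `cds_end` there is nothing to bin against, which is why the annotation cannot simply be switched off.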

Given an annotation, and assuming that your ribosome profiling data is very high quality (and therefore reliable), RiboFlow would be a very useful tool for assessing the quality of your annotation. For example, the CDS percentage in region counts should be very high for a good annotation. However, RiboFlow by itself is not very useful for determining the annotation of the transcriptome; that would require an additional step. So, going back to the idea of a 2-pass setup, RiboFlow would be very useful for the second step, to test or score the annotation. However, an additional computational tool is needed for the first step, to determine the CDS and UTRs.
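As a rough illustration of the scoring idea in the second pass, the sketch below takes per-transcript region counts and flags transcripts whose CDS fraction is low. It assumes the counts have already been exported into plain dictionaries; the helper names, the 0.8 threshold, and the minimum read count are arbitrary placeholders, not values recommended by RiboFlow.

```python
def cds_fraction(counts):
    """Fraction of a transcript's reads that fall in the CDS, or None if it has no reads."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts.get("CDS", 0) / total


def flag_low_cds(region_counts, threshold=0.8, min_reads=10):
    """Return (transcript, fraction) pairs below threshold, lowest fraction first.

    Transcripts with fewer than min_reads reads are skipped because their
    fraction is too noisy to interpret. Both cutoffs are illustrative assumptions.
    """
    flagged = []
    for name, counts in region_counts.items():
        if sum(counts.values()) < min_reads:
            continue
        fraction = cds_fraction(counts)
        if fraction < threshold:
            flagged.append((name, fraction))
    return sorted(flagged, key=lambda item: item[1])


# Toy example with made-up counts for three transcripts.
toy_counts = {
    "T1": {"UTR5": 5, "CDS": 90, "UTR3": 5},
    "T2": {"UTR5": 30, "CDS": 40, "UTR3": 30},
    "T3": {"UTR5": 1, "CDS": 1, "UTR3": 0},
}
print(flag_low_cds(toy_counts))
```

Transcripts that come out flagged would be the candidates to revisit with whatever tool proposes the CDS and UTR boundaries in the first pass.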