Closed jusu-E404 closed 6 years ago
Hi @PocketChangeJ,
Please excuse my late reply here; I have been swamped for the past few months with other projects. That said, in case it is still useful, here are my thoughts:
Excluding start & stop codons is easy for single-transcript genes (the case in the demo) and for genes in which every transcript isoform uses the same start and stop codons. In these cases masking is a well-defined problem, because there is only one start codon and one stop codon to mask.
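To make the single-transcript case concrete, here is a minimal sketch in plain Python. It does not use Plastid's API: the function name `codon_mask_segments` and its arguments are made up for illustration. It assumes 0-based, half-open genomic coordinates (as in BED files) and a CDS interval that includes the stop codon, which may not hold for every annotation.

```python
def codon_mask_segments(exons, cds_start, cds_end, strand, n_codons=1):
    """Return genomic intervals covering the first and last `n_codons`
    codons of one transcript's CDS (hypothetical helper, not Plastid API).

    exons:              list of (start, end), 0-based half-open
    cds_start, cds_end: genomic bounds of the CDS, like BED thickStart/thickEnd
    strand:             "+" or "-"
    """
    # Clip each exon to the CDS so only coding nucleotides remain
    cds_blocks = []
    for start, end in sorted(exons):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            cds_blocks.append((s, e))

    n_nt = 3 * n_codons
    cds_len = sum(e - s for s, e in cds_blocks)
    if cds_len < 2 * n_nt:
        raise ValueError("CDS is too short to mask both ends")

    # Walk inward from the left genomic end, crossing introns if needed
    left, remaining = [], n_nt
    for s, e in cds_blocks:
        if remaining <= 0:
            break
        take = min(remaining, e - s)
        left.append((s, s + take))
        remaining -= take

    # Walk inward from the right genomic end
    right, remaining = [], n_nt
    for s, e in reversed(cds_blocks):
        if remaining <= 0:
            break
        take = min(remaining, e - s)
        right.append((e - take, e))
        remaining -= take
    right.reverse()

    # On the minus strand the start codon sits at the right genomic end
    if strand == "+":
        return {"start": left, "stop": right}
    return {"start": right, "stop": left}


# Example: a start codon that is split across an intron
masks = codon_mask_segments([(100, 200), (300, 400)], 198, 310, "+")
print(masks)
```

Looping a function like this over every single-isoform gene in an annotation, and writing the resulting intervals out as a BED file, would be one way to build a mask for the whole file. Setting `n_codons` above 1 covers the "several first and last codons" case.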
For multi-transcript genes that use multiple start or stop codons, things are much more complicated, because nucleotides that form a start or stop codon in one isoform may perform different functions (be internal codons, be out of frame, or not be in the transcript) for other isoforms. Knowing what to mask and why becomes context-dependent.
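As an illustration of why this is context-dependent, the sketch below reports what role one genomic position plays in several made-up isoforms. It is plain Python rather than Plastid code, the helper `describe_position` and all coordinates are hypothetical, and it uses the same 0-based, half-open convention as BED files, with the stop codon assumed to be inside the CDS interval.

```python
def describe_position(exons, cds_start, cds_end, strand, pos):
    """Describe the role of genomic position `pos` within one isoform
    (hypothetical helper for illustration only)."""
    exons = sorted(exons)
    if not any(s <= pos < e for s, e in exons):
        return "not in transcript"
    if not (cds_start <= pos < cds_end):
        return "UTR"

    # Exonic portions of the CDS
    cds_blocks = [
        (max(s, cds_start), min(e, cds_end))
        for s, e in exons
        if max(s, cds_start) < min(e, cds_end)
    ]
    cds_len = sum(e - s for s, e in cds_blocks)

    # Number of coding nucleotides lying to the genomic left of pos
    left = sum(min(e, pos) - s for s, e in cds_blocks if s < pos)

    # Offset from the first nucleotide of the CDS, in transcript orientation
    offset = left if strand == "+" else cds_len - 1 - left

    if offset < 3:
        return "start codon"
    if offset >= cds_len - 3:
        return "stop codon"
    return "internal codon, position %d" % (offset % 3)


# Four made-up isoforms of one gene; the same position differs in each
isoforms = {
    "A": ([(100, 200), (300, 400)], 140, 339),
    "B": ([(100, 200), (300, 400)], 110, 339),
    "C": ([(100, 200), (300, 400)], 111, 340),
    "D": ([(100, 130), (300, 400)], 110, 339),
}
roles = {
    name: describe_position(exons, cds_start, cds_end, "+", 140)
    for name, (exons, cds_start, cds_end) in isoforms.items()
}
print(roles)
```

Here position 140 starts the CDS in isoform A, is an in-frame internal nucleotide in B, falls at a different codon position in C, and is spliced out of D, so a single mask cannot be correct for all four.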
Multi-transcript analysis is further complicated by the need to assign reads to a given isoform. Making this sort of inference rigorously requires building appropriate statistical models, which is outside the purview of Plastid itself. That said, such a statistical model is implemented in ORF-RATER, which might help you in this case. In addition, per your original question, ORF-RATER's model accounts for (averaged) overabundances of reads on start and stop codons, so rather than making these corrections manually, you might want to use that.
If you have companion RNA-seq data to analyze, a number of groups have built statistical models to infer which transcript isoforms reads come from. In particular, I recommend looking at Salmon and Kallisto, though, again, these models were developed specifically for RNA-seq (not ribosome profiling) data.
I hope this helps. Please let me know if you have any questions; I'll get back to you more quickly next time.
Cheers, Josh
Closing this issue because no further questions were received. Feel free to comment & re-open if you have further questions.
Hi,
First, thank you so much for sharing this tool! I'm relatively new to the data analysis world, and your tool helped me understand how to deal with my ribosome profiling results. There is one problem I don't yet understand how to approach: I would like to count ribosome footprints over the CDS, but exclude the first and last several codons. You mention this briefly in your tutorial, but I'm not sure how to go about it. Should I create a mask file that masks those codons for all the transcripts, then use it during "cs generate"? I saw the example on the masking page that creates a mask for the codons of a single transcript in the demo BED file, so I suppose there should be a way to apply this to the whole BED/GTF file?
Thank you, PC