joshuagryphon / plastid

Position-wise analysis of sequencing and genomics data
https://plastid.readthedocs.io

single gene for phase_by_size #10

Closed chouhj closed 7 years ago

chouhj commented 7 years ago

Hi,

First, I would like to thank you for providing this useful and easy-to-use pipeline. It is a great tool for a biologist like me who has limited ability to write code. I have used the metagene and phase_by_size commands and both work well (there may be a bug with UTRs in metagene, but I worked around it with a pseudo-annotation).

I wonder if you could add a feature to phase_by_size that reports the phasing data for each gene, instead of metagene information grouped by read size. That would be super helpful for looking at frameshifting at the single-gene level: I want to check whether any gene is frameshifted in my yeast mutants, and that cannot be seen in the metagene data.

Thank you very much!

joshuagryphon commented 7 years ago

Hi @chouhj

Thank you for writing with this suggestion to make plastid better! This particular feature request is pretty specific to your experimental needs, so I think it falls a bit outside the purview of plastid. That said, it should be fairly easy for you to calculate what you need, either by using get_count_vectors or by following the examples in this tutorial: http://plastid.readthedocs.io/en/latest/examples/phasing.html . In addition, Audrey Michel wrote a really nice paper on (and accompanying software for) identifying frameshifts in ribosome profiling data: http://genome.cshlp.org/cgi/pmidlookup?view=long&pmid=22593554 . If you haven't already read it, I suggest you check it out. It might perform exactly the analysis you need!
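As a rough illustration of the per-gene calculation, here is a minimal sketch in plain Python. It assumes you have already obtained one per-nucleotide count vector for each CDS (for example from get_count_vectors output or from the tutorial's approach), with index 0 at the first nucleotide of the start codon. The function name `phase_fractions` and the toy data are hypothetical and are not part of plastid.

```python
def phase_fractions(counts):
    """Return the fraction of read counts in frames 0, 1, and 2.

    `counts` is assumed to be a per-nucleotide list of read counts over
    one CDS, with index 0 at the first nucleotide of the start codon.
    """
    usable = len(counts) - len(counts) % 3  # drop a trailing partial codon
    frame_totals = [0.0, 0.0, 0.0]
    for i in range(usable):
        frame_totals[i % 3] += counts[i]
    total = sum(frame_totals)
    if total == 0:
        return None  # no reads: phasing is undefined for this gene
    return [t / total for t in frame_totals]


# Hypothetical toy data: gene name -> per-nucleotide count vector
toy_counts = {
    "geneA": [8, 1, 1, 6, 2, 2, 10, 0, 0],
    "geneB": [0, 0, 0, 0, 0, 0],
}

# One phasing estimate per gene, instead of one pooled metagene estimate
per_gene = {name: phase_fractions(vec) for name, vec in toy_counts.items()}
```

Genes with few reads will give noisy fractions, so in practice you would probably also apply a minimum-count filter before comparing genes.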

Also, having worked on detection of frameshift events in ribosome profiling myself (in a really fun but unpublished project that I am unlikely ever to write up), I would caution against trying to identify frameshifts by looking at phasing averaged over entire genes. First, most frameshift events are inefficient (~5% is a high frameshift rate in cases of known, regulated frameshifts). Second, if the events you're looking for are inefficient or occur too late in the CDS, the contribution of out-of-frame translation to the gene-wide average might be too small to reliably detect above the noise. For example, suppose you have a high frameshift efficiency of 5%, and the frameshift occurs halfway through the annotated CDS. In this case, you'd expect 2.5% of the reads to exhibit altered phasing, which might not look very different from no change, given the variance in phasing between genes.
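The dilution arithmetic in that example can be written out as a small sketch. It assumes read density is uniform along the CDS, so that only reads downstream of the shift site can carry the altered phase. The function name `expected_shifted_fraction` is hypothetical.

```python
def expected_shifted_fraction(efficiency, shift_position):
    """Expected fraction of a gene's reads that show altered phasing.

    efficiency     -- fraction of ribosomes that frameshift (0 to 1)
    shift_position -- location of the shift site as a fraction of CDS
                      length (0 = start codon, 1 = stop codon)

    Assumes uniform read density along the CDS (an idealization).
    """
    if not (0.0 <= efficiency <= 1.0 and 0.0 <= shift_position <= 1.0):
        raise ValueError("both arguments must lie between 0 and 1")
    # Only the portion of the CDS downstream of the shift site contributes
    return efficiency * (1.0 - shift_position)


# The example from the comment: 5% efficiency, halfway through the CDS
example = expected_shifted_fraction(0.05, 0.5)
```

Under these assumptions `example` comes out to 0.025, i.e. the 2.5% quoted above, and a shift site later in the CDS shrinks the signal further.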

Approaches like Audrey's, in contrast, seek out a change point, if it exists, in the data. Approaches like these can be both sensitive and precise.
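To make the change-point idea concrete, here is a minimal sketch of a generic likelihood-ratio scan. This is not the method from Audrey Michel's paper; it is only an illustration of what "seeking a change point" can mean. It tries every codon boundary as a candidate split and asks how much better two separate frame distributions (upstream and downstream) explain the counts than a single gene-wide distribution. The names `best_change_point` and `min_codons` are hypothetical, and the input is assumed to be a per-nucleotide count vector starting at the start codon.

```python
import math


def _loglik(frame_counts):
    """Multinomial log-likelihood at its maximum, up to a constant."""
    total = sum(frame_counts)
    return sum(c * math.log(c / total) for c in frame_counts if c > 0)


def best_change_point(counts, min_codons=5):
    """Return (split_codon, gain) for the best single change in phasing.

    split_codon is the index of the first codon of the downstream
    segment, or None if no split improves on the one-segment model.
    gain is the log-likelihood improvement of the two-segment model.
    """
    n_codons = len(counts) // 3
    codons = [counts[3 * i:3 * i + 3] for i in range(n_codons)]
    total = [sum(c[f] for c in codons) for f in range(3)]
    null = _loglik(total)  # one frame distribution for the whole CDS

    best_split, best_gain = None, 0.0
    left = [0, 0, 0]
    for k in range(1, n_codons):
        for f in range(3):
            left[f] += codons[k - 1][f]  # running upstream frame totals
        if k < min_codons or n_codons - k < min_codons:
            continue  # skip splits that leave a very short segment
        right = [total[f] - left[f] for f in range(3)]
        gain = _loglik(left) + _loglik(right) - null
        if gain > best_gain:
            best_split, best_gain = k, gain
    return best_split, best_gain


# Toy data: 20 codons in frame 0, then 20 codons shifted to frame 1
toy = [10, 0, 0] * 20 + [0, 10, 0] * 20
split, gain = best_change_point(toy)
```

The gain on its own is not a significance test: noise alone produces small positive gains, so a real analysis would compare it against a threshold or a null distribution (for example, from shuffled codons).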

I'm going to label this issue as a feature request, and will close it for now. If others respond and request this feature, I'll consider adding it in the future.

Thank you again, and please write if you have any more questions. Cheers,

Josh