Using 'centered' for read mapping - why and when?

Hi Joshua,

Thanks for helping with my previous post re: mapfactories not working. Your workaround was great. This is less of an 'issue' and hopefully more of a discussion.

My main question is why and when to use 'centered' based read mapping. Following along in your docs, it appears this normalization is used for RNA-seq data specifically, when manipulating mapping dynamics for downstream analyses.

My next question is: under what circumstances would you want to 'nibble' reads as you map them? The only thing I can come up with is to increase your signal after reducing it substantially with the fractional counting mandated by 'centered' mapping.

Lastly, and a bit of a switch in topics, what would your workflow look like to identify the proportion of various start sites used in two different conditions in riboseq data, especially if alt start sites are located in 5' UTRs? I know that plastid likely contains all of the necessary tools to ask this question, but I am having trouble conceptualizing where to begin.

My best guess is to use the segment chain and genomic segment functionality as well as the 'get_counts' method combined with some type of sub-setting command to specifically target AUGs, NUGs, etc. If you could give me a big picture idea here, that would be super helpful!

Thanks for your time and attention. Really enjoying learning how to use Plastid and all of its functionality!

Respectfully, Brad

Hi Brad,

Sorry for the delayed response on this. I have been swamped since August, and this fell off my radar. In case it is still relevant, here is my response:

Center mapping, in which each position covered by a read is incremented by 1.0 / (read length), is useful when you want to see the full extent of a read's coverage (e.g. in a genome browser) but don't want to count each read more than once when quantitating expression. Without this correction, a read's contribution to the inferred expression level would be a function of its length, which is something you want to avoid in many pipelines.
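
To make the arithmetic concrete, here is a minimal pure-Python sketch of the fractional counting described above. It is an illustration only, not plastid's implementation: the function name `center_map`, the `(start, end)` half-open read representation, and the toy reads are all assumptions made for this example.

```python
def center_map(reads, region_length):
    """Spread each read evenly over the positions it covers.

    reads: iterable of (start, end) half-open intervals (hypothetical format).
    Each covered position is incremented by 1.0 / read_length, so every
    read contributes exactly 1.0 in total regardless of its length.
    """
    coverage = [0.0] * region_length
    for start, end in reads:
        weight = 1.0 / (end - start)
        for pos in range(start, end):
            coverage[pos] += weight
    return coverage


# Toy data: one 4-nt read and one 8-nt read on a 12-nt region
reads = [(0, 4), (2, 10)]
coverage = center_map(reads, 12)

# The total signal equals the number of reads, not the number of covered bases
total = sum(coverage)
```

Summing `coverage` over a region then gives a read count rather than a base count, which is why long reads do not inflate the expression estimate.
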
Nibble mapping is useful for ribosome profiling when there is uncertainty about where the P-site is. Specifically, libraries for E. coli or D. melanogaster are often prepped with micrococcal nuclease, which only cuts 5' of A or T residues. This means that footprints are usually not completely resolved to the ends of the ribosome, and thus the P-site cannot be directly inferred as a constant distance from the fragment ends. In this specific case, a useful approximation is to nibble a few nucleotides off each end of the read and fractionally map the P-site over whatever is left. If you are able to use RNase I with your organism, there isn't much benefit to using nibble mapping.
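
The nibbling idea can be sketched the same way. Again, this is a hedged pure-Python illustration rather than plastid's own code: the name `nibble_center_map`, the `nibs` parameter as used here, and the choice to silently skip reads that are too short to survive trimming are assumptions for the example.

```python
def nibble_center_map(reads, region_length, nibs):
    """Trim `nibs` positions from each end of a read, then spread one
    count evenly over whatever remains.

    reads: iterable of (start, end) half-open intervals (hypothetical format).
    Reads with nothing left after trimming are skipped (an assumption).
    """
    coverage = [0.0] * region_length
    for start, end in reads:
        inner_start = start + nibs
        inner_end = end - nibs
        remaining = inner_end - inner_start
        if remaining <= 0:
            continue  # read too short to nibble
        weight = 1.0 / remaining
        for pos in range(inner_start, inner_end):
            coverage[pos] += weight
    return coverage


# Toy data: a 30-nt footprint with 12 nt nibbled from each end leaves
# 6 candidate P-site positions (12..17), each receiving 1/6 of a count
nibbled = nibble_center_map([(0, 30)], 40, nibs=12)
nibbled_total = sum(nibbled)
```

Note that the read still contributes a total of 1.0, so nibbling narrows where the signal lands rather than boosting it.
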
For alternate start site mapping, it wouldn't be hard to implement a script that does what you describe. If you haven't already solved this problem, there is a package called ORF-RATER that has a nice workflow for estimating differential translation of multiple overlapping ORFs (which use different start sites), which you can find here. The associated paper (Fields, Rodriguez, et al., 2015; doi:10.1016/j.molcel.2015.11.013) can be found here.
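
As a very rough starting point for such a script, the sketch below locates candidate start codons in a transcript sequence and compares the share of P-site counts at each one between two conditions. Everything in it is an assumption for illustration: the function names, the 3-nt counting window, and the toy sequence and count vectors are made up, and in practice the count vectors would come from your P-site-mapped alignments. It ignores overlapping reading frames and the statistical modelling that ORF-RATER handles.

```python
def find_start_codons(seq, codons=("ATG", "CTG", "GTG", "TTG")):
    """Return 0-based positions of AUG and near-cognate NUG codons
    (written in DNA alphabet) anywhere in the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] in codons]


def start_site_proportions(counts, starts, window=3):
    """Share of start-site signal at each candidate start.

    counts: per-nucleotide P-site counts for one transcript.
    window: nucleotides summed from each start (hypothetical choice).
    """
    totals = {s: sum(counts[s:s + window]) for s in starts}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {s: 0.0 for s in starts}
    return {s: t / grand_total for s, t in totals.items()}


# Toy transcript with an upstream CTG (position 2) and an ATG (position 7)
seq = "CCCTGAAATGCC"
starts = find_start_codons(seq)

# Made-up count vectors for two conditions
control = [0, 0, 2, 1, 1, 0, 0, 6, 3, 3, 0, 0]
treated = [0, 0, 6, 3, 3, 0, 0, 3, 2, 1, 0, 0]

control_props = start_site_proportions(control, starts)
treated_props = start_site_proportions(treated, starts)
```

Comparing `control_props` with `treated_props` per start site gives a first look at whether usage shifts between conditions, before moving to a proper model.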

Hope this helps! I'm going to close this for now, since it's not quite an issue, but feel free to write back if you do have more questions. I'll be faster to respond.

Cheers, Josh

joshuagryphon / plastid

Using 'centered' for read mapping - why and when? #19