roblanf / sangeranalyseR

functions to analyse sanger sequencing reads in R
MIT License

Majority rules consensus base calling? #87

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

It appears that sangeranalyseR does not support majority-rules base calling, which could also include a "use the highest-quality base" option. Instead, as the attached alignment shows, it calls ambiguous bases in the consensus sequence: if even one of many reads shows a different base at a position, then the consensus base at that position is ambiguous (e.g., "A" in Read1 and "G" in Read2, so "R" in the consensus).

If one does not want ambiguous bases in the consensus, then one must use very strict read filtering/trimming. Another approach would be to allow for majority-rules base calling, with "majority" weighted by the chromatogram signal intensity/quality at the target position (e.g., use "A" for the consensus at the target position because Read1 has a much "better" signal than Read2, which shows "G" for that same position).
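To make the request concrete, here is a minimal sketch of quality-weighted majority-rules calling. It is written in Python purely for illustration and is not sangeranalyseR code: the function name `weighted_consensus`, its arguments, and the choice to weight each read's vote by its Phred score at that column (as a stand-in for chromatogram signal quality) are all assumptions. It also assumes the reads are already aligned, equal in length, and use uppercase A/C/G/T. An IUPAC ambiguity code is emitted only when the weighted vote is an exact tie.

```python
from collections import defaultdict

# IUPAC codes for sets of tied bases (used only as a tie fallback).
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def weighted_consensus(reads, quals, gap="-"):
    """Call one consensus base per alignment column (hypothetical helper).

    reads: aligned, equal-length strings.
    quals: one list of Phred scores per read, one score per column
           (the score is ignored wherever the read has a gap).
    """
    consensus = []
    for col in range(len(reads[0])):
        # Sum the quality scores voting for each base in this column.
        weight = defaultdict(int)
        for read, qual in zip(reads, quals):
            base = read[col]
            if base != gap:
                weight[base] += qual[col]
        if not weight:
            # Every read has a gap here.
            consensus.append(gap)
            continue
        best = max(weight.values())
        winners = frozenset(b for b, w in weight.items() if w == best)
        if len(winners) == 1:
            consensus.append(next(iter(winners)))
        else:
            # Exact tie: fall back to an ambiguity code.
            consensus.append(IUPAC.get(winners, "N"))
    return "".join(consensus)


reads = ["ACGA", "ACGG"]
quals = [[40, 40, 40, 45], [40, 40, 40, 12]]
print(weighted_consensus(reads, quals))  # ACGA
```

In the last column Read1 votes "A" with quality 45 and Read2 votes "G" with quality 12, so the consensus is "A" rather than "R". Summing Phred scores is only one possible weighting; peak height or a minimum winning margin could be used instead.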

[Attached screenshot of the alignment: Screen Shot 2023-01-30 at 4 10 48 PM]

nick-youngblut commented 1 year ago

It would also greatly help to have a maxLength parameter, so that any sequence longer than this length is trimmed down to it. This is useful when one knows that the Sanger read will definitely be of poor quality after N bases (e.g., trim all reads to a maximum of 600 bp).
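The behaviour being requested is simple enough to sketch. The Python below is illustrative only and is not part of sangeranalyseR: `trim_to_max_length` and its `max_length` argument are hypothetical names mirroring the proposed maxLength parameter, and the sketch assumes trimming happens from the 3' end, with the quality scores cut to match.

```python
def trim_to_max_length(seq, quals, max_length=600):
    """Keep at most the first max_length bases and their quality scores.

    Hypothetical helper: reads at or below max_length are returned unchanged.
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    return seq[:max_length], quals[:max_length]


seq, quals = trim_to_max_length("ACGTACGT", [30, 32, 35, 35, 20, 12, 8, 5], max_length=5)
print(seq, quals)  # ACGTA [30, 32, 35, 35, 20]
```

Applying a hard cap like this before quality trimming would guarantee that no read contributes bases beyond the chosen position to the consensus.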