roblanf / sangeranalyseR

functions to analyse sanger sequencing reads in R
MIT License
95 stars 24 forks source link

consider phred scores when consensus building #48

Open roblanf opened 4 years ago

roblanf commented 4 years ago

An idea raised in https://github.com/roblanf/sangeranalyseR/issues/46, to build consensus sequences while taking account of the phred scores. It's certainly possible.

Right now we use DECIPHER's the ConsensusSequence function from DECIPHER, but we could build our own function for this and/or look around for a function that uses phred scores.

foersterst commented 4 years ago

Dear colleagues,

First of all, thanks a lot for this amazing package! I would like to ask for some help to improve the assmebly of consensus DNA sequences. I'm working with DNA sequences from the COI gene (mtDNA) and I'm usnig the function 'SangerAlignment' to get the consensus sequences. Well, I would like to start by asking what is the difference between the M1 and M2 options for the TrimmingMethod argument? Does it works with Phred values?

Second, I'm getting a large number of ambiguous bases when I try to assembly my COI sequences, which becomes indesirable because it is a mtDNA marker and "heterozygous" sites are unexpected. So, I would like to know how can I control (avoid) such ambigous bases? I tried to set the signalRatioCutoff argument to several values (0.99, 0.5, 0.1) but nothing changes. It would be nice if there is a way to set the function 'SangerAlignment' to retrieve only the peak with the greatest height (quality?) instead of put an ambigous base at that position. Finally, I wonder if someone can explain some details of the following arguments: M1TrimmingCutoff, M2CutoffQualityScore, M2SlidingWindowSize, baseNumPerRow, heightPerRow.

Thanks in advance!

Best regards.

roblanf commented 4 years ago

Hi @foersterst!

M1 and M2 methods: M1 is the modified Mott's trimming algorithm that you can also find in Phred/Phrap and Biopython. M2 is like trimmomatic's sliding window method. In general I'd recommend trying both and seeing what gives more sensible trimming for your data, using the Shiny app to visualise what the trimming methods are trimming. And yes, both use phred scores.

On the ambiguous bases: this should be controllable when you change the signalRatioCutoff. If nothing is changing and it looks like it should be according to the chromatogram, then please post a new issue with (ideally) a reproducible example. E.g. an ab1 file, and a couple of commands with different thresholds, along with the output you get, and the output you expect. More generally on this point, I would worry a lot if you are seuqencing mtDNA and your chromatograms are showing evidence of heterzygosity - this is likely to indicate contamination of some sort.

On the argument details, this is an omission in the docs that we will need to fix. I'll open a new issue for that.

Rob