hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License
29 stars 5 forks source link

How do i only accept (mostly) glocal alignments? #36

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

What scoring threshold does MapCaller use to decide whether an alignment is "good enough"?

If only the middle half of a read aligns, I do NOT want to use that. Poor local alignments with lots of soft clipping either end produce false-positive SNP calls. When using BAM files I can filter with samclip etc but MapCaller skips this step.

How can i control which alignments (length, score) are used? (besides -minmm)

I would like something like -maxclip 5 to allow at most 5 bp not aligned at each end of the read.

hsinnan75 commented 4 years ago

A read alignment in MapCaller is divided into several parts according to the sequence similarity to the reference. A read alignment in MapCaller probably looks like this

image

Blocks A, B, C, D are perfect match, and gaps between them include sequence variations. Therefore, MapCaller can easily identify the sequence variation from the read sequence partition. If only the middle half of a read aligns, MapCaller is able to identify such reads and discard them. If the sub-alignment at either ends has at least 5 nucleotides and the number of mismatches >= 3 or the alignment status >= 4 (alignment status = M/I/D) , it is considered a poor alignment. We infer breakpoints using such read alignments if only one end is considered poor alignment.

However, only read alignments without any clips at either ends are used for variant calling. A random false alignment would not cause a false call since the frequency is normally very low. And most of false alignments are detected and discarded, therefore, MapCaller produces high accuracy of SNP and Indel calls.

I'll try to add more arguments to let users customize the settings while running MapCaller in the future versions. Thanks for the suggestion.

tseemann commented 4 years ago

This sounds good, but I am still a bit unsure of "how much different" an alignment can be, and still be accepted. It would be good to be able to control this via the cmdline.

Thanks for the prompt answer again.

hsinnan75 commented 4 years ago

In general, a good alignment indicates the sequence identity is supposed to be at least 95%. However, this definition is very empirical. The sequencing error rate of Illumina machines is at most 2%, therefore, I think 95% is a appropriate cutoff. I'll add arguments which control read alignments in the next update (v0.9.9.18).

tseemann commented 4 years ago

@hsinnan75 i understand the 95% argument, but in bacterial genomic epidemiology that is not always the case, because we do not always have close reference genomes, and our collection may consist of 1000 isolates with divergence amongst them. these SNPs are then used for phylogenetic inference. But normally yes, it's within 5% DNA, but could be 10%.

It would be great to have an option to control the alignment sensitivity or scoring thresholds!

Illumina error rate is getting worse again (due to 1 and 2 colour lasers instead of 4) but usually below 0.5% yes.

hsinnan75 commented 4 years ago

Thank you for the explanation. I'll modify MapCaller to make it more flexible for controlling the sequence alignment.

hsinnan75 commented 4 years ago

Hi, I've uploaded an update (v.0.9.9.18) by adding "-maxclip" to set the maximal clip size in a read alignment. The default is 5. Any read alignment with clips whose size > maxclip will be discarded.