hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License
29 stars 5 forks source link

How are ambiguous bases in reads (or reference) handled? #38

Open tseemann opened 4 years ago

tseemann commented 4 years ago
  1. It is not uncomming to have N bases in reads.
  2. Also sometimes in bacteria we get other IUPAC codes in the reference eg. R

How are these each handled in MapCaller?

hsinnan75 commented 4 years ago

In fact, bwt_index handles the issues. When there are N bases or other IUPAC codes in the reference, it just replaces them with random ACGT bases since the encoding scheme is optimized to use 2 bits for indexing DNA sequences.

tseemann commented 4 years ago

That solves (2).

How do you handle (1) ?

hsinnan75 commented 4 years ago

If there are Ns in a read sequence, they are still aligned with the reference if the read is mapped. They are not considered as mismatches when evaluating alignment quality, however, the they will affect the alignment score. If they occurs at either ends with a block, they will be clipped from the read sequence. Do you have other suggestions to handle N bases that I missed in the MapCaller?

tseemann commented 4 years ago

So they do not count for -maxmm ?

Do they count for -maxclip ?

hsinnan75 commented 4 years ago

Since Ns are normally appear at the ends of short reads, they will be identified when MapCaller checks the mapping quality at both ends and discards the sub-alignment including Ns. So, yes, they count for -maxclip.