katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens
Other
125 stars 65 forks source link

Miscalling alleles with only indels differences between variants #27

Closed Takadonet closed 9 years ago

Takadonet commented 9 years ago

This will be a long post so the quick summary is that the new version of SRST2 added a new parameter which causes the miss prediction of allele ldh from 1 to 197. It effect only results where the different their is larger in length between different variant of an allele. Parameter in question is called "--mlst_max_mismatch" with default value of 10.

Possible solutions

Details:

A few new parameters were added to SRST2 between the different version from 0.2.0 to 0.2.1 . One of theses parameters is called "--mlst_max_mismatch". Description from the srst2.py --help is the following :

--mlst_max_mismatch MLST_MAX_MISMATCH

Maximum number of mismatches per read for MLST allele calling (default 10)

The parameter will filter reads AFTER they have been mapped with bowtie2 based on the CIGAR string. If it see that it has more then 10 mismatch/indels, it throws out the read from the alignment completely. So another words, pre-filtering data before it starts determining the alleles.

However, this pre-filtering removes data that helps with making the create call in the case with ldh alelle.

Below is a screenshot of the alignment between ldh 1 and ldh 197. They are 100% identical but ldh_1 has an extra 19 base pairs near the end. gap

Since both variants of the alleles have a length differences of 19 base pairs, all reads aligning to ldh_197 that have the extra 19 base pairs are thrown out but should be kept! Attached is the pileup view of the region in question against ldh_197. Top windows is the old version of srst 0.20 and bottom is 0.2.1. As you can see, almost all reads are thrown out in that region.

So when it comes time to score each allele variants,both ldh_1 and ldh_197 both report having no indels when it 197 variant should have 19!

bad_srst_pileup

katholt commented 9 years ago

Quick fix: "turn off" this step by setting --mlst_max_mismatch to a large value, such as the length of your reads; that will stop any mapped reads being discarded.

Permanent fix: Probably need to change code to score indel as a single mismatch regardless of length, at this step in the code.