katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens
Other
123 stars 65 forks source link

A set of improvements fixing problems with MLST on Klebsiella pneumoniae genomes #62

Closed andreyto closed 8 years ago

andreyto commented 8 years ago

I was running SRST2 on about 380 K. pneumonia genomes, sequenced with 250x2 HiSeq. Our shop has been using SRST2 for other datasets as well, but this was my first time. These sequences happened to have fairly short insert size (median 320 bp), for whatever reason.

After running current upstream master branch on the first batch of 190 genomes, more than half of them received undefined (NF) ST type. However, we had a rough idea that a lot of them should have been ST258. This discrepancy made me to look at the code. I have spent a few days, fixing a couple of obvious bugs, making a few minor improvements, and a couple of major "algorithm" modifications.

With all the changes (and without tweaking of the parameters), I now get 60 NF out of entire set of 380 genomes, regardless if I trim the sequences or not. I have tried a few more things on top of it (different scoring schemes, Bowtie2 parameter tweaks etc) but did not get any further. So, these 60 NF are probably the limit with the current SRST2 design in this dataset.

Here, I will only describe the cases that prompted the two "algorithmic" changes. The rest is well covered by the commit messages.

Change 1

For phoE gene, incorrect allele 240 was selected instead of the correct allele 1. BAM file from SRST2 for phoE_1 gene looked like this: phoE_1 It appears that small fragments that do not belong here get mapped (because SRST2 has to use --local for Bowtie2 in order to map anything at all at the edge of the reference).

The incorrect allele 240 got those too, but somehow in smaller amount, and won as a result: image

The fix: I have modified the code that post-processes the initial SAM file to drop all reads that have overhang over the reference sequence at either side of the read. This fixed most of NFs.

Change 2

In the remaining NFs, I have noticed this situation: The correct allele 79 looked like: image

But instead this allele 179 was selected: image

Selection of allele 179 looked clearly wrong. When looking at the code, I came up with this explanation: You compute the p-value in a one-sided test that this many or fewer matches to the reference were sampled from a binomial distribution. Then you essentially see for which allele the low-end p-values are lower (the lower p-values define the slope in your regression). In this design, higher read depth penalizes the allele, because it results in more extreme p-values for the same ratio of mismatches. But let us think about the reads that mapped to allele 79 above and did not map to allele 179. If we forced them to map to achieve a similar depth for both alleles, we would see a lot more mismatches for the allele 179. This was why those reads did not map in the first place.

The fix: To model (conservatively) mismatches in these unmapped reads without making dramatic changes to the SRST2 algorithm, I scale the observed match/mismatch counts for each allele to the maximum average allele depth observed over all alleles of a given gene, before the binomial test is done. I still kept your scaling of the p-values within the allele, although I could not see any statistical foundation for doing so (it just resulted in better benchmarking results). The count scaling took care of the example above, and resulted in further across the board improvement in the number of NF assignments.

We have also run several of the original NF cases through the MLST Web server to see how assignment based on de-novo consensus sequence would work, and got the results I was getting after my fixes (except for one case where the minor allele was at 48%).

Thanks, Andrey

katholt commented 8 years ago

Thanks Andrey. As discussed via email, we are incorporating most of these changes, including #1 and bug fixes, into the new release, as we have found they improve overall MLST call rates for tricky alleles of Klebsiella while not negatively impacting other performance. Change #2 appears to improve only occasional edge cases (mixed samples) while making overall calling worse so we are not including that. These changes are incorporated in the new version of srst2.py on the master branch and will be in the new release; rather than accepting the pull request which includes some other changes to the directory structures etc which we are not keen to do. Thanks for your thoughtful contributions here, it is much appreciated.