sanger-pathogens / mlst_check

Multilocus sequence typing by blast using the schemes from PubMLST
http://sanger-pathogens.github.io/mlst_check/

[468141,463456]: Cope better with small changes to alleles #60

Closed bewt85 closed 9 years ago

bewt85 commented 9 years ago

This PR addresses two tickets, 468141 and 463456.

Most of this PR gives users more information when there are small variations within an allele. Previously, a small change would either report 'UNKNOWN' or, in some cases, silently return a different hit or fail to mention contamination.

This also now returns the sequence data for the partial match.

I'd welcome some in-depth review / critique of this PR if anyone fancies a discussion.

Testing

Rebecca and I have tested this on every known sequence type (ST) of Streptococcus pneumoniae (using synthetic data) and have checked the new results against all of the existing STs in her database.

Known issues

There's still an edge case: a variation at the end of an allele could, in some cases, result in the same behaviour as before. This hasn't been an issue in any of our tests and I've marked the relevant code. I don't propose that we address this for the foreseeable future; I'll raise a ticket if anyone thinks we should.

The code in the PR runs slower than the old code but not prohibitively so.

Background

MLST uses BLAST to look up which combination of alleles a sequence has. Previously there were issues when one allele was a truncation of another; this was fixed in a previous ticket. This PR fixes the scenario where a SNP falls in the region of the longer allele that is missing from its truncated counterpart. If the change is small enough, the code will now prefer to return the imperfect but longer match.
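To make the preference for the longer match concrete, here is a minimal Python sketch of the idea. It is not the code in this PR (mlst_check itself is Perl). The `Hit` record, the `pick_hit` function and the `max_differences` threshold are all hypothetical names and values chosen for illustration, and it assumes the BLAST hits have already been parsed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    """One parsed BLAST hit against an allele (hypothetical structure)."""
    allele: str          # allele name; the names used below are made up
    allele_length: int   # full length of the allele in the scheme
    aligned_length: int  # number of allele bases covered by the alignment
    differences: int     # mismatches plus gaps within the alignment


def pick_hit(hits: Sequence[Hit], max_differences: int = 2) -> Optional[Hit]:
    """Prefer the longest fully covered allele, tolerating a few differences.

    max_differences is an assumed threshold, not a value taken from the PR.
    """
    candidates = [
        h for h in hits
        if h.aligned_length == h.allele_length
        and h.differences <= max_differences
    ]
    if not candidates:
        return None
    # Longest allele wins; among equal lengths, the fewest differences wins.
    return max(candidates, key=lambda h: (h.allele_length, -h.differences))


hits = [
    Hit("geneA_3", 400, 400, 0),  # exact match to the truncated allele
    Hit("geneA_7", 450, 450, 1),  # longer allele with a single SNP
    Hit("geneA_9", 450, 450, 5),  # too many differences to be considered
]
best = pick_hit(hits)
is_exact = best is not None and best.differences == 0
```

With these made-up hits, `best` is the `geneA_7` hit and `is_exact` is `False`, so a caller could report the result as an imperfect match instead of silently returning the shorter exact hit.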

There were a few shenanigans in the commit history because I underestimated how similar alleles were to one another, but hopefully this is now fixed following extensive testing.