Some alleles are truncations of other alleles. If the specimen matches the longer allele, it necessarily also matches the shorter one. Previously, the algorithm would see that two different alleles had matched and conclude that the sample was contaminated.
The algorithm now checks whether two perfect matches overlap and, if they do, returns only the longer one.
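As a rough illustration of the idea (not the code in this PR), the filtering can be sketched as follows. The `Match` tuple, the `drop_overlapping_matches` function, the allele names and the use of inclusive coordinates on the assembly are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """A perfect match of one allele against the assembly (hypothetical structure)."""
    allele: str
    start: int  # inclusive start coordinate on the assembly
    end: int    # inclusive end coordinate on the assembly


def drop_overlapping_matches(matches: List[Match]) -> List[Match]:
    """Keep only the longest match from each group of overlapping perfect matches."""
    kept: List[Match] = []
    # Visit the longest matches first so a truncated allele is always
    # compared against the longer allele that covers it.
    for m in sorted(matches, key=lambda m: m.end - m.start, reverse=True):
        overlaps_kept = any(k.start <= m.end and m.start <= k.end for k in kept)
        if not overlaps_kept:
            kept.append(m)
    return kept


matches = [
    Match("adk-1", 100, 599),    # full-length allele
    Match("adk-7", 100, 499),    # truncation of adk-1, same location
    Match("adk-3", 2000, 2499),  # a different allele elsewhere in the assembly
]
result = drop_overlapping_matches(matches)
```

In this sketch the truncated `adk-7` is dropped because it overlaps the longer `adk-1`, while `adk-3` sits at a different location and is still reported, so two non-overlapping perfect matches can still be treated as contamination.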
There are a couple of anomalies that I spotted while doing this work. The biggest one is that the algorithm favours a slightly better match over a much longer but almost perfect match. For example, if there is contamination and one of the alleles carries a single SNP (in either the actual sample or the contaminant), the allele with the SNP isn't detected. Likewise, if an allele has a truncated version and there is a SNP in the region that the truncation removes, the algorithm erroneously reports the shorter allele. I've not fixed either issue yet, but I have added a FIXME in the relevant code.
Please merge https://github.com/sanger-pathogens/mlst_check/pull/50 and https://github.com/sanger-pathogens/mlst_check/pull/51 first.