Gene detection reporting - top gene per gene symbol, rather than the top gene per cluster as intended

I found a bug in the code: we were reporting the top gene per gene symbol, rather than the top gene per cluster.

So, in cases where very distinct groups of genes share a gene symbol, you would only ever get the top-scoring allele among all of them, and genes from the other groups are missed.

For example, I found some cases today where I was expecting blaOXA-23 to be present, but was getting blaOXA-66 reported as the allele for ‘blaOXA’. blaOXA is actually a common gene symbol used for genes that share as little as 70% identity, so each of these subtypes of blaOXA needs to be treated as a different gene that could have its own alleles present. We are prepared for this: we have several distinct blaOXA clusters annotated in our resistance gene database, and we recommend pre-clustering all user databases before using them with SRST2. But the code was not using the cluster IDs properly. So, instead of having blaOXA-23 (cluster 297) and blaOXA-66 (cluster 299) reported as present, I was just seeing blaOXA-66 (blaOXA) in all the outputs.
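To make the difference concrete, here is a minimal sketch of the two grouping behaviours. It is not the SRST2 code: the `Hit` tuple, the `top_hit_per_key` helper, the score values, and the second cluster-299 allele name are all made up for illustration, and it assumes a lower score means a better match.

```python
from collections import namedtuple

# Hypothetical record for one candidate allele; not an SRST2 data structure.
Hit = namedtuple("Hit", ["cluster_id", "symbol", "allele", "score"])


def top_hit_per_key(hits, key):
    """Return the best hit for each distinct value of key(hit).

    Assumes a lower score is a better match (an assumption of this sketch).
    """
    best = {}
    for hit in hits:
        k = key(hit)
        if k not in best or hit.score < best[k].score:
            best[k] = hit
    return best


# Made-up scores; cluster IDs 297 and 299 are the ones mentioned above.
hits = [
    Hit("297", "blaOXA", "blaOXA-23", 0.2),
    Hit("299", "blaOXA", "blaOXA-66", 0.1),
    Hit("299", "blaOXA", "hypothetical_allele", 0.9),
]

# Buggy behaviour: grouping on gene symbol collapses both clusters into one.
by_symbol = top_hit_per_key(hits, key=lambda h: h.symbol)

# Intended behaviour: grouping on cluster ID keeps one top allele per cluster.
by_cluster = top_hit_per_key(hits, key=lambda h: h.cluster_id)

print(sorted(h.allele for h in by_symbol.values()))
print(sorted(h.allele for h in by_cluster.values()))
```

Grouping on the symbol leaves only blaOXA-66, while grouping on the cluster ID reports both blaOXA-23 and blaOXA-66, which is the output I was expecting.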

katholt / srst2, issue #7