katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

Gene detection reporting - top gene per gene symbol, rather than the top gene per cluster as intended #7

Closed katholt closed 10 years ago

katholt commented 10 years ago

I found a bug in the code whereby we were reporting the top gene per gene symbol, rather than the top gene per cluster.

So, in cases where very distinct groups of genes share a gene symbol, you would only ever get the single top-scoring allele among them, and genes from the other groups can be missed.

For example, I found some cases today where I was expecting blaOXA-23 to be present, and was getting blaOXA-66 reported as the allele for ‘blaOXA’. In fact, blaOXA is a common gene symbol used for genes that share as little as 70% identity, so each of these subtypes of blaOXA needs to be treated as a different gene that could have its own alleles present. We are prepared for this because we have several distinct blaOXA clusters annotated in our resistance gene database, and we recommend pre-clustering all user databases before using them with SRST2… but the code was not using the cluster IDs properly. So, instead of having blaOXA-23 (cluster 297) and blaOXA-66 (cluster 299) reported as present, I was just seeing blaOXA-66 (blaOXA) in all the outputs.
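To illustrate the difference, here is a minimal sketch of the two grouping strategies. It is not the actual SRST2 code: the `Hit` record, the `top_hit_per_group` helper, and the score values are all made up for illustration, and it assumes that a lower score means a better match. Only the allele names and cluster IDs come from the example above.

```python
from collections import namedtuple

# Hypothetical record for one scored allele (not an SRST2 data structure).
Hit = namedtuple("Hit", ["cluster_id", "symbol", "allele", "score"])

# Illustrative scores only; assumes lower is better.
hits = [
    Hit("297", "blaOXA", "blaOXA-23", 0.02),
    Hit("299", "blaOXA", "blaOXA-66", 0.01),
]


def top_hit_per_group(hits, field):
    """Return the best-scoring hit for each distinct value of `field`."""
    best = {}
    for hit in hits:
        key = getattr(hit, field)
        if key not in best or hit.score < best[key].score:
            best[key] = hit
    return best


# Buggy behaviour: grouping by gene symbol keeps one allele for all of blaOXA.
by_symbol = top_hit_per_group(hits, "symbol")

# Intended behaviour: grouping by cluster ID keeps one allele per cluster.
by_cluster = top_hit_per_group(hits, "cluster_id")

print(sorted(h.allele for h in by_symbol.values()))
print(sorted(h.allele for h in by_cluster.values()))
```

With these made-up scores, grouping by symbol keeps only blaOXA-66, while grouping by cluster ID keeps both blaOXA-23 and blaOXA-66.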

katholt commented 10 years ago

This will appear in the next release (0.1.3).

If you want to reanalyse your data with the new version, you don’t have to rerun any mapping. Just use the --use_existing_scores flag to re-call alleles from your stored scores files (or --use_existing_pileup if you didn’t store the scores).