tseemann / abricate

:mag_right: :pill: Mass screening of contigs for antimicrobial and virulence genes
GNU General Public License v2.0

Incorrect gene symbols from NCBI database #159

Open evolarjun opened 3 years ago

evolarjun commented 3 years ago

It looks like abricate is reporting the internal family-id instead of a gene symbol for the "ncbi" database. The two look very similar, and the AMRFinderPlus database file formats were not designed with reuse in mind, so the confusion is understandable.

We have lots of extra "family" symbols because our database has a lot of structure that isn't represented in the nomenclature.

Unfortunately, we did not create the AMRFinderPlus database with public consumption in mind. The file formats are documented at https://github.com/ncbi/amr/wiki/AMRFinderPlus-database#file-formats, and none of the fields in the AMR_CDS file is exactly a gene symbol: allele symbols are included there, but "gene" symbols are not. ReferenceGeneCatalog.txt is the canonical public data source and has the correct gene symbol for each sequence in the Pathogen Detection Reference Gene Catalog.
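As a rough illustration of taking symbols from the catalog rather than from AMR_CDS, here is a minimal Python sketch. It assumes the catalog is a tab-delimited file with a header row. The function name `load_symbol_map`, the column names `acc` and `sym`, and the sample rows are all hypothetical placeholders, not the real ReferenceGeneCatalog.txt layout, so check the documented file formats before relying on it.

```python
import csv
import io


def load_symbol_map(handle, key_column, symbol_column):
    """Build {sequence identifier: gene symbol} from a tab-delimited table.

    key_column and symbol_column are supplied by the caller; this sketch
    does not assume any particular header names in the real catalog.
    """
    symbols = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # DictReader yields None for fields missing from a short row.
        key = (row.get(key_column) or "").strip()
        symbol = (row.get(symbol_column) or "").strip()
        if key and symbol:
            symbols[key] = symbol
    return symbols


# Made-up sample data standing in for the catalog (placeholder values only).
sample = io.StringIO("acc\tsym\nACC_1\tgeneA\nACC_2\t\n")
symbol_map = load_symbol_map(sample, key_column="acc", symbol_column="sym")
print(symbol_map)
```

Rows with an empty identifier or an empty symbol are skipped, so a lookup miss stays visible to the caller instead of silently falling back to a family-id.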