ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

Different results when using annotation files in the command line #140

Closed MostafaYA closed 5 months ago

MostafaYA commented 5 months ago

Hi, I am using AMRFinderPlus on a C. diff genome. I ran the same sample with and without prior annotation, but the results were different (please see below). I cannot understand why cfr(C) is absent from the results of the first command.

Thanks for your help

Software version: 3.12.8 Database version: 2024-01-31.1

$ amrfinder -p CD_21S0467-D02.faa -n CD_21S0467-D02.fasta -O Clostridioides_difficile -d ./amrfinderplus-db/latest -g CD_21S0467-D02.gff3 -a bakta

Protein identifier   Contig id     Start   Stop    Strand   Gene symbol   Method 
------------------   -----------   -----   -----   ------   -----------   -------
ILDIMB_13575         contig00011   29924   30850   +        blaCDD-2      ALLELEP
ILDIMB_13660         contig00011   48218   50017   -        blaR1         HMM   
ILDIMB_17930         contig00022   11366   12232   -        aadE          EXACTP 
ILDIMB_18620         contig00025   19142   20071   -        erm(52)       BLASTP

$ amrfinder -n CD_21S0467-D02.fasta -O Clostridioides_difficile -d ./amrfinderplus-db/latest

Protein identifier   Contig id     Start   Stop    Strand   Gene symbol   Method 
------------------   -----------   -----   -----   ------   -----------   -------
NA                   contig00011   29924   30847   +        blaCDD-2      ALLELEX
NA                   contig00013   30975   32063   +        cfr(C)        BLASTX 
NA                   contig00022   11369   12232   -        aadE          EXACTX 
NA                   contig00025   19145   20071   -        erm(52)       BLASTX

vbrover commented 5 months ago

Could you post the files CD_21S0467-D02.faa, CD_21S0467-D02.fasta and CD_21S0467-D02.gff3? Or at least the cfr(C) protein in CD_21S0467-D02.faa mapping on contig00013 from 30975 to 32063?

MostafaYA commented 5 months ago

Hi

Here is the annotation info from bakta. Would you mind if I sent the files to you by email (e.g. to pd-help@ncbi.nlm.nih.gov)? The reason is that the data is not mine.

contig00013 Prodigal    CDS 30975   32225   .   +   0   ID=ILDIMB_14890;Name=23S rRNA (adenine(2503)-C(8))-methyltransferase Cfr;locus_tag=ILDIMB_14890;product=23S rRNA (adenine(2503)-C(8))-methyltransferase Cfr;Dbxref=RefSeq:WP_021434980.1,SO:0001217,UniParc:UPI00038CBF9B,UniRef:UniRef100_A0A417SUK8,UniRef:UniRef50_A0A3B0CKZ1,UniRef:UniRef90_A0A1Q1PTQ2;gene=cfr

MostafaYA commented 5 months ago

ILDIMB_14890.txt

evolarjun commented 5 months ago

Hi Mostafa,

Yes, please send the files to pd-help@ncbi.nlm.nih.gov, and mention GitHub issue 140 in the email so I can make sure I see it. We can check them there. My guess is that there is an HMM hit that is suppressing the reporting of the gene once AMRFinderPlus is able to run HMMER to match the proteins, but that's just a guess.

Thanks, Arjun

MostafaYA commented 5 months ago

I just sent the files. Let me know if anything else is needed. Thanks in advance!

evolarjun commented 5 months ago

Hi Mostafa,

Hopefully you saw my response to your email, but I will post it publicly here as well for the record.

The cfr(C) gene was not called because of a somewhat obscure rule in the extensive ruleset of AMRFinderPlus: if a query protein hits a reference protein by BLAST at less than 98% identity, it must also score above the cutoff for an HMM (if any) at a higher level in the hierarchy. We have described this in our papers, but sometimes even we forget all the details.

This particular rule was created early in AMRFinder development to avoid false calls in edge cases. We are reviewing whether the rule is something we should change or drop.
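For readers following along, the rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AMRFinderPlus's actual implementation: the `Node` class, the `report_blast_hit` function, and all cutoff and score values are hypothetical, and the sketch assumes that only the nearest ancestor carrying an HMM is checked.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical hierarchy node; hmm_cutoff is None when the node has no HMM."""
    name: str
    parent: Optional["Node"] = None
    hmm_cutoff: Optional[float] = None


def nearest_ancestor_with_hmm(node: Node) -> Optional[Node]:
    """Walk up the hierarchy and return the first ancestor that carries an HMM."""
    current = node.parent
    while current is not None:
        if current.hmm_cutoff is not None:
            return current
        current = current.parent
    return None


def report_blast_hit(node: Node, identity: float,
                     hmm_scores: Dict[str, float],
                     identity_threshold: float = 98.0) -> bool:
    """Decide whether a BLAST hit to `node` is reported under the rule.

    Hits at or above the identity threshold are always reported. Below it,
    the protein must also reach the HMM cutoff of the nearest ancestor that
    has an HMM (assumption: one ancestor, inclusive cutoff).
    """
    if identity >= identity_threshold:
        return True
    ancestor = nearest_ancestor_with_hmm(node)
    if ancestor is None:
        return True  # no HMM higher in the hierarchy, so nothing to satisfy
    score = hmm_scores.get(ancestor.name)
    return score is not None and score >= ancestor.hmm_cutoff


# Hypothetical numbers: a divergent cfr(C) protein scores poorly on the parent HMM.
cfr_gen = Node("cfr_gen", hmm_cutoff=500.0)
cfr_c = Node("cfr(C)", parent=cfr_gen)
suppressed = report_blast_hit(cfr_c, identity=95.0, hmm_scores={"cfr_gen": 300.0})
high_identity = report_blast_hit(cfr_c, identity=99.0, hmm_scores={"cfr_gen": 300.0})

# Broadening the parent HMM (here modelled as a lower cutoff) lets the hit through.
cfr_gen.hmm_cutoff = 250.0
after_fix = report_blast_hit(cfr_c, identity=95.0, hmm_scores={"cfr_gen": 300.0})
```

In this toy setup the sub-98% hit is dropped while the parent HMM is too narrow, and it is reported again once the parent HMM accepts the divergent sequence.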

However, what happened in your specific case is that a new divergent sequence and node (cfr(C)) with a curated BLAST rule was later added as a child of the cfr_gen node, and the new cfr(C) sequence is different enough that the parent node's HMM should have been reviewed and updated to broaden its scope. See https://www.ncbi.nlm.nih.gov/pathogens/genehierarchy/#cfr*%20OR%20cipA%20OR%20clb* for the current hierarchy and HMM at the cfr_gen node. The HMM has now been reviewed and modified, and with the next AMRFinderPlus database release the cfr(C) protein you saw will be reported in combined mode as well as nucleotide-only mode.

Sorry again for the delayed responses and thank you very much for pointing this out. It's an edge case we hadn't caught, and the particular set of circumstances required to reveal it means we might not have noticed it for quite some time if we hadn't gotten your report.

I will close this ticket once we release a new AMRFinderPlus database that fixes this issue.

Arjun

MostafaYA commented 5 months ago

Thanks a lot

evolarjun commented 5 months ago

Hi Mostafa,

We just released AMRFinderPlus database version 2024-05-02.2. We reviewed a QC check we had made to help identify these kinds of cases, and made several changes to the hierarchy structure and HMMs to hopefully prevent this from happening again for any other genes. We ended up removing the abc-f HMM that was causing this issue because we couldn't tune a single HMM at this level of the hierarchy to be both sensitive and specific.

Anyway, you should now find the gene in both combined and nucleotide-only modes in this case, and going forward as well, since we've added it to the suite of tests we run before every release.

Thanks again for reporting, Arjun
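
For anyone who wants to check their own samples for the same kind of discrepancy, a minimal Python sketch for comparing the gene symbols reported by a combined run and a nucleotide-only run might look like the following. It assumes the output is tab-delimited with a `Gene symbol` header column, as in the tables posted in this thread; the function names are hypothetical and the inline data are abbreviated from those tables.

```python
import csv
import io


def gene_symbols(tsv_text, column="Gene symbol"):
    """Return the set of gene symbols found in tab-delimited report text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def compare_runs(combined_tsv, nucleotide_tsv):
    """List gene symbols that appear in only one of the two runs."""
    combined = gene_symbols(combined_tsv)
    nucleotide = gene_symbols(nucleotide_tsv)
    return {
        "only_combined": sorted(combined - nucleotide),
        "only_nucleotide": sorted(nucleotide - combined),
    }


# Abbreviated copies of the two result tables from this issue (two columns only).
combined_report = (
    "Gene symbol\tMethod\n"
    "blaCDD-2\tALLELEP\n"
    "blaR1\tHMM\n"
    "aadE\tEXACTP\n"
    "erm(52)\tBLASTP\n"
)
nucleotide_report = (
    "Gene symbol\tMethod\n"
    "blaCDD-2\tALLELEX\n"
    "cfr(C)\tBLASTX\n"
    "aadE\tEXACTX\n"
    "erm(52)\tBLASTX\n"
)

diff = compare_runs(combined_report, nucleotide_report)
print(diff)
```

With real data, the two strings would be replaced by the contents of the report files from each run. A gene listed under `only_nucleotide` that has a curated BLAST rule is a candidate for the kind of suppression discussed in this issue.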