ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

question - how to install older database? #112

Closed kapsakcj closed 1 year ago

kapsakcj commented 1 year ago

Hi Arjun and team,

I have a quick question - is there a way with the executable amrfinder (or otherwise) to install an older version of the amrfinderplus database? For example, if I wanted to reproduce results from last year using an older database version 2022-08-09.1. Is there a way to pin the install to that specific database version?

I know I can provide the database location with amrfinder --database /path/to/db but I wasn't sure if other things need to be done such as database indexing, setup, etc.

I realize there may be incompatibilities between the amrfinderplus version & database versions, but let's ignore those for now.

I've read through the (amazing) documentation and could not find the answer so that led me here.

Curtis

vbrover commented 1 year ago

Quick answer is that all databases are stored at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/, so that you can choose any historic database. But you need to run makeblastdb and hmmpress and choose the compatible historic software because the latest one may not work with an old database.

vbrover commented 1 year ago

The latest amrfinder (version 3.11.4) has the program amrfinder_index which runs makeblastdb and hmmpress on a given database.

evolarjun commented 1 year ago

We haven't actually released AMRFinderPlus 3.11.4 which includes amrfinder_index, though I should be able to get that done in the next day or two. One simple way to run old versions at least since 3.10.14-2021-08-11.1 is to use the docker containers we're now producing. They freeze a given version of the software and database together. See https://hub.docker.com/r/ncbi/amr/tags?page=1
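As a rough sketch of the container approach, the snippet below composes a pinned image reference. It assumes that Docker Hub tags follow the same SOFTWARE-DATABASE pattern as the version string quoted above (3.10.14-2021-08-11.1); `pinned_image` is a hypothetical helper, not part of AMRFinderPlus, and the actual tag list should be checked on Docker Hub.

```shell
#!/bin/sh
# Hypothetical helper: build an image reference from a software version
# and a database version.
# Assumption: tags look like SOFTWARE-DATABASE, e.g. 3.10.14-2021-08-11.1.
pinned_image() {
    printf 'ncbi/amr:%s-%s\n' "$1" "$2"
}

image=$(pinned_image 3.10.14 2021-08-11.1)
echo "$image"

# To use it (needs Docker and network access; assumes amrfinder is on the
# image's PATH):
#   docker run --rm "$image" amrfinder -l
```

Referring to the image by a full tag rather than `latest` is what keeps the software and database frozen together between runs.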

evolarjun commented 1 year ago

Hello again Curtis,

I (mostly) finished the release of AMRFinderPlus version 3.11.4, including amrfinder_index. To download an older version of the database, locate its directory at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/, download all the files, and run amrfinder_index on the directory.

For example:

wget ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/2022-12-19.1/*
amrfinder_index .

To find which major.minor versions of the software are compatible with a given database version, use either the directory name on the FTP site or the major.minor version number in the database_format_version.txt file contained within the database.
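The major.minor comparison can be sketched in a few lines of shell. This is only an illustration: `major_minor` and `versions_compatible` are hypothetical helper names, and it assumes database_format_version.txt holds a dotted version string on its first line.

```shell
#!/bin/sh
# Print the major.minor prefix of a dotted version string (3.11.4 -> 3.11).
major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed when the software version and the database format version
# share the same major.minor prefix.
versions_compatible() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Possible usage (the path is a placeholder):
#   db_format=$(head -n 1 /path/to/db/database_format_version.txt)
#   versions_compatible 3.11.4 "$db_format" || echo "incompatible database" >&2
```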

Let us know if you can't get this to work, but to be honest, I recommend using the docker container method described above because it's simpler. (That's what I do when I want to see what results would have been with older versions.)

Note that the bioconda package for this release is still not out because they seem to be having a build problem in their CI pipeline (see the pull request for status). Once that's cleared up this version should be included in bioconda as well.

Arjun

kapsakcj commented 1 year ago

Thanks for the quick replies and addition of amrfinder_index! And thanks for the example commands, that helps too. Looking forward to testing it.

I asked this question originally because I'm helping to maintain the StaPH-B docker images for amrfinder and our goal is to have a way to pin versions of dependencies, such as the amrfinder database, so this will help immensely going forward.

Instead of grabbing and indexing the latest database with amrfinder -u, we would prefer to grab the database files from the FTP site and then index them. This allows us to rebuild docker images with older DB versions, if ever necessary.
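A minimal sketch of that pinning, assuming the FTP layout shown earlier in the thread (database/MAJOR.MINOR/DB_VERSION/). The `db_url` function is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# Build the download URL for one pinned database release.
# $1 = software major.minor (e.g. 3.11), $2 = database version (e.g. 2022-12-19.1)
db_url() {
    printf 'https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/%s/%s/\n' "$1" "$2"
}

url=$(db_url 3.11 2022-12-19.1)
echo "$url"

# The files under "$url" would then be downloaded and indexed, either with
# amrfinder_index (3.11.4 and later) or by running makeblastdb and hmmpress.
```

Keeping both values as explicit arguments means a rebuild with an older database only requires changing the two version strings.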

Lastly - THANK YOU for providing docker images and the dockerfiles, not many developers do that, so thank you for your efforts there.

evolarjun commented 1 year ago

I just looked at the DockerHub description and realized we don't have a link to the Dockerfile we use there, and the description of AMRFinderPlus is out of date. I don't have an authorized account to update those things, but I'll try to find someone who does. It sounds like you found it anyway, but just for future reference the Dockerfile and a script to create the image are in https://github.com/ncbi/docker/tree/master/amr

kapsakcj commented 1 year ago

Thanks for the link and yes I found the dockerfile, but that's because I've seen the ncbi/docker github repo in the past.

dockerhub can inherit the main /README.md if linked to a GitHub repo (if you were using dockerhub infrastructure for building images), but in this case it's probably easier to just update the dockerhub repo description manually and link back to https://github.com/ncbi/docker/tree/master/amr

kapsakcj commented 1 year ago

OK getting into the nitty gritty details here. I'm re-working our dockerfile for amrfinder v3.11.2 that does not contain amrfinder_index and would like to manually index the database files. Am I doing this correctly? I'll provide my Dockerfile code and comment throughout.

I'm definitely not knowledgeable about C++, but I gathered this info from the amrfinder_index code here: https://github.com/ncbi/amr/blob/amrfinder_v3.11.4/amrfinder_index.cpp

Relevant part of Dockerfile:

RUN mkdir -p /amrfinder/data/${AMRFINDER_DB_VER} && \
# download database files from NCBI FTP
wget -q -P /amrfinder/data/${AMRFINDER_DB_VER} ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/${AMRFINDER_DB_VER}/* && \
# change into dir with files just downloaded
cd /amrfinder/data/${AMRFINDER_DB_VER} && \
# run hmmpress and makeblastdb on downloaded files
hmmpress AMR.LIB && \
makeblastdb -in AMRProt -dbtype prot && \
makeblastdb -in AMR_CDS -dbtype nucl && \
# have to do this step to ensure docker build is using bash shell and not /bin/sh
/bin/bash -c '\
# loop through the organism specific files, example: AMR_DNA-Clostridioides_difficile.tab
for ORG in AMR_DNA*.tab; do \
  # set a new bash variable for each FASTA file
  INPUT_FASTA=$(echo $ORG | cut -d "." -f 1); \
  # makeblastdb on each of those FASTA files
  makeblastdb -in ${INPUT_FASTA} -dbtype nucl ; \
  done' && \
# generate softlink 
ln -s /amrfinder/data/${AMRFINDER_DB_VER} /amrfinder/data/latest
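As an alternative sketch of the loop above, the FASTA names can be derived from the .tab files with parameter expansion instead of echo and cut. It uses only POSIX shell features, so it should also run under /bin/sh. `fasta_names` is a hypothetical helper name, and the sketch assumes each AMR_DNA*.tab file sits next to a FASTA file of the same base name.

```shell
#!/bin/sh
# Print the FASTA base name for every organism-specific table in directory $1.
fasta_names() {
    for tab in "$1"/AMR_DNA*.tab; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$tab" ] || continue
        name=${tab##*/}                # strip the directory part
        printf '%s\n' "${name%.tab}"   # strip the .tab suffix
    done
}

# Possible usage inside the database directory:
#   fasta_names . | while read -r fa; do
#       makeblastdb -in "$fa" -dbtype nucl
#   done
```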

The output looks correct to me, and I'm able to run a series of tests with and without the amrfinder --organism flag:

RUN mkdir -p /amrfinder/data/2023-02-23.1 && wget -q -P /amrfinder/data/2023-02-23.1 ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/2023-02-23.1/* && cd /amrfinder/data/2023-02-23.1 && hmmpress AMR.LIB && makeblastdb -in AMRProt -dbtype prot && makeblastdb -in AMR_CDS -dbtype nucl && /bin/bash -c 'for ORG in AMR_DNA*.tab; do   INPUT_FASTA=$(echo $ORG | cut -d "." -f 1);   echo "makeblastdb -in ${INPUT_FASTA} -dbtype nucl" ;  makeblastdb -in ${INPUT_FASTA} -dbtype nucl ;   done' && ln -s /amrfinder/data/2023-02-23.1 /amrfinder/data/latest
#7 4.745 Working...    done.
#7 7.136 Pressed and indexed 688 HMMs (688 names and 688 accessions).
#7 7.136 Models pressed into binary file:   AMR.LIB.h3m
#7 7.136 SSI index for binary model file:   AMR.LIB.h3i
#7 7.136 Profiles (MSV part) pressed into:  AMR.LIB.h3f
#7 7.136 Profiles (remainder) pressed into: AMR.LIB.h3p
#7 7.165 
#7 7.165 
#7 7.165 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.165 New DB name:   /amrfinder/data/2023-02-23.1/AMRProt
#7 7.165 New DB title:  AMRProt
#7 7.165 Sequence type: Protein
#7 7.166 Keep MBits: T
#7 7.166 Maximum file size: 1000000000B
#7 7.375 Adding sequences from FASTA; added 7809 sequences in 0.209066 seconds.
#7 7.382 
#7 7.382 
#7 7.414 
#7 7.414 
#7 7.414 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.414 New DB name:   /amrfinder/data/2023-02-23.1/AMR_CDS
#7 7.414 New DB title:  AMR_CDS
#7 7.414 Sequence type: Nucleotide
#7 7.415 Keep MBits: T
#7 7.415 Maximum file size: 1000000000B
#7 7.669 Adding sequences from FASTA; added 7614 sequences in 0.253481 seconds.
#7 7.674 
#7 7.674 
#7 7.687 makeblastdb -in AMR_DNA-Campylobacter -dbtype nucl
#7 7.715 
#7 7.715 
#7 7.715 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.715 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Campylobacter
#7 7.715 New DB title:  AMR_DNA-Campylobacter
#7 7.715 Sequence type: Nucleotide
#7 7.715 Keep MBits: T
#7 7.715 Maximum file size: 1000000000B
#7 7.716 Adding sequences from FASTA; added 2 sequences in 0.00100303 seconds.
#7 7.720 
#7 7.720 
#7 7.729 makeblastdb -in AMR_DNA-Clostridioides_difficile -dbtype nucl
#7 7.757 
#7 7.757 
#7 7.757 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.757 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Clostridioides_difficile
#7 7.757 New DB title:  AMR_DNA-Clostridioides_difficile
#7 7.757 Sequence type: Nucleotide
#7 7.757 Keep MBits: T
#7 7.757 Maximum file size: 1000000000B
#7 7.758 Adding sequences from FASTA; added 1 sequences in 0.000326872 seconds.
#7 7.763 
#7 7.763 
#7 7.772 makeblastdb -in AMR_DNA-Enterococcus_faecalis -dbtype nucl
#7 7.798 
#7 7.798 
#7 7.798 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.798 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Enterococcus_faecalis
#7 7.798 New DB title:  AMR_DNA-Enterococcus_faecalis
#7 7.798 Sequence type: Nucleotide
#7 7.799 Keep MBits: T
#7 7.799 Maximum file size: 1000000000B
#7 7.800 Adding sequences from FASTA; added 1 sequences in 0.000380039 seconds.
#7 7.805 
#7 7.805 
#7 7.815 makeblastdb -in AMR_DNA-Enterococcus_faecium -dbtype nucl
#7 7.841 
#7 7.841 
#7 7.841 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.841 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Enterococcus_faecium
#7 7.841 New DB title:  AMR_DNA-Enterococcus_faecium
#7 7.841 Sequence type: Nucleotide
#7 7.842 Keep MBits: T
#7 7.842 Maximum file size: 1000000000B
#7 7.842 Adding sequences from FASTA; added 1 sequences in 0.000319004 seconds.
#7 7.848 
#7 7.848 
#7 7.857 makeblastdb -in AMR_DNA-Escherichia -dbtype nucl
#7 7.884 
#7 7.884 
#7 7.884 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.884 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Escherichia
#7 7.884 New DB title:  AMR_DNA-Escherichia
#7 7.884 Sequence type: Nucleotide
#7 7.885 Keep MBits: T
#7 7.885 Maximum file size: 1000000000B
#7 7.887 Adding sequences from FASTA; added 4 sequences in 0.001405 seconds.
#7 7.891 
#7 7.891 
#7 7.900 makeblastdb -in AMR_DNA-Klebsiella_oxytoca -dbtype nucl
#7 7.928 
#7 7.928 
#7 7.928 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.928 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Klebsiella_oxytoca
#7 7.928 New DB title:  AMR_DNA-Klebsiella_oxytoca
#7 7.928 Sequence type: Nucleotide
#7 7.929 Keep MBits: T
#7 7.929 Maximum file size: 1000000000B
#7 7.929 Adding sequences from FASTA; added 1 sequences in 0.000295877 seconds.
#7 7.934 
#7 7.934 
#7 7.944 makeblastdb -in AMR_DNA-Neisseria_gonorrhoeae -dbtype nucl
#7 7.971 
#7 7.971 
#7 7.971 Building a new DB, current time: 03/09/2023 23:30:59
#7 7.971 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Neisseria_gonorrhoeae
#7 7.971 New DB title:  AMR_DNA-Neisseria_gonorrhoeae
#7 7.971 Sequence type: Nucleotide
#7 7.972 Keep MBits: T
#7 7.972 Maximum file size: 1000000000B
#7 7.973 Adding sequences from FASTA; added 6 sequences in 0.00129104 seconds.
#7 7.978 
#7 7.978 
#7 7.988 makeblastdb -in AMR_DNA-Salmonella -dbtype nucl
#7 8.015 
#7 8.015 
#7 8.015 Building a new DB, current time: 03/09/2023 23:30:59
#7 8.015 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Salmonella
#7 8.015 New DB title:  AMR_DNA-Salmonella
#7 8.015 Sequence type: Nucleotide
#7 8.016 Keep MBits: T
#7 8.016 Maximum file size: 1000000000B
#7 8.017 Adding sequences from FASTA; added 1 sequences in 0.000291109 seconds.
#7 8.023 
#7 8.023 
#7 8.032 makeblastdb -in AMR_DNA-Staphylococcus_aureus -dbtype nucl
#7 8.059 
#7 8.059 
#7 8.059 Building a new DB, current time: 03/09/2023 23:30:59
#7 8.059 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Staphylococcus_aureus
#7 8.059 New DB title:  AMR_DNA-Staphylococcus_aureus
#7 8.059 Sequence type: Nucleotide
#7 8.060 Keep MBits: T
#7 8.060 Maximum file size: 1000000000B
#7 8.061 Adding sequences from FASTA; added 2 sequences in 0.00118399 seconds.
#7 8.066 
#7 8.066 
#7 8.076 makeblastdb -in AMR_DNA-Streptococcus_pneumoniae -dbtype nucl
#7 8.103 
#7 8.103 
#7 8.103 Building a new DB, current time: 03/09/2023 23:31:00
#7 8.103 New DB name:   /amrfinder/data/2023-02-23.1/AMR_DNA-Streptococcus_pneumoniae
#7 8.103 New DB title:  AMR_DNA-Streptococcus_pneumoniae
#7 8.103 Sequence type: Nucleotide
#7 8.104 Keep MBits: T
#7 8.104 Maximum file size: 1000000000B
#7 8.105 Adding sequences from FASTA; added 1 sequences in 0.000328064 seconds.
#7 8.110 
#7 8.110 
#7 DONE 8.5s

#8 [app 5/5] WORKDIR /data
#8 DONE 0.0s

#9 [test 1/4] RUN amrfinder -l
#9 0.444 Running: amrfinder -l
#9 0.444 Software directory: '/amrfinder/'
#9 0.444 Software version: 3.11.2
#9 0.445 Database directory: '/amrfinder/data/2023-02-23.1'
#9 0.445 Database version: 2023-02-23.1
#9 0.448 
#9 0.448 Available --organism options: Acinetobacter_baumannii, Burkholderia_cepacia, Burkholderia_pseudomallei, Campylobacter, Clostridioides_difficile, Enterococcus_faecalis, Enterococcus_faecium, Escherichia, Klebsiella_oxytoca, Klebsiella_pneumoniae, Neisseria_gonorrhoeae, Neisseria_meningitidis, Pseudomonas_aeruginosa, Salmonella, Staphylococcus_aureus, Staphylococcus_pseudintermedius, Streptococcus_agalactiae, Streptococcus_pneumoniae, Streptococcus_pyogenes, Vibrio_cholerae
#9 DONE 0.5s

#10 [test 2/4] RUN amrfinder --plus -p /amrfinder/test_prot.fa -g  /amrfinder/test_prot.gff -O Escherichia > test_prot.got &&   diff /amrfinder/test_prot.expected test_prot.got &&   amrfinder --plus -n /amrfinder/test_dna.fa -O Escherichia > test_dna.got &&   diff /amrfinder/test_dna.expected test_dna.got &&   amrfinder --plus -n /amrfinder/test_dna.fa -p /amrfinder/test_prot.fa -g /amrfinder/test_prot.gff -O Escherichia > test_both.got &&   diff /amrfinder/test_both.expected test_both.got
#10 0.435 Running: amrfinder --plus -p /amrfinder/test_prot.fa -g /amrfinder/test_prot.gff -O Escherichia
#10 0.435 Software directory: '/amrfinder/'
#10 0.435 Software version: 3.11.2
#10 0.435 Database directory: '/amrfinder/data/2023-02-23.1'
#10 0.435 Database version: 2023-02-23.1
#10 0.435 AMRFinder protein-only and mutation search
#10 0.435   - include -n NUC_FASTA, --nucleotide NUC_FASTA and -g GFF_FILE, --gff GFF_FILE options to add translated searches
#10 0.469 Running blastp...
#10 2.589 Running hmmsearch...
#10 3.667 Making report...
#10 3.762 AMRFinder took 3 seconds to complete
#10 3.772 Running: amrfinder --plus -n /amrfinder/test_dna.fa -O Escherichia
#10 3.772 Software directory: '/amrfinder/'
#10 3.772 Software version: 3.11.2
#10 3.772 Database directory: '/amrfinder/data/2023-02-23.1'
#10 3.773 Database version: 2023-02-23.1
#10 3.773 AMRFinder translated nucleotide and mutation search
#10 3.782 Running blastx...
#10 6.673 Running blastn...
#10 6.784 Making report...
#10 6.934 AMRFinder took 3 seconds to complete
#10 6.942 Running: amrfinder --plus -n /amrfinder/test_dna.fa -p /amrfinder/test_prot.fa -g /amrfinder/test_prot.gff -O Escherichia
#10 6.943 Software directory: '/amrfinder/'
#10 6.943 Software version: 3.11.2
#10 6.943 Database directory: '/amrfinder/data/2023-02-23.1'
#10 6.944 Database version: 2023-02-23.1
#10 6.944 AMRFinder combined translated and protein and mutation search
#10 6.954 Running blastp...
#10 9.003 Running hmmsearch...
#10 10.05 Running blastx...
#10 12.94 Running blastn...
#10 13.04 Making report...
#10 13.25 AMRFinder took 7 seconds to complete
#10 DONE 13.3s

#11 [test 3/4] RUN wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/010/941/835/GCA_010941835.1_PDT000052640.3/GCA_010941835.1_PDT000052640.3_genomic.fna.gz  &&   gzip -d GCA_010941835.1_PDT000052640.3_genomic.fna.gz &&   amrfinder --plus --nucleotide GCA_010941835.1_PDT000052640.3_genomic.fna --output test1.txt &&   amrfinder --plus --nucleotide GCA_010941835.1_PDT000052640.3_genomic.fna --organism Salmonella --output test2.txt &&   cat test1.txt test2.txt
#11 0.389 --2023-03-09 23:31:14--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/010/941/835/GCA_010941835.1_PDT000052640.3/GCA_010941835.1_PDT000052640.3_genomic.fna.gz
#11 0.397 Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 130.14.250.7, 2607:f220:41e:250::13, ...
#11 0.399 Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:443... connected.
#11 0.452 HTTP request sent, awaiting response... 200 OK
#11 0.587 Length: 1431272 (1.4M) [application/x-gzip]
#11 0.588 Saving to: 'GCA_010941835.1_PDT000052640.3_genomic.fna.gz'
#11 0.601 
#11 0.601      0K .......... .......... .......... .......... ..........  3% 1.77M 1s
#11 0.615     50K .......... .......... .......... .......... ..........  7% 3.54M 1s
#11 0.629    100K .......... .......... .......... .......... .......... 10% 78.5M 0s
#11 0.630    150K .......... .......... .......... .......... .......... 14% 3.74M 0s
#11 0.643    200K .......... .......... .......... .......... .......... 17% 60.8M 0s
#11 0.643    250K .......... .......... .......... .......... .......... 21%  112M 0s
#11 0.644    300K .......... .......... .......... .......... .......... 25%  121M 0s
#11 0.644    350K .......... .......... .......... .......... .......... 28% 3.86M 0s
#11 0.657    400K .......... .......... .......... .......... .......... 32% 41.2M 0s
#11 0.658    450K .......... .......... .......... .......... .......... 35%  139M 0s
#11 0.658    500K .......... .......... .......... .......... .......... 39%  191M 0s
#11 0.660    550K .......... .......... .......... .......... .......... 42%  180M 0s
#11 0.660    600K .......... .......... .......... .......... .......... 46%  185M 0s
#11 0.660    650K .......... .......... .......... .......... .......... 50%  200M 0s
#11 0.660    700K .......... .......... .......... .......... .......... 53%  211M 0s
#11 0.660    750K .......... .......... .......... .......... .......... 57% 4.16M 0s
#11 0.672    800K .......... .......... .......... .......... .......... 60%  101M 0s
#11 0.672    850K .......... .......... .......... .......... .......... 64%  157M 0s
#11 0.672    900K .......... .......... .......... .......... .......... 67%  114M 0s
#11 0.675    950K .......... .......... .......... .......... .......... 71%  196M 0s
#11 0.675   1000K .......... .......... .......... .......... .......... 75%  156M 0s
#11 0.675   1050K .......... .......... .......... .......... .......... 78%  146M 0s
#11 0.675   1100K .......... .......... .......... .......... .......... 82%  169M 0s
#11 0.675   1150K .......... .......... .......... .......... .......... 85%  181M 0s
#11 0.675   1200K .......... .......... .......... .......... .......... 89%  169M 0s
#11 0.675   1250K .......... .......... .......... .......... .......... 93% 6.30M 0s
#11 0.683   1300K .......... .......... .......... .......... .......... 96%  131M 0s
#11 0.683   1350K .......... .......... .......... .......... .......   100%  203M=0.1s
#11 0.683 
#11 0.683 2023-03-09 23:31:14 (14.3 MB/s) - 'GCA_010941835.1_PDT000052640.3_genomic.fna.gz' saved [1431272/1431272]
#11 0.683 
#11 0.741 Running: amrfinder --plus --nucleotide GCA_010941835.1_PDT000052640.3_genomic.fna --output test1.txt
#11 0.741 Software directory: '/amrfinder/'
#11 0.741 Software version: 3.11.2
#11 0.741 Database directory: '/amrfinder/data/2023-02-23.1'
#11 0.741 Database version: 2023-02-23.1
#11 0.741 AMRFinder translated nucleotide search
#11 0.741   - include -O ORGANISM, --organism ORGANISM option to add mutation searches and suppress common proteins
#11 0.943 Running tblastn...
#11 64.56 Making report...
#11 64.74 AMRFinder took 64 seconds to complete
#11 64.75 Running: amrfinder --plus --nucleotide GCA_010941835.1_PDT000052640.3_genomic.fna --organism Salmonella --output test2.txt
#11 64.75 Software directory: '/amrfinder/'
#11 64.75 Software version: 3.11.2
#11 64.75 Database directory: '/amrfinder/data/2023-02-23.1'
#11 64.75 Database version: 2023-02-23.1
#11 64.75 AMRFinder translated nucleotide and mutation search
#11 64.95 Running tblastn...
#11 128.2 Running blastn...
#11 129.4 Making report...
#11 129.6 AMRFinder took 64 seconds to complete
#11 129.6 Protein identifier    Contig id       Start   Stop    Strand  Gene symbol     Sequence name   Scope   Element type    Element subtype Class   Subclass        Method  Target length   Reference sequence length       % Coverage of reference sequence % Identity to reference sequence        Alignment length        Accession of closest sequence   Name of closest sequence        HMM id  HMM description
#11 129.6 NA    AAPBRJ010000001.1       594233  597859  -       iroC    salmochelin/enterobactin export ABC transporter IroC    plus    VIRULENCE       VIRULENCE       NA      NA      BLASTX  1209    1219    98.85   79.85   1211     AUH19662.1      salmochelin/enterobactin export ABC transporter IroC    NA      NA
#11 129.6 NA    AAPBRJ010000001.1       597943  599055  -       iroB    salmochelin biosynthesis C-glycosyltransferase IroB     plus    VIRULENCE       VIRULENCE       NA      NA      BLASTX  371     371     100.00  86.52   371      EOW04219.1      salmochelin biosynthesis C-glycosyltransferase IroB     NA      NA
#11 129.6 NA    AAPBRJ010000002.1       393809  394270  -       golS    Au(I) sensor transcriptional regulator GolS     plus    STRESS  METAL   GOLD    GOLD    EXACTX  154     154     100.00  100.00  154     AAL19308.1      Au(I) sensor transcriptional regulator GolS      NA      NA
#11 129.6 NA    AAPBRJ010000002.1       394285  396570  -       golT    gold/copper-translocating P-type ATPase GolT    plus    STRESS  METAL   COPPER/GOLD     COPPER/GOLD     BLASTX  762     762     100.00  99.61   762     AAL19307.1       gold/copper-translocating P-type ATPase GolT    NA      NA
#11 129.6 NA    AAPBRJ010000002.1       396847  398070  +       mdsA    multidrug efflux RND transporter periplasmic adaptor subunit MdsA       plus    AMR     AMR     EFFLUX  EFFLUX  BLASTX  408     408     100.00  98.28   408      AAL19306.1      multidrug efflux RND transporter periplasmic adaptor subunit MdsA       NA      NA
#11 129.6 NA    AAPBRJ010000002.1       398070  401234  +       mdsB    multidrug efflux RND transporter permease subunit MdsB  plus    AMR     AMR     EFFLUX  EFFLUX  BLASTX  1055    1055    100.00  99.81   1055    AAL19305.1       multidrug efflux RND transporter permease subunit MdsB  NA      NA
#11 129.6 NA    AAPBRJ010000006.1       124903  125433  +       sodC1   superoxide dismutase [Cu-Zn] SodC1      plus    VIRULENCE       VIRULENCE       NA      NA      EXACTX  177     177     100.00  100.00  177     AAL19978.1       superoxide dismutase [Cu-Zn] SodC1      NA      NA
#11 129.6 NA    AAPBRJ010000011.1       69434   70321   +       fieF    CDF family cation-efflux transporter FieF       plus    STRESS  METAL   NA      NA      BLASTX  296     300     98.67   92.57   296     BAE77395.1      CDF family cation-efflux transporter FieF        NA      NA
#11 129.6 NA    AAPBRJ010000014.1       97144   99075   +       sinH    intimin-like inverse autotransporter SinH       plus    VIRULENCE       VIRULENCE       NA      NA      PARTIALX        644     730     88.22   99.84   644      AAL21411.1      intimin-like inverse autotransporter SinH       NA      NA
#11 129.6 NA    AAPBRJ010000032.1       581     1438    +       blaTEM-57       broad-spectrum class A beta-lactamase TEM-57    core    AMR     AMR     BETA-LACTAM     BETA-LACTAM     ALLELEX 286     286     100.00  100.00  286      WP_032492330.1  broad-spectrum class A beta-lactamase TEM-57    NA      NA
#11 129.6 NA    AAPBRJ010000032.1       4081    5277    -       tet(A)  tetracycline efflux MFS transporter Tet(A)      core    AMR     AMR     TETRACYCLINE    TETRACYCLINE    BLASTX  399     399     100.00  99.75   399     WP_000804064.1   tetracycline efflux MFS transporter Tet(A)      NA      NA
#11 129.6 NA    AAPBRJ010000045.1       1480    2676    +       tet(A)  tetracycline efflux MFS transporter Tet(A)      core    AMR     AMR     TETRACYCLINE    TETRACYCLINE    BLASTX  399     399     100.00  99.75   399     WP_000804064.1   tetracycline efflux MFS transporter Tet(A)      NA      NA
#11 129.6 NA    AAPBRJ010000045.1       5319    6176    -       blaTEM-57       broad-spectrum class A beta-lactamase TEM-57    core    AMR     AMR     BETA-LACTAM     BETA-LACTAM     ALLELEX 286     286     100.00  100.00  286      WP_032492330.1  broad-spectrum class A beta-lactamase TEM-57    NA      NA
#11 129.6 Protein identifier    Contig id       Start   Stop    Strand  Gene symbol     Sequence name   Scope   Element type    Element subtype Class   Subclass        Method  Target length   Reference sequence length       % Coverage of reference sequence % Identity to reference sequence        Alignment length        Accession of closest sequence   Name of closest sequence        HMM id  HMM description
#11 129.6 NA    AAPBRJ010000001.1       594233  597859  -       iroC    salmochelin/enterobactin export ABC transporter IroC    plus    VIRULENCE       VIRULENCE       NA      NA      BLASTX  1209    1219    98.85   79.85   1211     AUH19662.1      salmochelin/enterobactin export ABC transporter IroC    NA      NA
#11 129.6 NA    AAPBRJ010000001.1       597943  599055  -       iroB    salmochelin biosynthesis C-glycosyltransferase IroB     plus    VIRULENCE       VIRULENCE       NA      NA      BLASTX  371     371     100.00  86.52   371      EOW04219.1      salmochelin biosynthesis C-glycosyltransferase IroB     NA      NA
#11 129.6 NA    AAPBRJ010000002.1       393809  394270  -       golS    Au(I) sensor transcriptional regulator GolS     plus    STRESS  METAL   GOLD    GOLD    EXACTX  154     154     100.00  100.00  154     AAL19308.1      Au(I) sensor transcriptional regulator GolS      NA      NA
#11 129.6 NA    AAPBRJ010000002.1       394285  396570  -       golT    gold/copper-translocating P-type ATPase GolT    plus    STRESS  METAL   COPPER/GOLD     COPPER/GOLD     BLASTX  762     762     100.00  99.61   762     AAL19307.1       gold/copper-translocating P-type ATPase GolT    NA      NA
#11 129.6 NA    AAPBRJ010000002.1       396847  398070  +       mdsA    multidrug efflux RND transporter periplasmic adaptor subunit MdsA       plus    AMR     AMR     EFFLUX  EFFLUX  BLASTX  408     408     100.00  98.28   408      AAL19306.1      multidrug efflux RND transporter periplasmic adaptor subunit MdsA       NA      NA
#11 129.6 NA    AAPBRJ010000002.1       398070  401234  +       mdsB    multidrug efflux RND transporter permease subunit MdsB  plus    AMR     AMR     EFFLUX  EFFLUX  BLASTX  1055    1055    100.00  99.81   1055    AAL19305.1       multidrug efflux RND transporter permease subunit MdsB  NA      NA
#11 129.6 NA    AAPBRJ010000006.1       124903  125433  +       sodC1   superoxide dismutase [Cu-Zn] SodC1      plus    VIRULENCE       VIRULENCE       NA      NA      EXACTX  177     177     100.00  100.00  177     AAL19978.1       superoxide dismutase [Cu-Zn] SodC1      NA      NA
#11 129.6 NA    AAPBRJ010000008.1       160594  163227  +       gyrA_S83Y       Salmonella quinolone resistant GyrA     core    AMR     POINT   QUINOLONE       QUINOLONE       POINTX  878     878     100.00  99.89   878     WP_001281271.1   DNA gyrase subunit A GyrA       NA      NA
#11 129.6 NA    AAPBRJ010000014.1       97144   99075   +       sinH    intimin-like inverse autotransporter SinH       plus    VIRULENCE       VIRULENCE       NA      NA      PARTIALX        644     730     88.22   99.84   644      AAL21411.1      intimin-like inverse autotransporter SinH       NA      NA
#11 129.6 NA    AAPBRJ010000032.1       581     1438    +       blaTEM-57       broad-spectrum class A beta-lactamase TEM-57    core    AMR     AMR     BETA-LACTAM     BETA-LACTAM     ALLELEX 286     286     100.00  100.00  286      WP_032492330.1  broad-spectrum class A beta-lactamase TEM-57    NA      NA
#11 129.6 NA    AAPBRJ010000032.1       4081    5277    -       tet(A)  tetracycline efflux MFS transporter Tet(A)      core    AMR     AMR     TETRACYCLINE    TETRACYCLINE    BLASTX  399     399     100.00  99.75   399     WP_000804064.1   tetracycline efflux MFS transporter Tet(A)      NA      NA
#11 129.6 NA    AAPBRJ010000045.1       1480    2676    +       tet(A)  tetracycline efflux MFS transporter Tet(A)      core    AMR     AMR     TETRACYCLINE    TETRACYCLINE    BLASTX  399     399     100.00  99.75   399     WP_000804064.1   tetracycline efflux MFS transporter Tet(A)      NA      NA
#11 129.6 NA    AAPBRJ010000045.1       5319    6176    -       blaTEM-57       broad-spectrum class A beta-lactamase TEM-57    core    AMR     AMR     BETA-LACTAM     BETA-LACTAM     ALLELEX 286     286     100.00  100.00  286      WP_032492330.1  broad-spectrum class A beta-lactamase TEM-57    NA      NA
#11 DONE 129.6s

#12 [test 4/4] RUN wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/812/925/GCA_003812925.1_ASM381292v1/GCA_003812925.1_ASM381292v1_genomic.fna.gz &&   gzip -d GCA_003812925.1_ASM381292v1_genomic.fna.gz &&   amrfinder --plus --name GCA_003812925.1 -n GCA_003812925.1_ASM381292v1_genomic.fna -O Klebsiella_oxytoca -o GCA_003812925.1-amrfinder.tsv
#12 0.430 --2023-03-09 23:33:24--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/812/925/GCA_003812925.1_ASM381292v1/GCA_003812925.1_ASM381292v1_genomic.fna.gz
#12 0.435 Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 130.14.250.10, 2607:f220:41f:250::229, ...
#12 0.495 Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:443... connected.
#12 0.550 HTTP request sent, awaiting response... 200 OK
#12 0.570 Length: 1741196 (1.7M) [application/x-gzip]
#12 0.571 Saving to: 'GCA_003812925.1_ASM381292v1_genomic.fna.gz'
#12 0.586 
#12 0.586      0K .......... .......... .......... .......... ..........  2% 1.55M 1s
#12 0.602     50K .......... .......... .......... .......... ..........  5% 3.08M 1s
#12 0.618    100K .......... .......... .......... .......... ..........  8% 3.12M 1s
#12 0.634    150K .......... .......... .......... .......... .......... 11%  130M 0s
#12 0.634    200K .......... .......... .......... .......... .......... 14%  200M 0s
#12 0.634    250K .......... .......... .......... .......... .......... 17%  242M 0s
#12 0.635    300K .......... .......... .......... .......... .......... 20%  373K 1s
#12 0.770    350K .......... .......... .......... .......... .......... 23%  154M 1s
#12 0.770    400K .......... .......... .......... .......... .......... 26%  128M 1s
#12 0.770    450K .......... .......... .......... .......... .......... 29%  174M 0s
#12 0.770    500K .......... .......... .......... .......... .......... 32%  137M 0s
#12 0.770    550K .......... .......... .......... .......... .......... 35%  126M 0s
#12 0.770    600K .......... .......... .......... .......... .......... 38% 1.87M 0s
#12 0.796    650K .......... .......... .......... .......... .......... 41%  109M 0s
#12 0.797    700K .......... .......... .......... .......... .......... 44%  135M 0s
#12 0.797    750K .......... .......... .......... .......... .......... 47%  146M 0s
#12 0.797    800K .......... .......... .......... .......... .......... 49% 92.3M 0s
#12 0.798    850K .......... .......... .......... .......... .......... 52%  149M 0s
#12 0.798    900K .......... .......... .......... .......... .......... 55%  143M 0s
#12 0.799    950K .......... .......... .......... .......... .......... 58% 30.9M 0s
#12 0.800   1000K .......... .......... .......... .......... .......... 61% 3.44M 0s
#12 0.815   1050K .......... .......... .......... .......... .......... 64%  105M 0s
#12 0.815   1100K .......... .......... .......... .......... .......... 67%  125M 0s
#12 0.815   1150K .......... .......... .......... .......... .......... 70%  124M 0s
#12 0.816   1200K .......... .......... .......... .......... .......... 73%  118M 0s
#12 0.816   1250K .......... .......... .......... .......... .......... 76% 2.11M 0s
#12 0.839   1300K .......... .......... .......... .......... .......... 79% 34.0M 0s
#12 0.841   1350K .......... .......... .......... .......... .......... 82% 75.3M 0s
#12 0.841   1400K .......... .......... .......... .......... .......... 85%  170M 0s
#12 0.841   1450K .......... .......... .......... .......... .......... 88%  202M 0s
#12 0.842   1500K .......... .......... .......... .......... .......... 91%  225M 0s
#12 0.842   1550K .......... .......... .......... .......... .......... 94%  117M 0s
#12 0.843   1600K .......... .......... .......... .......... .......... 97% 62.8M 0s
#12 0.843   1650K .......... .......... .......... .......... .......... 99%  138M 0s
#12 0.843   1700K                                                       100%  738G=0.3s
#12 0.843 
#12 0.844 2023-03-09 23:33:24 (6.09 MB/s) - 'GCA_003812925.1_ASM381292v1_genomic.fna.gz' saved [1741196/1741196]
#12 0.844 
#12 0.906 Running: amrfinder --plus --name GCA_003812925.1 -n GCA_003812925.1_ASM381292v1_genomic.fna -O Klebsiella_oxytoca -o GCA_003812925.1-amrfinder.tsv
#12 0.906 Software directory: '/amrfinder/'
#12 0.906 Software version: 3.11.2
#12 0.907 Database directory: '/amrfinder/data/2023-02-23.1'
#12 0.907 Database version: 2023-02-23.1
#12 0.907 AMRFinder translated nucleotide and mutation search
#12 1.137 Running tblastn...
#12 246.2 Running blastn...
#12 247.6 Making report...
#12 247.8 AMRFinder took 247 seconds to complete

kapsakcj commented 1 year ago

If you want to see the full Dockerfile, I have it in an open PR here: https://github.com/StaPH-B/docker-builds/pull/631/files

The Dockerfile I'm describing above is ncbi-amrfinderplus/3.11.2-2023-02-23.1/Dockerfile

evolarjun commented 1 year ago

Hi Curtis,

Your approach looks good to me. The AMR_DNA-* files are only used when an organism has point mutations that can only be identified at the DNA level (e.g., 16S or promoter mutations). Here's a shell script I used to rebuild the database before we had amrfinder_index. It takes a slightly different approach from yours but has, I think, the same effect:

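#!/bin/bash
# Rebuild the AMRFinderPlus database indexes; run from inside the database directory.

# HMMER: press the HMM library (-f overwrites any existing index files)
echo "hmmpress -f AMR.LIB"
hmmpress -f AMR.LIB

# BLAST nucleotide databases, one per taxgroup that has an AMR_DNA-* FASTA file
for tg in $(cut -f 1 taxgroup.tab | grep -v '^#')
do
    fasta_dna="AMR_DNA-$tg"
    if [ -e "$fasta_dna" ]
    then
        echo "makeblastdb -in $fasta_dna -dbtype nucl"
        makeblastdb -in "$fasta_dna" -dbtype nucl
    fi
done

# BLAST protein database
echo "makeblastdb -in AMRProt -dbtype prot"
makeblastdb -in AMRProt -dbtype prot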
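
One way to confirm that a rebuild like this worked is to look for the index files the two tools write. The sketch below assumes the usual output extensions (hmmpress: .h3f/.h3i/.h3m/.h3p; makeblastdb: .phr/.pin/.psq for protein and .nhr/.nin/.nsq for nucleotide). The function name check_db_indexed is hypothetical and not part of AMRFinderPlus.

```shell
#!/bin/bash
# Sketch: report index files that appear to be missing from a database directory.
# check_db_indexed is a hypothetical helper, not an AMRFinderPlus command.
check_db_indexed() {
    local db_dir="$1" missing=0 f tg

    # hmmpress writes four index files next to AMR.LIB;
    # makeblastdb -dbtype prot writes at least .phr/.pin/.psq.
    for f in AMR.LIB.h3f AMR.LIB.h3i AMR.LIB.h3m AMR.LIB.h3p \
             AMRProt.phr AMRProt.pin AMRProt.psq
    do
        if [ ! -e "$db_dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done

    # Nucleotide indexes are expected only for taxgroups with an AMR_DNA-* FASTA.
    for tg in $(cut -f 1 "$db_dir/taxgroup.tab" | grep -v '^#')
    do
        if [ -e "$db_dir/AMR_DNA-$tg" ]; then
            for f in nhr nin nsq; do
                if [ ! -e "$db_dir/AMR_DNA-$tg.$f" ]; then
                    echo "missing: AMR_DNA-$tg.$f"
                    missing=1
                fi
            done
        fi
    done

    # Exit status 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

Calling check_db_indexed /amrfinder/data/2023-02-23.1 after the rebuild would print any missing files and return non-zero if there are any. Newer BLAST+ releases write additional files, which this check ignores.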

Also, I quickly skimmed through your Dockerfile (I really only dabble in Docker, so I'm no expert) and it otherwise looks good to me. I noticed you're running your own tests. We also distribute a test script (test_amrfinder.sh), though I'm not sure you'd want to use it because the expected test output can occasionally depend on the database version. That said, your tests look fine, and since you're pinning a database and software version for each directory, you don't have to worry about database updates changing the expected test output.
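
Since each image pins one software version and one database version, a build could also check that amrfinder reports the pair it is supposed to. This is only a sketch: it relies on the "Software version:" and "Database version:" lines that appear in the build log above, and check_pinned_versions is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: compare the versions amrfinder reports with the pinned ones.
# check_pinned_versions is a hypothetical helper; it reads log text on stdin.
check_pinned_versions() {
    local want_sw="$1" want_db="$2" log got_sw got_db
    log=$(cat)
    # Allow a prefix (such as docker's "#12 0.906 ") before each label.
    got_sw=$(printf '%s\n' "$log" | sed -n 's/^.*Software version: *//p' | head -n 1)
    got_db=$(printf '%s\n' "$log" | sed -n 's/^.*Database version: *//p' | head -n 1)
    if [ "$got_sw" != "$want_sw" ] || [ "$got_db" != "$want_db" ]; then
        echo "expected $want_sw / $want_db, got $got_sw / $got_db"
        return 1
    fi
}
```

For example, something like amrfinder -n assembly.fna -o report.tsv 2>&1 | check_pinned_versions 3.11.2 2023-02-23.1 would return non-zero on a mismatch (assembly.fna and report.tsv are placeholder names).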

kapsakcj commented 1 year ago

Great, thanks for taking a look and sharing the code! That test script is useful; it would be nice to incorporate it in the future.

The first tests in our Dockerfile should capture the same behavior: they run amrfinder on the test files, followed by a diff, and the docker image fails to build if the exit code is >0.
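
A minimal sketch of that diff step, assuming an expected report and a freshly generated one already exist on disk (compare_reports and the file names in the usage note are hypothetical, not taken from the Dockerfile):

```shell
#!/bin/bash
# Sketch: return non-zero when a generated report differs from the expected one.
# compare_reports is a hypothetical helper name.
compare_reports() {
    local expected="$1" actual="$2"
    # diff exits 0 when the files match and non-zero otherwise; it also
    # prints the differing lines, which is useful in a build log.
    if ! diff "$expected" "$actual"; then
        echo "FAIL: $actual differs from $expected" >&2
        return 1
    fi
    echo "ok: $actual matches $expected"
}
```

Inside a RUN step, compare_reports expected.tsv actual.tsv would stop the image build on any difference, because a non-zero exit status from a RUN command fails the build.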

Thanks so much!