Closed nick-youngblut closed 1 year ago
Normally this command is enough:
amrfinder -u
Normally this command is enough:
I'm trying to avoid downloading the database for each job, which doesn't scale well if running amrfinder separately on 1000's of genomes.
It also assumes internet access, which is not available on many compute clusters.
There's also the added headache of write permissions for writing files into a codebase directory (e.g., in a Singularity or Docker image):
Running: amrfinder -u
Software directory: '/opt/conda/bin/'
Software version: 3.11.14
Running: /opt/conda/bin/amrfinder_update -d /opt/conda/share/amrfinderplus/data
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
*** ERROR ***
Cannot create directory "/opt/conda/share/amrfinderplus"
... which is why this is generally avoided for software (databases & data files separated from the code).
It seems like bioinformatics tool developers like to keep databases with the code (make it "easy" for users), but it always leads to many issues.
When I use amrfinder --database with the pre-downloaded database, I get the following error:
Running: amrfinder -i 0.1 -c 0.1 --database amr_finder_plus --nucleotide fbi00002.fna --output results.tsv
Software directory: '/opt/conda/bin/'
Software version: 3.11.14
Database directory: '/workspaces/genome_annotation/tmp/work/bf/482ce3ef89f362aecc9b9938813a90/amr_finder_plus'
*** ERROR ***
The BLAST database for AMRProt was not found. Use amrfinder -u to download and prepare database for AMRFinderPlus
AMRProt is in the provided database directory.
Hi @nick-youngblut ,
I'm not sure why you would need to download the database for each job. Could you elaborate on what you're trying to do? You can rebuild the database using amrfinder_index (documentation), but I think Slava is right that running amrfinder -u is usually preferable if you want to check for updates on every run. amrfinder -u will check for a new version and not re-do the download and build unless a newer database version is available.
Could you elaborate on what you're trying to do? The usual process is to only download and update the database when a new release comes out.
will check for a new version and not re-do the download and build unless a newer database version is available.
The reasons, summarized and expanded from above:
-u writes to a location where the user may not have write permissions (e.g., a pre-built image)
We also publish a docker image with the database included, and you can see how it's built if you want to use your own Dockerfile. See https://github.com/ncbi/docker/tree/master/amr.
The BLAST database for AMRProt was not found.
There is probably the file AMRProt, but not AMRProt.pdb etc. created by makeblastdb.
Can you install the AMRFinderPlus database once and run AMRFinderPlus many times with the same database?
There is probably the file AMRProt, but not AMRProt.pdb etc. created by makeblastdb
Thanks @vbrover for the clarification! So the database download instructions should include running makeblastdb after downloading the database? The only mention of this that I can find in the repo is in https://github.com/ncbi/amr/issues/112:
But you need to run makeblastdb and hmmpress and choose the compatible historic software because the latest one may not work with an old database.
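For anyone hitting the same "BLAST database for AMRProt was not found" error, a quick check is whether the index files sit next to the raw files. Below is a minimal Python sketch, not part of AMRFinderPlus. The function name missing_index_files is made up for illustration, and the suffix lists are assumptions based on what makeblastdb (.phr, .pin, .psq for a protein database) and hmmpress (.h3f, .h3i, .h3m, .h3p) normally write. It only looks at AMRProt and AMR.LIB, not the nucleotide databases, and newer BLAST versions write additional files such as AMRProt.pdb that are not checked here.

```python
from pathlib import Path

# Assumed suffixes: the core files makeblastdb writes for a protein database
# and the four files hmmpress writes for an HMM library.
BLAST_PROTEIN_SUFFIXES = (".phr", ".pin", ".psq")
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_index_files(db_dir):
    """Return the names of expected index files that are absent from db_dir.

    Hypothetical helper: it checks only AMRProt and AMR.LIB, so an empty
    result does not prove the whole database is usable.
    """
    db = Path(db_dir)
    expected = [db / ("AMRProt" + suffix) for suffix in BLAST_PROTEIN_SUFFIXES]
    expected += [db / ("AMR.LIB" + suffix) for suffix in HMMPRESS_SUFFIXES]
    return [path.name for path in expected if not path.is_file()]
```

If the returned list is non-empty, the directory most likely holds only the raw download and still needs indexing.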
Hi @nick-youngblut,
Just to clarify, we don't run amrfinder -u, and we don't expect anyone else to, except every few months when we release new database versions. Some people like to check for updates every time they build a new container or just for peace of mind, so we provide that functionality.
I agree with your reasons about why not to update the database every time you run the software, and would add that we also don't want that load on our servers, so we appreciate your not building pipelines and containers that download the database every time they are run.
Arjun
so we appreciate your not building pipelines and containers that download the database every time they are run.
I'm guessing that most pipeline developers just use:
amrfinder -u
amrfinder --nucleotide genome.fna --output results.tsv
...since that method is currently the easiest approach.
In fact, that is how my team has been doing it for quite a while now. I'm trying to update the code so that the database is not downloaded/updated for each genome during every run of our pipeline.
Adding a --download-dir option would be quite helpful: the user runs amrfinder --download-dir to download the database to a location of the user's choosing and to run makeblastdb & hmmpress (with the correct software versions) to fully generate the database.
For running makeblastdb on the database, it appears that makeblastdb must be run multiple times:
stderr << "Indexing" << "\n";
exec (fullProg ("hmmpress") + " -f " + shellQuote (dbDir + "AMR.LIB") + " > /dev/null 2> " + tmp + "/hmmpress.err", tmp + "/hmmpress.err");
setSymlink (dbDir, tmp + "/db", true);
exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMRProt" + " -dbtype prot -logfile " + tmp + "/makeblastdb.AMRProt", tmp + "/makeblastdb.AMRProt");
exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMR_CDS" + " -dbtype nucl -logfile " + tmp + "/makeblastdb.AMR_CDS", tmp + "/makeblastdb.AMR_CDS");
for (const string& dnaPointMut : dnaPointMuts)
exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMR_DNA-" + dnaPointMut + " -dbtype nucl -logfile " + tmp + "/makeblastdb.AMR_DNA-" + dnaPointMut, tmp + "/makeblastdb.AMR_DNA-" + dnaPointMut);
}
Is this correct?
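Read as plain command lines, the excerpt above amounts to one hmmpress call plus one makeblastdb call per database file. Here is a minimal Python sketch of that reading, using only the flags visible in the excerpt (-f, -in, -dbtype). The function name index_commands and the group name "Example_taxon" are hypothetical, the real list of point-mutation groups is not shown in the excerpt, and the sketch leaves out the temporary symlink and log files the C++ code uses. It only builds the commands and does not run them.

```python
def index_commands(db_dir, dna_point_mut_groups):
    """Build (but do not run) the indexing commands implied by the excerpt.

    db_dir: directory holding AMR.LIB, AMRProt, AMR_CDS and AMR_DNA-* files.
    dna_point_mut_groups: names that follow "AMR_DNA-" (supplied by the caller,
    since the excerpt does not show where the list comes from).
    """
    db = db_dir.rstrip("/")
    commands = [
        # Press the HMM library, overwriting any existing index (-f).
        ["hmmpress", "-f", f"{db}/AMR.LIB"],
        # One protein BLAST database.
        ["makeblastdb", "-in", f"{db}/AMRProt", "-dbtype", "prot"],
        # One nucleotide BLAST database for the CDS sequences.
        ["makeblastdb", "-in", f"{db}/AMR_CDS", "-dbtype", "nucl"],
    ]
    # One nucleotide BLAST database per point-mutation group.
    for group in dna_point_mut_groups:
        commands.append(
            ["makeblastdb", "-in", f"{db}/AMR_DNA-{group}", "-dbtype", "nucl"]
        )
    return commands
```

For example, index_commands("amr_finder_plus", ["Example_taxon"]) yields four commands, three of them makeblastdb calls, which is the "multiple times" in question.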
Hi @nick-youngblut,
Thanks for updating your pipelines to not abuse our servers. See the documentation for amrfinder_index and/or amrfinder_update for the functionality you're looking for.
Arjun
So the database download instructions should include running makeblastdb after downloading the database?
No, the database download instructions should not include running makeblastdb after downloading the database, because makeblastdb is already run when the database is downloaded by
amrfinder -u
or
mkdir abc
amrfinder_update -d abc
Thanks @vbrover and @evolarjun for all of your help!
I'm getting the following error when running amrfinder_update:
*** ERROR ***
CURL: Cannot read
from https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
code=6
error: Could not resolve host: ftp.ncbi.nlm.nih.gov
version: 7.88.1
Update:
The network connection error seems to be intermittent. I'm also getting the following error:
Running: amrfinder_update --force_update --database ./2023-04-17.1
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
*** ERROR ***
Cannot create the root directory
HOSTNAME: cad5799d0e69
SHELL: ?
PWD: /workspaces/genome_annotation/tmp/work/5a/fa57babb05c17451a034eca10caae1
PATH: /opt/conda/bin:/opt/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/workspaces/genome_annotation/bin
Progam name: amrfinder_update
Command line: amrfinder_update --force_update --database ./2023-04-17.1
It appears that amrfinder_update is not just trying to write the database to my current working directory.
So, is this the first time you are running amrfinder -u?
What is the result of this command?
ls -laF ./2023-04-17.1
So, is this the first time you are running amrfinder -u?
@vbrover I have not set up my docker image to allow users of the image to write to the code install location; hence, the error:
*** ERROR ***
Cannot create directory "/opt/conda/share/amrfinderplus"
I can alter the permissions for /opt/conda/share/ in my Dockerfile, but I'm trying to avoid this approach, since one shouldn't need to write files (e.g., database files) within a docker image.
I personally don't work with docker or conda; I only run amrfinder -u or amrfinder_update to get the database.
So, do you want to make amrfinder_update work?
I personally don't work with docker or conda; I only run amrfinder -u or amrfinder_update to get the database.
I highly recommend using them, since they can help tremendously with creating reproducible developer & CI testing environments (e.g., VS Code devcontainers or GitHub Actions).
So, do you want to make amrfinder_update work?
Yes. Will amrfinder_update just write to a given output directory, or must it write (something) to the amrfinder install location?
Will amrfinder_update just write to a given output directory,
Yes. If it exists.
Yes. If it exists.
I don't code in C much, but isn't there an equivalent of os.makedirs('my_new_directory', exist_ok=True) for C? That way, the user doesn't get an odd error message if the new db directory doesn't exist.
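Until a release handles this, a pipeline can create the directory itself before calling amrfinder_update. A minimal Python sketch of that workaround follows. The helper name ensure_db_root is hypothetical, and the writability check is an addition of mine to surface permission problems early (like the "Cannot create directory" error above) rather than something amrfinder_update does.

```python
import os


def ensure_db_root(path):
    """Create the database root if needed and confirm it is writable.

    Hypothetical helper: call it before running `amrfinder_update -d <path>`.
    Returns the absolute path so it can be passed on to later steps.
    """
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Database directory is not writable: {path}")
    return os.path.abspath(path)
```

Calling it twice on the same path is harmless, so it can run at the start of every pipeline invocation.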
I have just made amrfinder_update print a better message. That will be in amrfinder ver. 3.11.15.
$ amrfinder_update -d abc
Running: amrfinder_update -d abc
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
Looking for the target directory abc/2023-04-17.1/
*** ERROR ***
Cannot create the directory abc
HOSTNAME: iebdev21
SHELL: /bin/bash
PWD: /home/brovervv/work/AMR/AMRFinder
PATH: /opt/python-all/bin:/opt/fcron/bin:/net/snowman/vol/projects/trace_software/vdb/linux/release/x86_64/bin:/usr/local/valgrind/3.20/bin:/usr/local/uclust/1.2.22/bin:/netmnt/gridengine/current/bin/lx-amd64:/usr/local/samtools/1.14/bin:/usr/local/rdp_classifier/2.10.1/bin:/usr/local/RAxML/7.7.2/bin:/usr/local/phylip/3.69/exe:/usr/local/phylip/3.69/bin:/usr/local/paup/4.10/bin:/usr/local/muscle/3.8.31/bin:/usr/local/infernal/1.1.2/bin:/usr/local/hmmer/3.3.2/bin:/usr/local/gmes/4.39/bin:/usr/local/perl/5.16.3/bin:/opt/perl/5.16.3/bin:/usr/local/subversion/1.10.6/bin:/usr/local/svnmucc/1.5.7/bin:/usr/local/ninja/1.10.2/bin:/usr/local/nedit/5.5/bin:/netopt/ncbi_tools64/bin:/am/ncbiapdata/bin:/usr/local/joe/3.7/bin:/usr/local/git/2.38.3/bin:/opt/ncbi/gcc/7.3.0/bin:/usr/local/ddd/3.3.12/bin:/usr/local/ctags/5.8/bin:/usr/local/cmake/3.21.2/bin:/usr/local/clustalx/1.83/bin:/usr/local/bwa/0.7.17/bin:/usr/local/Mash/1.0.2/bin:/usr/local/capnproto/0.5.3/bin:/usr/local/Bandage/0.8.1/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/sybase/clients/current/bin:/opt/sybase/utils/bin:/opt/puppetlabs/bin:/opt/dell/srvadmin/bin:/netopt/genbank/subtool/bin:.:/home/brovervv/code:/home/brovervv/code/cpp:/home/brovervv/code/cpp/tsv:/home/brovervv/code/cpp/dm:/home/brovervv/code/cpp/dm/conversion:/home/brovervv/code/cpp/phylogeny:/home/brovervv/code/cpp/xml:/home/brovervv/code/cpp/genetics:/home/brovervv/code/cpp/dissim:/home/brovervv/code/genetics:/home/brovervv/code/amrfinder:/home/brovervv/code/database:/home/brovervv/code/LINX:/home/brovervv/code/mongodb
Progam name: amrfinder_update
Command line: amrfinder_update -d abc
$ mkdir abc
$ amrfinder_update -d abc
Running: amrfinder_update -d abc
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
Looking for the target directory abc/2023-04-17.1/
Downloading AMRFinder database version 2023-04-17.1 into 'abc/2023-04-17.1/'
Running: /home/brovervv/code/amrfinder/amrfinder_index abc/2023-04-17.1/
Indexing
$ ls -laF abc
total 24
drwxrwxr-x 3 brovervv pathogen 4096 May 23 16:54 ./
drwxrwxr-x 12 brovervv genomes 8192 May 23 16:54 ../
drwxrwxr-x 2 brovervv pathogen 12288 May 23 16:54 2023-04-17.1/
lrwxrwxrwx 1 brovervv pathogen 12 May 23 16:54 latest -> 2023-04-17.1/
I have just made amrfinder_update print a better message.
That's great! Thanks @vbrover for your help! I've now updated our pipeline to use the database downloaded via amrfinder_update.
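For others making the same change, the pattern is one database download followed by many runs that point at it. Here is a minimal Python sketch that only builds the command lines, using the flags that appear earlier in this thread (-d, --database, --nucleotide, --output). The function names and the output naming scheme are hypothetical, and pointing --database at the latest symlink is an assumption based on the directory listing above.

```python
import os


def update_command(db_root):
    """Command that downloads and indexes the database into db_root (run once)."""
    return ["amrfinder_update", "-d", db_root]


def run_command(db_root, genome_fasta, output_tsv):
    """Command that analyses one genome against the already-downloaded database."""
    # Assumption: amrfinder_update leaves a `latest` symlink inside db_root.
    db_dir = f"{db_root.rstrip('/')}/latest"
    return ["amrfinder", "--database", db_dir,
            "--nucleotide", genome_fasta, "--output", output_tsv]


def plan(db_root, genomes):
    """One update command followed by one amrfinder command per genome."""
    commands = [update_command(db_root)]
    for genome in genomes:
        # Hypothetical naming scheme: <genome file stem>.tsv
        stem = os.path.splitext(os.path.basename(genome))[0]
        commands.append(run_command(db_root, genome, stem + ".tsv"))
    return commands
```

Each returned list can be handed to subprocess.run or written into a workflow step; the point is that the update command appears once, no matter how many genomes follow.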
I have made a change: amrfinder_update -d abc will create and populate the directory abc if it does not exist (in ver. 3.11.15).
I think all of the issues have been resolved here. Please let us know and/or reopen if we're missing something.
The wiki states:
However, there don't seem to be any instructions on actually downloading the database via the command line (e.g., with wget). Downloading the database is non-trivial, especially given the limitation on bots: