ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

instructions on database download via command line #124

Closed nick-youngblut closed 11 months ago

nick-youngblut commented 1 year ago

The wiki states:

The most recent database release can be found in https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest and a log of changes with each release is available in the changes.txt file. Note that this database is compiled as part of the National Database of Antibiotic Resistant Organisms (NDARO) and more user-friendly access to the data is available at https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/. The rest of this document describes the format and structure of the database as it is used by AMRFinderPlus.

However, there don't seem to be any instructions on actually downloading the database via the command line (e.g., with wget). Downloading the database is non-trivial, especially given the limitation on bots:

wget -e robots=off -r -np -nH --cut-dirs=6 -R index.html \
https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/
vbrover commented 1 year ago

Normally this command is enough:

amrfinder -u
nick-youngblut commented 1 year ago

Normally this command is enough:

I'm trying to avoid downloading the database for each job, which doesn't scale well when running amrfinder separately on thousands of genomes.

It also assumes internet access, which is not available on many compute clusters.

There's also the added headache of write permissions for writing files into a codebase directory (e.g., in a Singularity or Docker image):

Running: amrfinder -u
Software directory: '/opt/conda/bin/'
Software version: 3.11.14
Running: /opt/conda/bin/amrfinder_update -d /opt/conda/share/amrfinderplus/data
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/

*** ERROR ***
Cannot create directory "/opt/conda/share/amrfinderplus"

... which is why writing into the install location is generally avoided for software (databases and data files are kept separate from the code).

It seems like bioinformatics tool developers like to keep databases with the code (make it "easy" for users), but it always leads to many issues.

nick-youngblut commented 1 year ago

When I use amrfinder --database with the pre-downloaded database, I'm getting the following error:

Running: amrfinder -i 0.1 -c 0.1 --database amr_finder_plus --nucleotide fbi00002.fna --output results.tsv
Software directory: '/opt/conda/bin/'
Software version: 3.11.14
Database directory: '/workspaces/genome_annotation/tmp/work/bf/482ce3ef89f362aecc9b9938813a90/amr_finder_plus'

*** ERROR ***
The BLAST database for AMRProt was not found. Use amrfinder -u to download and prepare database for AMRFinderPlus

AMRProt is in the provided database directory

evolarjun commented 1 year ago

Hi @nick-youngblut ,

I'm not sure why you would need to download the database for each job. Could you elaborate on what you're trying to do? You can rebuild the database using amrfinder_index (documentation), but I think Slava is right that running amrfinder -u is usually preferable if you want to check for updates on every run. amrfinder -u will check for a new version and not re-do the download and build unless a newer database version is available.

The usual process is to only download and update the database when a new release comes out.

nick-youngblut commented 1 year ago

will check for a new version and not re-do the download and build unless a newer database version is available.

The reasons, summarized and expanded from above:

evolarjun commented 1 year ago

We also publish a docker image with the database included, and you can see how it's built if you want to use your own Dockerfile. See https://github.com/ncbi/docker/tree/master/amr.

vbrover commented 1 year ago

The BLAST database for AMRProt was not found.

There is probably the file AMRProt, but not AMRProt.pdb etc. created by makeblastdb.
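A quick way to confirm this diagnosis is to check the database directory for the index files next to the AMRProt FASTA file. The sketch below is only an illustration: the function name is hypothetical, and it checks only for AMRProt.pdb because that is the one file named above. makeblastdb writes several other files as well, and the exact set depends on the BLAST version.

```python
from pathlib import Path


def missing_index_files(db_dir, base="AMRProt", extensions=(".pdb",)):
    """Return the expected index files that are absent from db_dir.

    The default extension list is an assumption: it contains only the
    ".pdb" file mentioned in this thread, not everything makeblastdb writes.
    """
    db_dir = Path(db_dir)
    return [
        base + ext
        for ext in extensions
        if not (db_dir / (base + ext)).is_file()
    ]
```

An empty list means the named index file is present; a non-empty list suggests the directory holds only the raw download and still needs indexing.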

vbrover commented 1 year ago

Can you install the AMRFinderPlus database once and run AMRFinderPlus many times with the same database?

nick-youngblut commented 1 year ago

There is probably the file AMRProt, but not AMRProt.pdb etc. created by makeblastdb

Thanks @vbrover for the clarification! So the database download instructions should include running makeblastdb after downloading the database? The only mention of this that I can find in the repo is in https://github.com/ncbi/amr/issues/112:

But you need to run makeblastdb and hmmpress and choose the compatible historic software because the latest one may not work with an old database.

evolarjun commented 1 year ago

Hi @nick-youngblut,

Just to clarify, we don't run amrfinder -u, and we don't expect anyone else to, except every few months when we release new database versions. Some people like to check for updates every time they build a new container or just for peace of mind, so we provide that functionality.

I agree with your reasons about why not to update the database every time you run the software, and would add that we also don't want that load on our servers, so we appreciate your not building pipelines and containers that download the database every time they are run.

Arjun

nick-youngblut commented 1 year ago

so we appreciate your not building pipelines and containers that download the database every time they are run.

I'm guessing that most pipeline developers just use:

amrfinder -u
amrfinder --nucleotide genome.fna --output results.tsv

...since that method is currently the easiest approach.

In fact, that is how my team has been doing it for quite a while now. I'm trying to update the code so that the database is not downloaded/updated for each genome during every run of our pipeline.
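The change described here (prepare the database once, then point every job at it) could be sketched roughly as follows. This only builds the command lines and does not run anything. The flags --database, --nucleotide and --output are the ones already shown in this thread; the function name and the output naming scheme are hypothetical.

```python
from pathlib import Path


def amrfinder_commands(database_dir, genomes, out_dir):
    """Build one amrfinder command per genome, all sharing one database.

    No command includes -u, so nothing is downloaded per genome.
    """
    commands = []
    for genome in genomes:
        genome = Path(genome)
        # Hypothetical naming scheme: <genome stem>.tsv inside out_dir.
        output = Path(out_dir) / (genome.stem + ".tsv")
        commands.append([
            "amrfinder",
            "--database", str(database_dir),
            "--nucleotide", str(genome),
            "--output", str(output),
        ])
    return commands
```

Each returned list can be handed to subprocess.run or written into a job script; the database directory is assumed to have been downloaded and indexed beforehand.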

nick-youngblut commented 1 year ago

Adding a --download-dir option would be quite helpful: the user runs amrfinder --download-dir to download the database to a location of their choosing and to run makeblastdb & hmmpress (with the correct software versions), fully generating the database.

nick-youngblut commented 1 year ago

For running makeblastdb on the database, it appears that makeblastdb must be run multiple times:

    stderr << "Indexing" << "\n";
    exec (fullProg ("hmmpress") + " -f " + shellQuote (dbDir + "AMR.LIB") + " > /dev/null 2> " + tmp + "/hmmpress.err", tmp + "/hmmpress.err");
    setSymlink (dbDir, tmp + "/db", true);
    exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMRProt" + "  -dbtype prot  -logfile " + tmp + "/makeblastdb.AMRProt", tmp + "/makeblastdb.AMRProt");
    exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMR_CDS" + "  -dbtype nucl  -logfile " + tmp + "/makeblastdb.AMR_CDS", tmp + "/makeblastdb.AMR_CDS");
    for (const string& dnaPointMut : dnaPointMuts)
      exec (fullProg ("makeblastdb") + " -in " + tmp + "/db/AMR_DNA-" + dnaPointMut + "  -dbtype nucl  -logfile " + tmp + "/makeblastdb.AMR_DNA-" + dnaPointMut, tmp + "/makeblastdb.AMR_DNA-" + dnaPointMut);
  }

Is this correct?
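Read as a plain list of commands, the excerpt above amounts to one hmmpress call, two fixed makeblastdb calls, and one further makeblastdb call per DNA point-mutation file. The sketch below restates that reading and nothing more: it omits the temporary symlink and the log files, the function name is hypothetical, and the point-mutation names are taken as a parameter because the excerpt does not list them.

```python
def indexing_commands(db_dir, dna_point_mut_names):
    """Return the indexing commands implied by the excerpt, as argv lists.

    Assumption: this mirrors only the quoted source lines; it is not a
    replacement for the tool's own indexing program.
    """
    db = db_dir.rstrip("/") + "/"
    commands = [
        ["hmmpress", "-f", db + "AMR.LIB"],
        ["makeblastdb", "-in", db + "AMRProt", "-dbtype", "prot"],
        ["makeblastdb", "-in", db + "AMR_CDS", "-dbtype", "nucl"],
    ]
    # One nucleotide database per DNA point-mutation file.
    for name in dna_point_mut_names:
        commands.append(
            ["makeblastdb", "-in", db + "AMR_DNA-" + name, "-dbtype", "nucl"]
        )
    return commands
```

So makeblastdb is indeed invoked several times, once for each sequence file in the database directory.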

evolarjun commented 1 year ago

Hi @nick-youngblut,

Thanks for updating your pipelines to not abuse our servers. See the documentation for amrfinder_index and/or amrfinder_update for the functionality you're looking for.

Arjun

vbrover commented 1 year ago

So the database download instructions should include running makeblastdb after downloading the database?

No, the database download instructions should not include running makeblastdb after downloading the database, because makeblastdb is already run as part of the download performed by

amrfinder -u

or

mkdir abc
amrfinder_update -d abc
nick-youngblut commented 1 year ago

Thanks @vbrover and @evolarjun for all of your help!

I'm getting the following error when running amrfinder_update:

*** ERROR ***
CURL: Cannot read
  from https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
  code=6
  error: Could not resolve host: ftp.ncbi.nlm.nih.gov
  version: 7.88.1

Update:

The network connection error seems to be intermittent. I'm also getting the following error:

Running: amrfinder_update --force_update --database ./2023-04-17.1
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/

*** ERROR ***
Cannot create the root directory

HOSTNAME: cad5799d0e69
SHELL: ?
PWD: /workspaces/genome_annotation/tmp/work/5a/fa57babb05c17451a034eca10caae1
PATH: /opt/conda/bin:/opt/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/workspaces/genome_annotation/bin
Progam name:  amrfinder_update
Command line: amrfinder_update --force_update --database ./2023-04-17.1

It appears that amrfinder_update is not just trying to write the database to my current working directory.

vbrover commented 1 year ago

So, is this the first time you are running amrfinder -u?

What is the result of this command?

ls -laF ./2023-04-17.1
nick-youngblut commented 1 year ago

So, is this the first time you are running amrfinder -u?

@vbrover I have not set up my docker image to allow users of the image to write to the code install location; hence, the error:

*** ERROR ***
Cannot create directory "/opt/conda/share/amrfinderplus"

I can alter the permissions for /opt/conda/share/ in my Dockerfile, but I'm trying to avoid this approach, since one shouldn't need to write files (e.g., database files) within a Docker image.

vbrover commented 1 year ago

I personally don't work with docker or conda; I only run amrfinder -u or amrfinder_update to get the database.

So, do you want to make amrfinder_update work?

nick-youngblut commented 1 year ago

I personally don't work with docker or conda; I only run amrfinder -u or amrfinder_update to get the database.

I highly recommend using them, since they can help tremendously with creating reproducible developer & CI testing environments (e.g., VS Code devcontainers or GitHub actions).

So, do you want to make amrfinder_update work?

Yes. Will amrfinder_update just write to a given output directory, or must it write (something) to the amrfinder install location?

vbrover commented 1 year ago

Will amrfinder_update just write to a given output directory,

Yes. If it exists.

nick-youngblut commented 1 year ago

Yes. If it exists.

I don't code in C much, but isn't there an equivalent of os.makedirs('my_new_directory', exist_ok=True) for C? That way, the user doesn't get an odd error message if the new db directory doesn't exist.
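For reference, this is the Python behaviour being described, as a small self-contained sketch. The helper name ensure_directory is hypothetical. On the C++ side, std::filesystem::create_directories (C++17) behaves similarly, in that it does not fail when the directory already exists.

```python
import os
import tempfile


def ensure_directory(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# Calling it twice on the same path is safe: the second call is a no-op.
with tempfile.TemporaryDirectory() as parent:
    target = os.path.join(parent, "my_new_directory")
    first = ensure_directory(target)
    second = ensure_directory(target)
```

Note that exist_ok=True only tolerates an existing directory; if the path exists as a regular file, os.makedirs still raises FileExistsError.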

vbrover commented 1 year ago

I have just made amrfinder_update print a better message. That will be in amrfinder ver. 3.11.15.

$ amrfinder_update -d abc 
Running: amrfinder_update -d abc
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
Looking for the target directory abc/2023-04-17.1/

*** ERROR ***
Cannot create the directory abc

HOSTNAME: iebdev21
SHELL: /bin/bash
PWD: /home/brovervv/work/AMR/AMRFinder
PATH: /opt/python-all/bin:/opt/fcron/bin:/net/snowman/vol/projects/trace_software/vdb/linux/release/x86_64/bin:/usr/local/valgrind/3.20/bin:/usr/local/uclust/1.2.22/bin:/netmnt/gridengine/current/bin/lx-amd64:/usr/local/samtools/1.14/bin:/usr/local/rdp_classifier/2.10.1/bin:/usr/local/RAxML/7.7.2/bin:/usr/local/phylip/3.69/exe:/usr/local/phylip/3.69/bin:/usr/local/paup/4.10/bin:/usr/local/muscle/3.8.31/bin:/usr/local/infernal/1.1.2/bin:/usr/local/hmmer/3.3.2/bin:/usr/local/gmes/4.39/bin:/usr/local/perl/5.16.3/bin:/opt/perl/5.16.3/bin:/usr/local/subversion/1.10.6/bin:/usr/local/svnmucc/1.5.7/bin:/usr/local/ninja/1.10.2/bin:/usr/local/nedit/5.5/bin:/netopt/ncbi_tools64/bin:/am/ncbiapdata/bin:/usr/local/joe/3.7/bin:/usr/local/git/2.38.3/bin:/opt/ncbi/gcc/7.3.0/bin:/usr/local/ddd/3.3.12/bin:/usr/local/ctags/5.8/bin:/usr/local/cmake/3.21.2/bin:/usr/local/clustalx/1.83/bin:/usr/local/bwa/0.7.17/bin:/usr/local/Mash/1.0.2/bin:/usr/local/capnproto/0.5.3/bin:/usr/local/Bandage/0.8.1/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/sybase/clients/current/bin:/opt/sybase/utils/bin:/opt/puppetlabs/bin:/opt/dell/srvadmin/bin:/netopt/genbank/subtool/bin:.:/home/brovervv/code:/home/brovervv/code/cpp:/home/brovervv/code/cpp/tsv:/home/brovervv/code/cpp/dm:/home/brovervv/code/cpp/dm/conversion:/home/brovervv/code/cpp/phylogeny:/home/brovervv/code/cpp/xml:/home/brovervv/code/cpp/genetics:/home/brovervv/code/cpp/dissim:/home/brovervv/code/genetics:/home/brovervv/code/amrfinder:/home/brovervv/code/database:/home/brovervv/code/LINX:/home/brovervv/code/mongodb
Progam name:  amrfinder_update
Command line: amrfinder_update -d abc

$ mkdir abc

$ amrfinder_update -d abc
Running: amrfinder_update -d abc
Looking up the published databases at https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/
Looking for the target directory abc/2023-04-17.1/
Downloading AMRFinder database version 2023-04-17.1 into 'abc/2023-04-17.1/'
Running: /home/brovervv/code/amrfinder/amrfinder_index abc/2023-04-17.1/
Indexing

$ ls -laF abc
total 24
drwxrwxr-x  3 brovervv pathogen  4096 May 23 16:54 ./
drwxrwxr-x 12 brovervv genomes   8192 May 23 16:54 ../
drwxrwxr-x  2 brovervv pathogen 12288 May 23 16:54 2023-04-17.1/
lrwxrwxrwx  1 brovervv pathogen    12 May 23 16:54 latest -> 2023-04-17.1/
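Given the layout in this listing (a versioned directory plus a latest symlink pointing at it), a pipeline could resolve the concrete database directory once and pass that path to every job. A minimal sketch, assuming that layout; the function name is hypothetical.

```python
from pathlib import Path


def resolve_latest(database_root):
    """Return the versioned directory that <database_root>/latest points to."""
    latest = Path(database_root) / "latest"
    if not latest.exists():
        raise FileNotFoundError(f"No 'latest' entry under {database_root}")
    # resolve() follows the symlink, so the result ends in the version name.
    return latest.resolve()
```

Recording the resolved path (rather than latest) in pipeline logs also makes it clear which database version produced a given set of results.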
nick-youngblut commented 1 year ago

I have just made amrfinder_update print a better message.

That's great! Thanks @vbrover for your help! I've now updated our pipeline to use the database downloaded via amrfinder_update

vbrover commented 1 year ago

I have made a change: amrfinder_update -d abc will create and populate the directory abc if it does not exist (in ver. 3.11.15).

evolarjun commented 11 months ago

I think all of the issues have been resolved here. Please let us know and/or reopen if we're missing something.