soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs createtaxdb unexpectedly killed #811

Open charlesfoster opened 6 months ago

charlesfoster commented 6 months ago

I'm trying to use a clustered version of the NR database for taxonomy assignment but am running into some issues. Any assistance would be appreciated.

Expected Behavior

When running mmseqs createtaxdb db_name tmp --tax-mapping-file taxid.map, I would expect to successfully create a seqTaxDB as per here.

Current Behavior

The job begins but is unexpectedly killed (see mmseqs output section below).

MMseqs Output (for bugs)

cfos@pop-os:/data/clustered_nr$ mmseqs createtaxdb nr_rep_seq_db tmp --tax-mapping-file '/data/clustered_nr/nr_rep_seq_to_taxid.map' -v 3
Create directory tmp
createtaxdb nr_rep_seq_db tmp --tax-mapping-file /data/clustered_nr/nr_rep_seq_to_taxid.map -v 3 

MMseqs Version:         2fad714b525f1975b62c2d2b5aff28274ad57466
NCBI tax dump directory 
Taxonomy mapping file   /data/clustered_nr/nr_rep_seq_to_taxid.map
Taxonomy mapping mode   0
Taxonomy db mode        1
Threads                 20
Verbosity               3

Download taxdump.tar.gz

02/01 11:29:59 [NOTICE] Downloading 1 item(s)
[#b8b044 0B/0B CN:1 DL:0B]
02/01 11:30:01 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
[#b8b044 51MiB/61MiB(84%) CN:1 DL:10MiB]
02/01 11:30:08 [NOTICE] Download complete: tmp/taxdump.tar.gz

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
b8b044|OK  |   9.1MiB/s|tmp/taxdump.tar.gz

Status Legend:
(OK):download completed.
Loading nodes file ... Done, got 2550743 nodes
Loading merged file ... Done, added 75930 merged nodes.
Loading names file ... Done
Init RMQ ...Done
Killed

Context

I want to search some query sequences locally against a clustered version of the NR database. I downloaded the clustered database and taxonomy files (nr_cluster_taxid_formatted_final.tsv.gz, nr_rep_seq.fasta.gz) from here, which was created as per: https://research.arcadiascience.com/pub/resource-nr-clustering/release/3. I then processed the files like so:

gunzip -c nr_cluster_taxid_formatted_final.tsv.gz | cut -f1,2 > nr_rep_seq_to_taxid.map
mmseqs createdb nr_rep_seq.fasta.gz nr_rep_seq_db

After these completed successfully, I tried to create the taxdb as per the above:

mmseqs createtaxdb nr_rep_seq_db tmp --tax-mapping-file '/data/clustered_nr/nr_rep_seq_to_taxid.map' -v 3

But the job was killed.
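
Since the mapping file is derived with cut from a third-party TSV, a quick sanity check of its layout may help rule out input problems before looking at memory. The following is a minimal sketch, assuming the mapping file is meant to hold one sequence identifier and one numeric taxid per line, separated by a tab; the helper name count_bad_map_lines is hypothetical and not part of MMseqs2.

```shell
#!/bin/sh
# Hypothetical helper (not an MMseqs2 command): count lines that are NOT
# "<identifier><TAB><numeric taxid>". Assumes a two-column, tab-separated file.
count_bad_map_lines() {
    awk -F '\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Example usage (path taken from the commands above):
# count_bad_map_lines /data/clustered_nr/nr_rep_seq_to_taxid.map
```

A result of 0 would suggest that every line has the assumed shape. A header line, if the TSV has one, is counted as malformed by this check.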

charlesfoster commented 1 month ago

Hello again,

I've been revisiting mmseqs for taxonomic assignment, and unwittingly ran into this problem again before finding my own GitHub issue (the circle of life!). Is there any advice by now on creating a taxdb when RAM is limited? I'm working with a pre-clustered version of the NR database that is not currently available directly through mmseqs databases.

After the standard createdb command, I ran the following:

mmseqs createtaxdb nr_clustered_mmseqs ~/TMP  --ncbi-tax-dump ~/.taxonkit/ --tax-mapping-file /data/clustered_nr/clustered_nr_taxmapping.tsv

I get output as per the OP in this issue, until the process dies with:

[truncated]
Loading names file ... Done
Init RMQ ...Done
Killed

The most likely cause is that RAM was exhausted: the process ended with exit status 137, i.e. 128 + 9, which means it was terminated by SIGKILL. My workstation has 64 GB of RAM, and access to a server with more RAM for creating this database is unlikely to be feasible.
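
For anyone wanting to confirm this kind of diagnosis, here is a minimal sketch of how an exit status can be mapped back to a signal number. It assumes the usual shell convention that a status above 128 means "terminated by signal (status - 128)"; the helper name signal_from_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: translate a shell exit status into the number of the
# signal that terminated the process (128 + N convention). Prints 0 when the
# status does not indicate death by signal.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0
    fi
}

# Example usage right after the failing command:
# mmseqs createtaxdb ... ; signal_from_status $?
#
# The kernel log may also name the killed process, if the out-of-memory
# killer was responsible (may require elevated privileges):
# dmesg | grep -i 'out of memory'
```

A SIGKILL on its own does not prove that memory ran out, so checking the kernel log is the more direct evidence.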

Thanks

P.S. In case you've missed it, I would also like to point out that the automated download of the NR/NT FASTA files from NCBI via mmseqs databases might stop working as intended. As noted at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/README.txt:

In April 2024, the BLAST FASTA files in this directory will no longer be
available. You can easily generate FASTA files yourself from the formatted
BLAST databases by using the BLAST utility blastdbcmd that comes with the
standalone BLAST programs. See NCBI Insights for more details
https://ncbiinsights.ncbi.nlm.nih.gov/2024/01/25/blast-fasta-unavailable-on-ftp/
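
The route the README describes could look roughly like the sketch below. It assumes the standalone BLAST+ programs are installed and that a formatted BLAST database (here called nr) has already been downloaded; the helper names run and blastdb_to_fasta are hypothetical, and DRY_RUN is an illustrative variable that only prints the command instead of executing it.

```shell
#!/bin/sh
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: dump every sequence of a formatted BLAST database
# to a FASTA file using blastdbcmd.
# $1 = BLAST database name or path prefix, $2 = output FASTA file.
blastdb_to_fasta() {
    run blastdbcmd -db "$1" -entry all -outfmt %f -out "$2"
}

# Example usage (assumes the formatted "nr" database is available locally):
# blastdb_to_fasta nr nr.fasta
# mmseqs createdb nr.fasta nr_db
```

Setting DRY_RUN=1 first makes it possible to inspect the exact blastdbcmd call before committing to what is likely a long-running extraction.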