soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs taxonomy: reducing memory #147

Closed by nick-youngblut 5 years ago

nick-youngblut commented 5 years ago

I'm using MMseqs2 7.4e23d (bioconda build h21aa3a5_1) to taxonomically classify a set of ~4 million representative AA sequences (generated by plass, clustered with linclust, then taking one representative per cluster), with uniclust90_2018_08 as the taxonomy db. The command is:

mmseqs taxonomy --threads 24 -e 1e-5 --start-sens 1 -s 6 --sens-steps 3 --lca-ranks "phylum:superphylum:subkingdom:kingdom:superkingdom" {seqDB} {taxDB} {outDB} {tmp_dir}

I've tried providing up to 720 GB of memory, and I still get a memory error: "Could not allocate foundDiagonals memory in QueryMatcher". This happens during the stage:

Init data structures...
Compute score and coverage.
Touch data file /tmp/global2/nyoungblut/LLMGAG_27929269397/linclust/genes_db ... Done.
Touch data file /ebio/abt3_projects/databases_no-backup/uniclust/uniclust90_2018_08_db ... Done.
Query database type: Aminoacid
Target database type: Aminoacid
Calculation of Smith-Waterman alignments.
................................................................................................... 1 Mio. sequences processed
.......

Is there a good way of reducing the memory usage for mmseqs taxonomy? I didn't see anything in the script doc or the wiki on reducing memory usage for taxonomy inference.

milot-mirdita commented 5 years ago

Could you post the full log? MMseqs2 should be okay with far less memory than you gave it; it sounds like you ran into another bug somehow.

nick-youngblut commented 5 years ago

Thanks for the quick response! Here's the whole log:

Program call:
taxonomy -e 1e-5 --start-sens 1 -s 6 --sens-steps 3 --lca-ranks phylum:superphylum:subkingdom:kingdom:superkingdom --threads 24 /tmp/global2/nyoungblut/LLMGAG_27929269397/linclust/genes_db /ebio/abt3_projects/databases_no-backup/uniclust/uniclust90_2018_08_db /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/genes_tax_db /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/tmp/

MMseqs Version:                                                             7.4e23d
Sub Matrix                                                                  blosum62.out
Add backtrace                                                               false
Alignment mode                                                              2
E-value threshold                                                           1e-05
Seq. Id Threshold                                                           0
Seq. Id. Mode                                                               0
Alternative alignments                                                      0
Coverage threshold                                                          0
Coverage Mode                                                               0
Max. sequence length                                                        65535
Max. results per query                                                      300
Compositional bias                                                          1
Realign hit                                                                 false
Max Reject                                                                  2147483647
Max Accept                                                                  2147483647
Include identical Seq. Id.                                                  false
Preload mode                                                                0
Pseudo count a                                                              1
Pseudo count b                                                              1.5
Score bias                                                                  0
Gap open cost                                                               11
Gap extension cost                                                          1
Threads                                                                     24
Verbosity                                                                   3
Sensitivity                                                                 6
K-mer size                                                                  0
K-score                                                                     2147483647
Alphabet size                                                               21
Offset result                                                               0
Split DB                                                                    0
Split mode                                                                  2
Split Memory Limit                                                          0
Diagonal Scoring                                                            1
Exact k-mer matching                                                        0
Mask Residues                                                               1
Minimum Diagonal score                                                      15
Spaced Kmer                                                                 1
Spaced k-mer pattern
Local temporary path
Rescore mode                                                                0
Remove hits by seq.id. and coverage                                         false
Sort results                                                                0
In substitution scoring mode, performs global alignment along the diagonal  false
Mask profile                                                                1
Profile e-value threshold                                                   0.001
Use global sequence weighting                                               false
Filter MSA                                                                  1
Maximum sequence identity threshold                                         0.9
Minimum seq. id.                                                            0
Minimum score per column                                                    -20
Minimum coverage                                                            0
Select n most diverse seqs                                                  1000
Omit Consensus                                                              false
Min codons in orf                                                           30
Max codons in length                                                        32734
Max orf gaps                                                                2147483647
Contig start mode                                                           2
Contig end mode                                                             2
Orf start mode                                                              0
Forward Frames                                                              1,2,3
Reverse Frames                                                              1,2,3
Translation Table                                                           1
Use all table starts                                                        false
Offset of numeric ids                                                       0
Add Orf Stop                                                                false
Number search iterations                                                    1
Start sensitivity                                                           1
Search steps                                                                3
Run a seq-profile search in slice mode                                      false
Strand selection                                                            1
Disk space limit                                                            0
Sets the MPI runner
Remove Temporary Files                                                      false
LCA Ranks                                                                   phylum:superphylum:subkingdom:kingdom:superkingdom
Blacklisted Taxa                                                            12908,28384
LCA Mode                                                                    2
Remove Temporary Files                                                      false
Sets the MPI runner

Program call:
search /tmp/global2/nyoungblut/LLMGAG_27929269397/linclust/genes_db /ebio/abt3_projects/databases_no-backup/uniclust/uniclust90_2018_08_db /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/tmp//15538800487586745695/first /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/tmp//15538800487586745695/tmp_hsp1 --sub-mat blosum62.out -a 0 --alignment-mode 2 -e 1e-05 --min-seq-id 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --max-seqs 300 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 24 -v 3 -s 6 -k 0 --k-score 2147483647 --alph-size 21 --offset-result 0 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --global-alignment 0 --mask-profile 1 --e-profile 0.001 --wg 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --omit-consensus 0 --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --add-orf-stop 0 --num-iterations 1 --start-sens 1 --sens-steps 3 --slice-search 0 --strand 1 --disk-space-limit 0 --remove-tmp-files 0

MMseqs Version:                                                             7.4e23d
Sub Matrix                                                                  blosum62.out
Add backtrace                                                               false
Alignment mode                                                              2
E-value threshold                                                           1e-05
Seq. Id Threshold                                                           0
Seq. Id. Mode                                                               0
Alternative alignments                                                      0
Coverage threshold                                                          0
Coverage Mode                                                               0
Max. sequence length                                                        65535
Max. results per query                                                      300
Compositional bias                                                          1
Realign hit                                                                 false
Max Reject                                                                  2147483647
Max Accept                                                                  2147483647
Include identical Seq. Id.                                                  false
Preload mode                                                                0
Pseudo count a                                                              1
Pseudo count b                                                              1.5
Score bias                                                                  0
Gap open cost                                                               11
Gap extension cost                                                          1
Threads                                                                     24
Verbosity                                                                   3
Sensitivity                                                                 6
K-mer size                                                                  0
K-score                                                                     2147483647
Alphabet size                                                               21
Offset result                                                               0
Split DB                                                                    0
Split mode                                                                  2
Split Memory Limit                                                          0
Diagonal Scoring                                                            1
Exact k-mer matching                                                        0
Mask Residues                                                               1
Minimum Diagonal score                                                      15
Spaced Kmer                                                                 1
Spaced k-mer pattern
Local temporary path
Rescore mode                                                                0
Remove hits by seq.id. and coverage                                         false
Sort results                                                                0
In substitution scoring mode, performs global alignment along the diagonal  false
Mask profile                                                                1
Profile e-value threshold                                                   0.001
Use global sequence weighting                                               false
Filter MSA                                                                  1
Maximum sequence identity threshold                                         0.9
Minimum seq. id.                                                            0
Minimum score per column                                                    -20
Minimum coverage                                                            0
Select n most diverse seqs                                                  1000
Omit Consensus                                                              false
Min codons in orf                                                           30
Max codons in length                                                        32734
Max orf gaps                                                                2147483647
Contig start mode                                                           2
Contig end mode                                                             2
Orf start mode                                                              0
Forward Frames                                                              1,2,3
Reverse Frames                                                              1,2,3
Translation Table                                                           1
Use all table starts                                                        false
Offset of numeric ids                                                       0
Add Orf Stop                                                                false
Number search iterations                                                    1
Start sensitivity                                                           1
Search steps                                                                3
Run a seq-profile search in slice mode                                      false
Strand selection                                                            1
Disk space limit                                                            0
Sets the MPI runner
Remove Temporary Files                                                      false

Program call:
align /tmp/global2/nyoungblut/LLMGAG_27929269397/linclust/genes_db /ebio/abt3_projects/databases_no-backup/uniclust/uniclust90_2018_08_db /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/tmp//15538800487586745695/tmp_hsp1/17220669400861690567/pref_1.000 /tmp/global2/nyoungblut/LLMGAG_27929269397/taxonomy/tmp//15538800487586745695/tmp_hsp1/17220669400861690567/aln_1.000 --sub-mat blosum62.out -a 0 --alignment-mode 2 -e 1e-05 --min-seq-id 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --max-seqs 300 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 24 -v 3

MMseqs Version:             7.4e23d
Sub Matrix                  blosum62.out
Add backtrace               false
Alignment mode              2
E-value threshold           1e-05
Seq. Id Threshold           0
Seq. Id. Mode               0
Alternative alignments      0
Coverage threshold          0
Coverage Mode               0
Max. sequence length        65535
Max. results per query      300
Compositional bias          1
Realign hit                 false
Max Reject                  2147483647
Max Accept                  2147483647
Include identical Seq. Id.  false
Preload mode                0
Pseudo count a              1
Pseudo count b              1.5
Score bias                  0
Gap open cost               11
Gap extension cost          1
Threads                     24
Verbosity                   3

Init data structures...
Compute score and coverage.
Touch data file /tmp/global2/nyoungblut/LLMGAG_27929269397/linclust/genes_db ... Done.
Touch data file /ebio/abt3_projects/databases_no-backup/uniclust/uniclust90_2018_08_db ... Done.
Query database type: Aminoacid
Target database type: Aminoacid
Calculation of Smith-Waterman alignments.
................................................................................................... 1 Mio. sequences processed
.......

milot-mirdita commented 5 years ago

What is the error message?

The error "Could not allocate foundDiagonals memory in QueryMatcher" should only be able to occur during the prefiltering stage, not the alignment stage.

nick-youngblut commented 5 years ago

Could not allocate foundDiagonals memory in QueryMatcher is the only error message that I received.

I was running this in a snakemake pipeline, which retried the run with progressively more memory (240, 480, 720 GB). Each time, I got the error "Could not allocate foundDiagonals memory in QueryMatcher", and the log file looked the same (with fewer dots at the end of the log file when less memory was used).

milot-mirdita commented 5 years ago

I am not sure how snakemake implements its memory limit, but you might have to tell the MMseqs2 prefilter how much memory it is allowed to use via the --split-memory-limit parameter. By default, MMseqs2 assumes it is supposed to use the whole machine.

For example, use --split-memory-limit 200000000 for about 200 GB of maximum memory. However, I think the description text is slightly wrong: the parameter expects the memory in kilobytes, not megabytes. I have to double-check that.
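
To make the arithmetic behind that example value concrete, here is a minimal shell sketch. It assumes the kilobyte reading suggested above (still unconfirmed in this thread) and decimal units, so 200 GB becomes 200000000. The helper name gb_to_kb and the placeholders seqDB, taxDB, outDB and tmp_dir are illustrative only; the command is printed, not run.

```shell
# Hypothetical helper: convert decimal gigabytes to kilobytes
# (1 GB = 1,000,000 KB). Whether --split-memory-limit really expects
# kilobytes is an assumption taken from the comment above.
gb_to_kb() {
    echo $(( $1 * 1000 * 1000 ))
}

# Limit for roughly 200 GB of memory.
limit_kb=$(gb_to_kb 200)

# Print (do not execute) a taxonomy call carrying the limit.
# seqDB, taxDB, outDB and tmp_dir are placeholders, not real paths.
echo "mmseqs taxonomy --threads 24 --split-memory-limit ${limit_kb} seqDB taxDB outDB tmp_dir"
```

If the parameter turns out to expect megabytes after all, only the conversion factor in gb_to_kb would need to change.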

nick-youngblut commented 5 years ago

Sorry for not making the memory limit clear: snakemake is just running qsub jobs for me, and it's just setting different amounts of memory (e.g., qsub -l h_vmem=720G).

I'll try --split-memory-limit and see if it fixes the problem.
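
For reference, the escalation described earlier in this thread (240, 480, 720 GB across retries) can be sketched as a small shell helper. This is a hypothetical illustration, not the actual pipeline code, which was not posted: the function name h_vmem_for_attempt and the 240 GB step are assumptions taken from the numbers mentioned in the thread. It only prints the qsub resource request and submits nothing.

```shell
# Hypothetical helper: memory request for retry attempt N (1-based),
# growing in 240 GB steps as described in the thread (240, 480, 720).
h_vmem_for_attempt() {
    n=$1
    step_gb=240
    echo "h_vmem=$(( n * step_gb ))G"
}

# Show the resource request each retry would use; nothing is submitted.
for attempt in 1 2 3; do
    echo "qsub -l $(h_vmem_for_attempt "$attempt")"
done
```

The third line printed corresponds to the qsub -l h_vmem=720G example above.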

nick-youngblut commented 5 years ago

It turns out that the issue wasn't a memory error, but instead a bug in my pipeline code that killed the job prematurely. Sorry to waste your time on this.