prefilter step died when running easy-search: Segmentation fault (core dumped)

szimmerman92 commented 2 months ago

Expected Behavior

easy-search should finish execution without errors

Current Behavior

Error during pre-filter step

Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
Segmentation fault (core dumped) ] 0.00% 1 eta -
Error: Prefilter died
Error: Search step died
Error: Search died

Steps to Reproduce (for bugs)

First create a custom nucleotide database

mmseqs createdb --dbtype 2 --compressed 1 refseq_bacteria_archaea_fungi_viral.fna.gz seqTaxDB
mmseqs createtaxdb seqTaxDB tmp --ncbi-tax-dump ncbi-taxdump --tax-mapping-file fastaid_taxid.tsv

Next run easy-search

mmseqs easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp --search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2 --threads 16

MMseqs Output (for bugs)

Below is the output of easy-search

easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp --search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2 --threads 16

MMseqs Version: 8ef39f4151eddcdc78f9c2dadf6b4dd6864435c9
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.01
Seq. id. threshold 0.65
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.8
Coverage mode 2
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 16
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 5.7
k-mer length 0
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 1
Start sensitivity 4
Search steps 1
Prefilter mode 0
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Overlap threshold 0
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 0
Greedy best hits false

createdb all_nuc.fasta tmp/7701176895607249840/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

Converting sequences
[1335322] 2s 17mss
Time for merging to query_h: 0h 0m 0s 221ms
Time for merging to query: 0h 0m 1s 64ms
Database type: Nucleotide
Time for processing: 0h 0m 4s 959ms
Create directory tmp/7701176895607249840/search_tmp

search tmp/7701176895607249840/query seqTaxDB tmp/7701176895607249840/result tmp/7701176895607249840/search_tmp --alignment-mode 3 -e 0.01 --min-seq-id 0.65 -c 0.8 --cov-mode 2 --threads 16 -s 5.7 --search-type 3 --remove-tmp-files 1

splitsequence seqTaxDB tmp/7701176895607249840/search_tmp/9045538653068861586/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 22.15M 12s 856ms
Time for merging to target_seqs_split_h: 0h 0m 31s 837ms
Time for merging to target_seqs_split: 0h 0m 35s 517ms
Time for processing: 0h 1m 59s 373ms

extractframes tmp/7701176895607249840/query tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs --forward-frames 1 --reverse-frames 1 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 1.34M 0s 620ms
Time for merging to query_seqs_h: 0h 0m 0s 734ms
Time for merging to query_seqs: 0h 0m 2s 576ms
Time for processing: 0h 0m 5s 91ms

splitsequence tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 2.67M 0s 919ms
Time for merging to query_seqs_split_h: 0h 0m 0s 832ms
Time for merging to query_seqs_split: 0h 0m 0s 878ms
Time for processing: 0h 0m 3s 919ms

prefilter tmp/7701176895607249840/search_tmp/9045538653068861586/query_seqs_split tmp/7701176895607249840/search_tmp/9045538653068861586/target_seqs_split tmp/7701176895607249840/search_tmp/9045538653068861586/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 15 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 16 --compressed 0 -v 3 -s 5.7

Query database size: 2670930 type: Nucleotide
Target split mode. Searching through 18 splits
Estimated memory consumption: 326G
Target database size: 100684280 type: Nucleotide
Process prefiltering step 1 of 18

Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
Segmentation fault (core dumped) ] 0.00% 1 eta -
Error: Prefilter died
Error: Search step died
Error: Search died

Context

Hi, I am trying to run a nucleotide-nucleotide search in MMseqs2 with a custom database. The error does not occur with a different, smaller nucleotide database.

Thank you very much for this amazing tool and all your hard work.

Your Environment

I am using a Google Cloud VM with 64 CPUs and 416 GB of memory, running Ubuntu 20.04.

I installed MMseqs2 using the static build with AVX2 (fastest):

wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

yuvaranimasarapu commented 1 month ago

I get the same error when running a search against the NCBI NT database in MMseqs2. I am running on our internal server with 256 GB of memory.

jasmezz commented 1 month ago

I've run into segfaults with MMseqs2 when there was not enough memory (running out of memory is a plausible cause of segfaults, according to a quick web search). Large databases like NT/GTDB might need around 900 GB of RAM, so my guess is that too little RAM is the cause in your cases as well.
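
One way to test the memory hypothesis is to compare the memory the machine can actually hand out with the "Estimated memory consumption" line in the prefilter log, and then cap the per-split memory explicitly with --split-memory-limit (the flag appears in the prefilter command above with its default of 0). The following is only a rough sketch, assuming Linux with /proc/meminfo; the helper name suggest_split_limit_gib and the 80% headroom are illustrative choices, not MMseqs2 recommendations, and nothing in this thread confirms that a lower limit removes the segfault.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): take available memory in KiB
# and print 80% of it in whole GiB. The 80% headroom is an assumption.
suggest_split_limit_gib() {
    avail_kib=${1:-0}
    echo $(( avail_kib * 80 / 100 / 1048576 ))
}

# MemAvailable is reported in KiB on Linux; fall back to 0 if it is missing.
avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
avail_kib=${avail_kib:-0}
limit_gib=$(suggest_split_limit_gib "$avail_kib")

echo "Available memory: $(( avail_kib / 1048576 )) GiB"
echo "Possible rerun with an explicit per-split cap:"
echo "mmseqs easy-search all_nuc.fasta seqTaxDB tax_assignments.txt tmp --search-type 3 --min-seq-id 0.65 -e 0.01 -c 0.8 --cov-mode 2 --threads 16 --split-memory-limit ${limit_gib}G"
```

If the available figure is well below the estimate in the log, memory pressure becomes a more likely explanation. If it is comfortably above, as the 416 GB VM versus the 326G estimate would suggest, the cause may lie elsewhere.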
