soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Segmentation Fault - Prefilter died on OMGprot_50 db #915

Open jhoff13 opened 23 hours ago

jhoff13 commented 23 hours ago

Expected Behavior

I'm trying to run mmseqs search against the OMG_prot50 database. The database was built by converting .parquet files to FASTA with a custom script; once converted to an mmseqs db, it is 42 GB.

Current Behavior

The command ends with a prefilter error. I am submitting to nodes with 192 GB of RAM, and while watching the command run, memory usage peaks at only 25 GB before it fails. For that reason, I'm not sure this is a RAM issue.

Steps to Reproduce (for bugs)

python scripts/parquet_to_fasta.py -i path/to/OMG_prot50/data -o path/to/OMG_prot50.fasta
mmseqs createdb path/to/OMG_prot50.fasta path/to/OMG_prot50_db.fasta
mmseqs search path/to/test_db.faa path/to/OMG_prot50_db.fasta /home/gridsan/jhoff/seq/context_testing/mmseqs/results.out tmp

MMseqs Output (for bugs)

search /home/gridsan/jhoff/seq/context_testing/mmseqs/ARG_test_db.faa /data1/groups/solab/OMG_prot50/OMG_mmseqs_db_full/OMG_prot50_db.fasta /home/gridsan/jhoff/seq/context_testing/mmseqs/results.out tmp 

MMseqs Version:                         747c64cc8db3b4803a0f1194a3f75b3ba9f81bcb
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 96
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             5.7
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa                           
Spaced k-mers                           1
Spaced k-mer pattern                    
Local temporary path                    
Use GPU                                 0
Use GPU server                          0
Prefilter mode                          0
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.1
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                              
Force restart with latest tmp           false
Remove temporary files                  false
Translation mode                        0

prefilter /home/gridsan/jhoff/seq/context_testing/mmseqs/ARG_test_db.faa /data1/groups/solab/OMG_prot50/OMG_mmseqs_db_full/OMG_prot50_db.fasta tmp/2665397566262902967/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3 -s 5.7 

Query database size: 1 type: Aminoacid
Target split mode. Searching through 10 splits
Estimated memory consumption: 141G
Target database size: 207248723 type: Aminoacid
Process prefiltering step 1 of 10

Index table k-mer threshold: 122 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 20.72M 46s 224ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 20.72M 18s 37ms
Index statistics
Entries:          4010088421
DB size:          32711 MB
Avg k-mer size:   3.132882
Top 10 k-mers
    GPGGKLL 21779
    GPGGKLA 13254
    GQQVARA 12932
    GGQRVAR 12122
    LAMHETP 9475
    LSGQQAI 8419
    FLNSHRT 8155
    GGRRVAR 7576
    GLGNGKT 7414
    AIGSGKS 6751
Time for index table init: 0h 1m 14s 437ms
k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 10)
Query db start 1 to 1
Target db start 1 to 20716625
[=================================================================] 1 0s 1ms

775.324841 k-mers per position
594313 DB matches per sequence
0 overflows
51 sequences passed prefiltering per query sequence
51 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 0m 0s 124ms
Time for merging to pref_0_tmp_0_tmp: 0h 0m 0s 21ms
Process prefiltering step 2 of 10

Index table k-mer threshold: 122 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 20.72M 27s 811ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 20.72M 17s 321ms
Index statistics
Entries:          4010617253
DB size:          32714 MB
Avg k-mer size:   3.133295
Top 10 k-mers
    GPGGKLL 21795
    GPGGKLA 13360
    GQQVARA 12807
    GGQRVAR 12160
    LAMHETP 9499
    LSGQQAI 8430
    FLNSHRT 8172
    GGRRVAR 7362
    GLGNGKT 7330
    AIGSGKS 6681
Time for index table init: 0h 0m 55s 244ms
k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 2 of 10)
Query db start 1 to 1
Target db start 20716626 to 41441422
[=================================================================] 1 0s 1ms

775.324841 k-mers per position
595252 DB matches per sequence
0 overflows
51 sequences passed prefiltering per query sequence
51 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_1: 0h 0m 0s 73ms
Time for merging to pref_0_tmp_1_tmp: 0h 0m 0s 58ms
Process prefiltering step 3 of 10

Index table k-mer threshold: 122 at k-mer size 7 
Index table: counting k-mers
[=========Segmentation fault
Error: Prefilter died

Context

Not sure why this is failing. I've tried splitting the database into thirds and tenths, with the same error. I have also tried running with -s 1 --max-seqs 100 to reduce the workload. It would be helpful to have this database preloaded in mmseqs.

Your Environment

milot-mirdita commented 10 hours ago

Could you share your conversion script?

Can you run the same command as here to check if there are some broken FASTA entries: https://github.com/soedinglab/MMseqs2/issues/911#issuecomment-2516404541
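
For readers without access to the linked comment, here is a minimal sketch of one way to scan a FASTA file for suspicious records. It is not the command from that comment, which is not reproduced here. The function name `find_broken_fasta_entries` and the `VALID_RESIDUES` set are hypothetical choices for illustration, and what counts as "broken" (empty header, empty sequence, unexpected characters) is an assumption rather than MMseqs2's own validation logic.

```python
# Assumption: the 20 standard amino acids plus common ambiguity/rare
# codes and '*'. Adjust to whatever alphabet the data should contain.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")


def find_broken_fasta_entries(lines):
    """Yield (record_number, header, problem) for suspicious FASTA records.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Record numbers are 1-based; 0 marks data found before the first header.
    """

    def check(number, header, seq_parts):
        seq = "".join(seq_parts)
        if not header.strip():
            yield number, header, "empty header"
        if not seq:
            yield number, header, "empty sequence"
        bad = sorted(set(seq.upper()) - VALID_RESIDUES)
        if bad:
            yield number, header, "unexpected characters: " + "".join(bad)

    number, header, seq_parts = 0, None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, so check it now.
            if header is not None:
                yield from check(number, header, seq_parts)
            number += 1
            header, seq_parts = line[1:], []
        elif header is None:
            if line.strip():
                yield 0, "", "sequence data before first header"
        else:
            seq_parts.append(line.strip())
    # Check the final record, which has no following header to trigger it.
    if header is not None:
        yield from check(number, header, seq_parts)
```

To use it, open the converted FASTA with `open(path)` and iterate over the generator. The file is read line by line, so memory use stays low even for a very large FASTA.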

jhoff13 commented 1 hour ago

Fixed: I converted each parquet file to a separate database; this way it only requires ~20 GB of RAM and runs fine.
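
The per-file workaround could be scripted roughly as follows. This is a sketch under stated assumptions, not the poster's actual script: it assumes each parquet file has already been converted to its own FASTA file, and the helper name `per_shard_commands` and the `_db` / `_results` naming scheme are hypothetical. Only the `mmseqs createdb` and `mmseqs search` invocations shown earlier in this thread are used, and the function only builds the argument lists without running them.

```python
from pathlib import Path


def per_shard_commands(fasta_files, query_db, out_dir, tmp_dir="tmp"):
    """Return mmseqs createdb/search argument lists, one pair per FASTA shard.

    Assumption: `fasta_files` holds one FASTA per original parquet file and
    `query_db` is an existing MMseqs2 query database.
    """
    out_dir = Path(out_dir)
    commands = []
    for fasta in sorted(Path(f) for f in fasta_files):
        # Hypothetical naming scheme: <shard>_db and <shard>_results.
        target_db = out_dir / (fasta.stem + "_db")
        result_db = out_dir / (fasta.stem + "_results")
        commands.append(["mmseqs", "createdb", str(fasta), str(target_db)])
        commands.append(
            ["mmseqs", "search", str(query_db), str(target_db), str(result_db), tmp_dir]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so a failure on one shard stops the run and shows which shard triggered it.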