steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Segfault on prefiltering step with large k-mer databases #160

Open richardshuai opened 1 year ago

richardshuai commented 1 year ago

I am attempting to cluster a dataset of ~1.3M backbone-only structures (so all residues are "glycine"), each about 70 residues long. I have made sure all PDBs are well-formed (i.e. every residue has all 4 backbone atoms and no file is empty). I'm running foldseek cluster with --similarity-type 1, --tmscore-threshold 0.99, -c 0.99, and --cluster-reassign. k is automatically determined to be 6.

I'm not sure of the reason, but whenever the prefiltering step encounters a large number of k-mers, it segfaults (Error: Prefilter step 1 died). Using -s 1.0 keeps the number of entries smaller, and clustering then completes without any segfault. Using the default sensitivity with fewer PDBs (~100K) also succeeds; from testing different subsets of my dataset, the segfault seems to depend purely on dataset size. Is there a workaround that lets me keep the sensitivity of the prefiltering step without running into segfaults?
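The subset testing described above could be made repeatable with something like the following sketch. It draws nested random subsets (each smaller subset is a prefix of the larger ones) and bisects for the smallest size at which a run fails. The function names and the `fails` callback are hypothetical and are not part of foldseek. The bisection assumes that failure is monotonic in dataset size, which matches the observation here but is not guaranteed.

```python
import random
from typing import Callable, Dict, List, Sequence


def nested_subsets(paths: Sequence[str], sizes: Sequence[int], seed: int = 0) -> Dict[int, List[str]]:
    """Return one random subset per requested size.

    All subsets are prefixes of a single seeded shuffle, so a smaller
    subset is always contained in every larger one.
    """
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def smallest_failing_size(lo: int, hi: int, fails: Callable[[int], bool]) -> int:
    """Bisect for the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (an assumption, not something foldseek guarantees).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical file names; in practice these would be the PDB paths.
    pdbs = [f"structure_{i}.pdb" for i in range(1000)]
    subsets = nested_subsets(pdbs, [10, 100, 1000])
    print({n: len(s) for n, s in subsets.items()})
```

In practice `fails(n)` would build a database from the first `n` shuffled structures, run the same `foldseek cluster` command, and return whether it exited with a non-zero status.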

I have tried running the same command with 1024GB of RAM and with 8TB of disk space for tmp, but the same error occurs.
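For comparing a run that completes (for example with -s 1.0) against one that segfaults, the per-step prefilter counters that foldseek prints (index `Entries`, `DB matches per sequence`, `sequences passed prefiltering per query sequence`) can be pulled out of a saved log. This is a minimal sketch, assuming the log was saved as plain text with the line formats printed by this foldseek version. The function and field names are illustrative and not part of foldseek.

```python
import re
from typing import Dict, Iterable, List

# Field name -> (pattern, type to cast the captured value to).
FIELDS = {
    "sensitivity": (re.compile(r"(?:^|\s)-s\s+(\d+(?:\.\d+)?)(?=\s|$)"), float),
    "query_db_size": (re.compile(r"^Query database size: (\d+)"), int),
    "index_entries": (re.compile(r"^Entries:\s+(\d+)"), int),
    "db_matches_per_seq": (re.compile(r"^(\d+) DB matches per sequence"), int),
    "passed_per_query": (re.compile(r"^(\d+) sequences passed prefiltering per query sequence"), int),
}


def parse_prefilter_steps(lines: Iterable[str]) -> List[Dict[str, float]]:
    """Collect one record per 'prefilter ...' invocation found in the log.

    A record only contains the fields that were actually printed, so a
    step that died early shows up as a record with missing keys.
    """
    steps: List[Dict[str, float]] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("prefilter "):
            current = {}
            steps.append(current)
        if current is None:
            continue
        for key, (pattern, cast) in FIELDS.items():
            if key not in current:
                match = pattern.search(line)
                if match:
                    current[key] = cast(match.group(1))
    return steps


if __name__ == "__main__":
    # "foldseek.log" is a hypothetical file name for the saved output.
    with open("foldseek.log") as handle:
        for i, step in enumerate(parse_prefilter_steps(handle)):
            print(i, step)
```

Because the sensitivity flag is searched on every line of a step, the sketch also works when the terminal has hard-wrapped the `prefilter` command across several lines.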

Here is the full output of the segfaulting command:

cluster /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tm0.99c0.99k0_reassign /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp --similarity-type 1 --tmscore-threshold 0.99 -c 0.99 --cluster-reassign -k 0

MMseqs Version:                         6.29e2557                                                                                                                                                                                                             
Substitution matrix                     aa:3di.out,nucl:3di.out                                                                                                                                                                                               
Seed substitution matrix                aa:3di.out,nucl:3di.out                                                                                                                                                                                               
Sensitivity                             4                                                                                                                                                                                                                     
k-mer length                            0                                                                                                                                                                                                                     
k-score                                 seq:2147483647,prof:2147483647                                                                                                                                                                                        
Max sequence length                     65535                                                                                                                                                                                                                 
Max results per query                   1000                                                                                                                                                                                                                  
Split database                          0                                                                                                                                                                                                                     
Split mode                              2                                                                                                                                                                                                                     
Split memory limit                      0                                                                                                                                                                                                                     
Coverage threshold                      0.99                                                                                                                                                                                                                  
Coverage mode                           0                                                                                                                                                                                                                     
Compositional bias                      0                                                                                                                                                                                                                     
Compositional bias                      1                                                                                                                                                                                                                     
Diagonal scoring                        true                                                                                                                                                                                                                  
Exact k-mer matching                    0                                                                                                                                                                                                                     
Mask residues                           0                                                                                                                                                                                                                     
Mask residues probability               0.9                                                                                                                                                                                                                   
Mask lower case residues                1                                                                                                                                                                                                                     
Minimum diagonal score                  30                                                                                                                                                                                                                    
Selected taxa                                                                                                                                                                                                                                                 
Spaced k-mers                           1                                                                                                                                                                                                                     
Preload mode                            0                                                                                                                                                                                                                     
Spaced k-mer pattern                                                                                                                                                                                                                                          
Local temporary path                                                                                                                                                                                                                                          
Threads                                 12                                                                                                                                                                                                                    
Compressed                              0                                                                                                                                                                                                                     
Verbosity                               3                                                                                                                                                                                                                     
TMscore threshold                       0.99                                                                                                                                                                                                                  
LDDT threshold                          0                                                                                                                                                                                                                     
Sort by structure bit score             0                                                                                                                                                                                                                     
Add backtrace                           false                                                                                                                                                                                                                 
Alignment mode                          3                                                                                                                                                                                                                     
Alignment mode                          0                                                                                                                                                                                                                     
E-value threshold                       0.01                                                                                                                                                                                                                  
Seq. id. threshold                      0                                                                                                                                                                                                                     
Min alignment length                    0                                                                                                                                                                                                                     
Seq. id. mode                           0                                                                                                                                                                                                                     
Alternative alignments                  0                                                                                                                                                                                                                     
Max reject                              2147483647                                                                                                                                                                                                            
Max accept                              2147483647                                                                                                                                                                                                            
Gap open cost                           aa:10,nucl:10                                                                                                                                                                                                         
Gap extension cost                      aa:1,nucl:1                                                                                                                                                                                                           
Rescore mode                            0                                                                                                                                                                                                                     
Remove hits by seq. id. and coverage    false                                                                                                                                                                                                                 
Sort results                            0                                                                                                                                                                                                                     
TMalign hit order                       0                  
TMalign fast                            1
Cluster mode                            0
Max connected component depth           1000
Similarity type                         1
Weight file name
Cluster Weight threshold                0.9
Single step clustering                  false
Cascaded clustering steps               3
Cluster reassign                        true
Remove temporary files                  false
Force restart with latest tmp           false
MPI runner
k-mers per sequence                     300
Scale k-mers per sequence               aa:0.000,nucl:0.200
Adjust k-mer length                     false
Shift hash                              67
Include only extendable                 false
Skip repeating k-mers                   false

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
kmermatcher /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref --sub-ma
t 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.99 --max-seq-len
65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 12 --compressed 0 -v 3 --cluster-weight-threshold 0.9

kmermatcher /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref --sub-ma
t 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.99 --max-seq-len
65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 12 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 1338443 type: Aminoacid
Reduced amino acid alphabet: (A F) (C V) (D B) (E Z) (G H) (I M T) (K W) (L J) (N R S) (P) (Q) (Y) (X)

Generate k-mers list for 1 split
[=================================================================] 100.00% 1.34M 2s 160ms
Sort kmer 0h 0m 5s 735ms
Sort by rep. sequence 0h 0m 0s 198ms
Time for fill: 0h 0m 0s 337ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 12s 2ms
structurerescorediagonal /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preproc
essed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_rescore1 --tmscore-th
reshold 0.99 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.99 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-b
ias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs
 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 12 --compressed 0 -v 3

[=================================================================] 100.00% 1.34M 16s 509ms
Time for merging to pref_rescore1: 0h 0m 1s 295ms
Time for processing: 0h 0m 19s 552ms
clust /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_rescore1 /media/c
urrent/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 12 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 100.00% 1.34M 0s 582ms
Sort entries
Find missing connections
Found 288520 new connections.
Reconstruct initial order
[=================================================================] 100.00% 1.34M 0s 653ms
Add missing connections
[=================================================================] 100.00% 1.34M 0s 224ms

Time for read in: 0h 0m 1s 938ms
Total time: 0h 0m 2s 506ms

Size of the sequence database: 1338443
Size of the alignment database: 1338443
Number of clusters: 1127324

Writing results 0h 0m 0s 474ms
Time for merging to pre_clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 3s 608ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/order_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cl
uster_dbs/tmp/18069091982169965387/pref /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 338ms
filterdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_filter1 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_d
bs/tmp/18069091982169965387/pref_filter2 --filter-file /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/order_redundancy --threads 12 --compressed 0 -v 3

Filtering using file(s)
[=================================================================] 100.00% 1.13M 0s 295ms
Time for merging to pref_filter2: 0h 0m 1s 154ms
Time for processing: 0h 0m 2s 496ms
structurealign /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/fold
seek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_filter2 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/aln.linclust --tmscore-thres
hold 0.99 --lddt-threshold 0 --sort-by-structure-bits 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.99 --cov-mode 0 --ma
x-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 -
-realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 12 --compressed 0 -v 3

[=================================================================] 100.00% 1.13M 44s 277ms
Time for merging to aln.linclust: 0h 0m 1s 66ms
Time for processing: 0h 0m 46s 588ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/order_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db
 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pre_clustered_seqs -v 3 --subdb-mode 1

Time for merging to pre_clustered_seqs: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 699ms
clust /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pre_clustered_seqs /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluste
r_dbs/tmp/18069091982169965387/aln.linclust /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clust.linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --thre
ads 12 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 100.00% 1.13M 0s 411ms
Sort entries
Find missing connections
Found 0 new connections.
Reconstruct initial order
[=================================================================] 100.00% 1.13M 0s 494ms
Add missing connections
[=================================================================] 100.00% 1.13M 0s 85ms

Time for read in: 0h 0m 1s 387ms
Total time: 0h 0m 1s 604ms
Size of the sequence database: 1127324
Size of the alignment database: 1127324
Number of clusters: 1127324

Writing results 0h 0m 0s 311ms
Time for merging to clust.linclust: 0h 0m 0s 0ms
Time for processing: 0h 0m 2s 529ms
mergeclusters /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_redundancy
 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pre_clust /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/1806
9091982169965387/clust.linclust --threads 12 --compressed 0 -v 3

Clustering step 1
[=================================================================] 100.00% 1.13M 0s 263ms
Clustering step 2
[=================================================================] 100.00% 1.13M 0s 628ms
Write merged clustering
[=================================================================] 100.00% 1.34M 0s 854ms
Time for merging to clu_redundancy: 0h 0m 1s 117ms
Time for processing: 0h 0m 2s 500ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db_s
s /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy_ss -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ss: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 687ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db_c
a /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy_ca -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 725ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/db /
media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 685ms
prefilter /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.99 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 12 --compressed 0 -v 3

Query database size: 1127324 type: Aminoacid
Estimated memory consumption: 2G
Target database size: 1127324 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 1.13M 0s 861ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 1.13M 0s 714ms
Index statistics
Entries:          794562
DB size:          492 MB
Avg k-mer size:   0.012415
Top 10 k-mers
    KWDQHN      56902
    WTGVGI      22650
    DCWNWW      17061
    WWDCRN      12750
    HNRGNF      6920
    TFNRDI      6587
    CTPPKT      5284
Time for index table init: 0h 0m 3s 434ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 154
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1127324
Target db start 1 to 1127324
[=================================================================] 100.00% 1.13M 54s 828ms

0.017676 k-mers per position
9325 DB matches per sequence
0 overflows
34 sequences passed prefiltering per query sequence
1 median result list length
0 sequences with 0 size result lists
Time for merging to pref_step0: 0h 0m 1s 286ms
Time for processing: 0h 1m 2s 152ms
structurealign /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_fil
tered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_step0 /media/current/datasets/jaffe/preprocessed/folds
eek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/aln_step0 --tmscore-threshold 0.99 --lddt-threshold 0 --sort-by-structure-bits 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wra
pped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.99 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0
 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads
 12 --compressed 0 -v 3

[=================================================================] 100.00% 1.13M 6m 36s 886ms
Time for merging to aln_step0: 0h 0m 1s 223ms
Time for processing: 0h 6m 39s 440ms
clust /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/clu
ster_dbs/tmp/18069091982169965387/aln_step0 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 1
2 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 100.00% 1.13M 0s 407ms
Sort entries
Find missing connections
Found 119297 new connections.
Reconstruct initial order
[=================================================================] 100.00% 1.13M 0s 579ms
Add missing connections
[=================================================================] 100.00% 1.13M 0s 164ms

Time for read in: 0h 0m 1s 582ms
Total time: 0h 0m 2s 13ms

Size of the sequence database: 1127324
Size of the alignment database: 1127324
Number of clusters: 1068812

Writing results 0h 0m 0s 441ms
Time for merging to clu_step0: 0h 0m 0s 0ms
Time for processing: 0h 0m 3s 82ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_step0 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_d
bs/tmp/18069091982169965387/input_step_redundancy_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step1_ss -v 3 --subdb-mode 1

Time for merging to input_step1_ss: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 667ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_step0 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy_ca /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step1_ca -v 3 --subdb-mode 1

Time for merging to input_step1_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 672ms
createsubdb /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/clu_step0 /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step_redundancy /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 642ms
prefilter /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step1_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/input_step1_ss /media/current/datasets/jaffe/preprocessed/foldseek_dbs/all_cdrs_bb_only/precluster_filtered/cluster_dbs/tmp/18069091982169965387/pref_step1 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 4.5 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.99 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 12 --compressed 0 -v 3

Query database size: 1068812 type: Aminoacid
Estimated memory consumption: 2G
Target database size: 1068812 type: Aminoacid
Index table k-mer threshold: 123 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 1.07M 1s 154ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 1.07M 1s 977ms
Index statistics
Entries:          37858673
DB size:          704 MB
Avg k-mer size:   0.591542
Top 10 k-mers
    DDDQWW      666989
    PGVHWG      340672
    QLWAQI      284484
    WAVINV      278736
    WAVISV      248699
    DQCWVQ      168003
    VRDDFT      153208
    GSPVGR      131239
    DDDPWW      125369
    WAVIPV      111417
Time for index table init: 0h 0m 5s 257ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 123
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1068812
Target db start 1 to 1068812
Segmentation fault (core dumped)                                  ] 0.00% 1 eta -
Error: Prefilter step 1 died
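
As a sanity check on the index statistics in the log above: the reported "Avg k-mer size" is numerically consistent with the number of index entries divided by the number of possible 6-mers, if one assumes an effective alphabet of 20 letters (the 21 from --alph-size aa:21 minus the unknown letter X). That reading is an assumption inferred from the numbers, not something confirmed from the Foldseek source. A minimal Python sketch, with the values copied from the log:

```python
# Values copied from the prefilter log above.
ENTRIES = 37_858_673        # "Entries"
KMER_SIZE = 6               # "at k-mer size 6"
TOP_KMER_COUNT = 666_989    # count reported for DDDQWW

# Assumption: the index table excludes the unknown letter X,
# leaving 20 of the 21 letters given by --alph-size aa:21.
EFFECTIVE_ALPHABET = 20

possible_kmers = EFFECTIVE_ALPHABET ** KMER_SIZE
avg_entries_per_kmer = ENTRIES / possible_kmers
top_kmer_share = TOP_KMER_COUNT / ENTRIES

print(f"possible k-mers:       {possible_kmers}")
print(f"avg entries per k-mer: {avg_entries_per_kmer:.6f}")
print(f"share of top k-mer:    {top_kmer_share:.2%}")
```

Under that assumption the average comes out as 0.591542, matching the log, while the most frequent k-mer alone holds about 1.76% of all entries. The distribution is therefore heavily skewed; whether that skew is related to the segfault is not established in this thread.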
martin-steinegger commented 1 year ago

@richardshuai, is it possible to share the dataset? I cannot see anything wrong in the log.

Also, I just implemented an alignment mode that considers only the structure and not the amino acids. I recommend using it instead: just add --alignment-type 0 to your clustering command.

richardshuai commented 1 year ago

Thank you for looking into this and for adding the structure-only clustering option. I am still getting the error with this option, but it is definitely more convenient. Sorry for the late response on the dataset; I had to figure out a way to upload it.

The exact dataset I'm using is now available on Zenodo here. It is a dataset of a little over 1.3 million antibody structure predictions, containing just the backbone atoms of the 6 CDR loops (with some anchor residues from the framework at the ends of each CDR). The PDBs are split into 14 .tar.gz files of 100K PDBs each (the last one has fewer than 100K). Let me know if you need more information about the dataset and I am happy to provide it. So far, -s 4.0 seems to work without segfaulting and clusters these PDBs well, but I'd prefer to be able to run with higher sensitivities as well. Thank you!
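
The trial-and-error described above (finding the highest -s that still completes) can be scripted. The following is a minimal Python sketch, not a tested recipe: the function names and the candidate sensitivities are illustrative, the paths are placeholders, and it assumes that foldseek is on PATH and that a crashed run exits with a non-zero status. The flags are only those quoted in this thread.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def build_cluster_cmd(db: str, out: str, tmp: str, sensitivity: float) -> List[str]:
    """Build a `foldseek cluster` command from flags quoted in this thread."""
    return [
        "foldseek", "cluster", db, out, tmp,
        "--alignment-type", "0",        # structure-only mode suggested above
        "--similarity-type", "1",
        "--tmscore-threshold", "0.99",
        "-c", "0.99",
        "--cluster-reassign",
        "-s", str(sensitivity),
    ]


def run_ok(cmd: List[str]) -> bool:
    """Return True if the command exits with status 0."""
    return subprocess.run(cmd).returncode == 0


def highest_passing_sensitivity(
    db: str,
    out_prefix: str,
    tmp: str,
    candidates: Iterable[float],
    runner: Callable[[List[str]], bool] = run_ok,
) -> Optional[float]:
    """Try sensitivities from high to low; return the first one that completes."""
    for s in sorted(candidates, reverse=True):
        cmd = build_cluster_cmd(db, f"{out_prefix}_s{s}", tmp, s)
        if runner(cmd):
            return s
    return None
```

For example, `highest_passing_sensitivity("db", "clu", "tmp", [4.0, 4.5, 7.5])` would attempt up to three clustering runs, starting at 7.5, and return the first sensitivity whose run finishes. The `runner` parameter exists so the search logic can be exercised with a stand-in function without invoking foldseek.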