steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Question regarding `easy-cluster` dying during Prefilter step 0 #227

Open Immortals-33 opened 5 months ago

Immortals-33 commented 5 months ago

Dear Foldseek Team:

Thank you for this amazing tool and its careful maintenance! I'm using Foldseek to cluster a number of PDB files, but it runs into trouble under specific circumstances.

Expected Behavior

I'm using `foldseek easy-cluster` in TM-align mode to cluster a folder containing $100$ PDB files, each of length $50$. The command I ran is:
foldseek easy-cluster ./input_pdbs test ./tmp --alignment-type 1 --alignment-mode 2 --tmscore-threshold 0.5

Current Behavior

Foldseek easy-cluster dies at Prefilter step 0 with the message: No k-mer could be extracted for the database tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy_ss. Maybe the sequences length is less than 14 residues.

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
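
As a minimal sketch of that advice, the helpers below create a new, empty tmp folder on every call and assemble the command from this report around it. The function names `fresh_tmp_dir` and `build_easy_cluster_command` are hypothetical; only the `foldseek easy-cluster` arguments are taken from the report above.

```python
import tempfile
from pathlib import Path


def fresh_tmp_dir(parent="."):
    """Create a new, empty directory so no stale intermediate files are reused."""
    return Path(tempfile.mkdtemp(prefix="foldseek_tmp_", dir=parent))


def build_easy_cluster_command(input_dir, out_prefix, tmp_dir):
    """Return the easy-cluster argument list from this report for the given paths."""
    return [
        "foldseek", "easy-cluster", str(input_dir), str(out_prefix), str(tmp_dir),
        "--alignment-type", "1",
        "--alignment-mode", "2",
        "--tmscore-threshold", "0.5",
    ]
```

The resulting list can be passed to `subprocess.run(...)`, for example `subprocess.run(build_easy_cluster_command("./input_pdbs", "test", fresh_tmp_dir()))`, assuming `foldseek` is on the `PATH`.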

Foldseek Output (for bugs)

MMseqs Version:                         8.ef4e960
Substitution matrix                     aa:3di.out,nucl:3di.out
Seed substitution matrix                aa:3di.out,nucl:3di.out
Sensitivity                             4
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Max sequence length                     65535
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Coverage threshold                      0
Coverage mode                           0
Compositional bias                      1
Compositional bias                      1
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                1
Minimum diagonal score                  30
Selected taxa                           
Spaced k-mers                           1
Preload mode                            0
Spaced k-mer pattern                    
Local temporary path                    
Threads                                 64
Compressed                              0
Verbosity                               3
TMscore threshold                       0.5
LDDT threshold                          0
Sort by structure bit score             1
Alignment type                          1
Add backtrace                           false
Alignment mode                          2
Alignment mode                          0
E-value threshold                       10
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max reject                              2147483647
Max accept                              2147483647
Gap open cost                           aa:10,nucl:10
Gap extension cost                      aa:1,nucl:1
TMalign hit order                       0
TMalign fast                            1
Cluster mode                            0
Max connected component depth           1000
Similarity type                         2
Weight file name                        
Cluster Weight threshold                0.9
Single step clustering                  false
Cascaded clustering steps               3
Cluster reassign                        false
Remove temporary files                  true
Force restart with latest tmp           false
MPI runner                              
k-mers per sequence                     21
Scale k-mers per sequence               aa:0.000,nucl:0.200
Adjust k-mer length                     false
Shift hash                              67
Include only extendable                 false
Skip repeating k-mers                   false
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Chain name mode                         0
Write mapping file                      0
Mask b-factor threshold                 0
Coord store mode                        2
Write lookup file                       1
Tar Inclusion Regex                     .*
Tar Exclusion Regex                     ^$
File Inclusion Regex                    .*
File Exclusion Regex                    ^$

Create directory tmp//2011124407538103508/clu_tmp
cluster tmp//2011124407538103508/input tmp//2011124407538103508/clu tmp//2011124407538103508/clu_tmp --tmscore-threshold 0.5 --alignment-type 1 --alignment-mode 2 --remove-tmp-files 1 

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
kmermatcher tmp//2011124407538103508/input_ss tmp//2011124407538103508/clu_tmp/15340415873745362011/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9 

kmermatcher tmp//2011124407538103508/input_ss tmp//2011124407538103508/clu_tmp/15340415873745362011/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9 

Database size: 100 type: Aminoacid
Reduced amino acid alphabet: (A F) (C V) (D B) (E Z) (G H) (I M T) (K W) (L J) (N R S) (P) (Q) (Y) (X) 

Generate k-mers list for 1 split
[=================================================================] 100 0s 19ms
Sort kmer 0h 0m 0s 20ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 77ms
structurerescorediagonal tmp//2011124407538103508/input tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/pref tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_rescore1 --tmscore-threshold 0.5 --lddt-threshold 0 --alignment-type 1 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3 

[=================================================================] 100 0s 147ms
Time for merging to pref_rescore1: 0h 0m 0s 18ms
Time for processing: 0h 0m 0s 455ms
clust tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_rescore1 tmp//2011124407538103508/clu_tmp/15340415873745362011/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9 

Clustering mode: Set Cover
[=================================================================] 100 0s 0ms
Sort entries
Find missing connections
Found 109 new connections.
Reconstruct initial order
[=================================================================] 100 0s 0ms
Add missing connections
[=================================================================] 100 0s 0ms

Time for read in: 0h 0m 0s 3ms
Total time: 0h 0m 0s 9ms

Size of the sequence database: 100
Size of the alignment database: 100
Number of clusters: 56

Writing results 0h 0m 0s 0ms
Time for merging to pre_clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 10ms
createsubdb tmp//2011124407538103508/clu_tmp/15340415873745362011/order_redundancy tmp//2011124407538103508/clu_tmp/15340415873745362011/pref tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_filter1 -v 3 --subdb-mode 1 

Time for merging to pref_filter1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
filterdb tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_filter1 tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_filter2 --filter-file tmp//2011124407538103508/clu_tmp/15340415873745362011/order_redundancy --threads 64 --compressed 0 -v 3 

Filtering using file(s)
[=================================================================] 56 0s 3ms
Time for merging to pref_filter2: 0h 0m 0s 13ms
Time for processing: 0h 0m 0s 129ms
tmalign tmp//2011124407538103508/input tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_filter2 tmp//2011124407538103508/clu_tmp/15340415873745362011/aln.linclust --min-seq-id 0 -c 0.8 --cov-mode 0 --max-rejected 2147483647 --max-accept 2147483647 -a 0 --add-self-matches 0 --tmscore-threshold 0.5 --tmalign-hit-order 0 --tmalign-fast 1 --db-load-mode 0 --threads 64 -v 3 

Query database: tmp//2011124407538103508/input
Target database: tmp//2011124407538103508/input
[=================================================================] 56 0s 13ms
Time for merging to aln.linclust: 0h 0m 0s 13ms
Time for processing: 0h 0m 0s 143ms
createsubdb tmp//2011124407538103508/clu_tmp/15340415873745362011/order_redundancy tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/pre_clustered_seqs -v 3 --subdb-mode 1 

Time for merging to pre_clustered_seqs: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
clust tmp//2011124407538103508/clu_tmp/15340415873745362011/pre_clustered_seqs tmp//2011124407538103508/clu_tmp/15340415873745362011/aln.linclust tmp//2011124407538103508/clu_tmp/15340415873745362011/clust.linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9 

Clustering mode: Set Cover
[=================================================================] 56 0s 0ms
Sort entries
Find missing connections
Found 87 new connections.
Reconstruct initial order
[=================================================================] 56 0s 0ms
Add missing connections
[=================================================================] 56 0s 0ms

Time for read in: 0h 0m 0s 4ms
Total time: 0h 0m 0s 9ms

Size of the sequence database: 56
Size of the alignment database: 56
Number of clusters: 14

Writing results 0h 0m 0s 0ms
Time for merging to clust.linclust: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 11ms
mergeclusters tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/clu_redundancy tmp//2011124407538103508/clu_tmp/15340415873745362011/pre_clust tmp//2011124407538103508/clu_tmp/15340415873745362011/clust.linclust --threads 64 --compressed 0 -v 3 

Clustering step 1
[=================================================================] 56 0s 4ms
Clustering step 2
[=================================================================] 14 0s 15ms
Write merged clustering
[=================================================================] 100 0s 46ms
Time for merging to clu_redundancy: 0h 0m 0s 14ms
Time for processing: 0h 0m 0s 62ms
createsubdb tmp//2011124407538103508/clu_tmp/15340415873745362011/clu_redundancy tmp//2011124407538103508/input_ss tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy_ss -v 3 --subdb-mode 1 

Time for merging to input_step_redundancy_ss: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
createsubdb tmp//2011124407538103508/clu_tmp/15340415873745362011/clu_redundancy tmp//2011124407538103508/input_ca tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy_ca -v 3 --subdb-mode 1 

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
createsubdb tmp//2011124407538103508/clu_tmp/15340415873745362011/clu_redundancy tmp//2011124407538103508/input tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy -v 3 --subdb-mode 1 

Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
prefilter tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy_ss tmp//2011124407538103508/clu_tmp/15340415873745362011/input_step_redundancy_ss tmp//2011124407538103508/clu_tmp/15340415873745362011/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3 

Query database size: 14 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 14 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 14 0s 0ms
Index table: Masked residues: 0
Error: Prefilter step 0 died
Error: Search died
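
For triaging a long log like the one above, a small sketch (not part of Foldseek; `summarize_log` is a hypothetical helper) can pull out the cluster counts reported by each `clust` call and any `Error:` lines. Applied to this log it would show the 100 input structures being reduced to 56 and then 14 representatives before the prefilter step fails.

```python
import re


def summarize_log(log_text):
    """Return (cluster counts in order of appearance, error messages) from a log."""
    counts = [
        int(n)
        for n in re.findall(r"^Number of clusters: (\d+)\s*$", log_text, flags=re.M)
    ]
    errors = [
        msg.strip()
        for msg in re.findall(r"^Error: (.+)$", log_text, flags=re.M)
    ]
    return counts, errors
```

This only matches the two line patterns that appear in this log; other Foldseek versions may word these lines differently.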

Context

  1. I've tried tuning parameters such as -s, -e, -k, --alignment-mode, and --similarity-type, but none of them resolved the issue.
  2. All structures in the input PDB folder have length $50$, but they look very similar to each other (mainly $\alpha$-helical), which I suspect might be one cause of the error. Perhaps Foldseek clustering in TM-align mode does not support proteins that are too similar?
  3. This is case-specific: I applied the same workflow to other PDB folders (including some with lengths shorter than $50$), and they all worked well.
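
Since the error message mentions sequences shorter than 14 residues, one way to rule that out is to count residues per file directly. The sketch below is an assumption-laden check, not Foldseek's own logic: it counts distinct residues with a CA atom in `ATOM` records of the first model, using the fixed PDB column layout. The names `count_ca_residues` and `short_structures` are hypothetical, and the threshold of 14 is taken only from the error text.

```python
from pathlib import Path


def count_ca_residues(pdb_path):
    """Count distinct (chain, resSeq, iCode) residues that have a CA atom."""
    seen = set()
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("ENDMDL"):
                break  # only look at the first model
            # Columns 13-16 hold the atom name in the PDB format.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                seen.add((line[21:22], line[22:26], line[26:27]))
    return len(seen)


def short_structures(folder, min_len=14):
    """Map file name -> residue count for *.pdb files below min_len residues."""
    result = {}
    for path in sorted(Path(folder).glob("*.pdb")):
        n = count_ca_residues(path)
        if n < min_len:
            result[path.name] = n
    return result
```

For example, `short_structures("./input_pdbs")` would return an empty dictionary if every file really has at least 14 residues by this count, which would suggest the length hint in the error message is not the actual cause here.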

Your Environment

tomato-cmyk commented 5 months ago

I have the same problem