ruthalee commented 4 months ago

Hello, I have been using foldseek to cluster pdbs folded with alphafold and it had been working perfectly. Now I am getting an error trying to cluster. Here is the code. I have left out all of the [=====> for brevity.

(foldseek) me@ahl03 [~/packages/fold_seek] % foldseek easy-cluster Bins_Geo_Shew_ranked_0_pdbs Bins_Geo_Shew tmp -c 0.8 --cov-mode 0

Create directory tmp
easy-cluster Bins_Geo_Shew_ranked_0_pdbs Bins_Geo_Shew tmp -c 0.8 --cov-mode 0

MMseqs Version: 8.ef4e960
Substitution matrix aa:3di.out,nucl:3di.out
Seed substitution matrix aa:3di.out,nucl:3di.out
Sensitivity 4
k-mer length 0
Target search mode 0
k-score seq:2147483647,prof:2147483647
Max sequence length 65535
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 0
Compositional bias 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 1
Minimum diagonal score 30
Selected taxa
Spaced k-mers 1
Preload mode 0
Spaced k-mer pattern
Local temporary path
Threads 256
Compressed 0
Verbosity 3
TMscore threshold 0
LDDT threshold 0
Sort by structure bit score 1
Alignment type 2
Add backtrace false
Alignment mode 0
Alignment mode 0
E-value threshold 10
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
TMalign hit order 0
TMalign fast 1
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Weight file name
Cluster Weight threshold 0.9
Single step clustering false
Cascaded clustering steps 3
Cluster reassign false
Remove temporary files true
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Chain name mode 0
Write mapping file 0
Mask b-factor threshold 0
Coord store mode 2
Write lookup file 1
Tar Inclusion Regex .
Tar Exclusion Regex ^$
File Inclusion Regex .
File Exclusion Regex ^$

createdb Bins_Geo_Shew_ranked_0_pdbs tmp/15597438964095582814/input --chain-name-mode 0 --write-mapping 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --tar-include '.' --tar-exclude '^$' --file-include '.' --file-exclude '^$' --threads 256 -v 3

Output file: tmp/15597438964095582814/input

Time for merging to input_ss: 0h 0m 1s 158ms
Time for merging to input_h: 0h 0m 1s 209ms
Time for merging to input_ca: 0h 0m 1s 264ms
Time for merging to input: 0h 0m 1s 51ms
Ignore 0 out of 985.
Too short: 0, incorrect: 0, not proteins: 0.
Time for processing: 0h 0m 42s 56ms
Create directory tmp/15597438964095582814/clu_tmp
cluster tmp/15597438964095582814/input tmp/15597438964095582814/clu tmp/15597438964095582814/clu_tmp -c 0.8 --cov-mode 0 --remove-tmp-files 1

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
kmermatcher tmp/15597438964095582814/input_ss tmp/15597438964095582814/clu_tmp/11654376807347694794/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 256 --compressed 0 -v 3 --cluster-weight-threshold 0.9

kmermatcher tmp/15597438964095582814/input_ss tmp/15597438964095582814/clu_tmp/11654376807347694794/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 256 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 985 type: Aminoacid
Reduced amino acid alphabet: (A F) (C V) (D B) (E Z) (G H) (I M T) (K W) (L J) (N R S) (P) (Q) (Y) (X)

Generate k-mers list for 1 split

Sort kmer 0h 0m 14s 954ms
Sort by rep. sequence 0h 0m 1s 18ms
Time for fill: 0h 0m 0s 2ms
Time for merging to pref: 0h 0m 0s 5ms
Time for processing: 0h 0m 26s 949ms
structurerescorediagonal tmp/15597438964095582814/input tmp/15597438964095582814/input tmp/15597438964095582814/clu_tmp/11654376807347694794/pref tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_rescore1 --tmscore-threshold 0 --lddt-threshold 0 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 256 --compressed 0 -v 3

Time for merging to pref_rescore1: 0h 0m 0s 140ms
Time for processing: 0h 0m 10s 736ms
clust tmp/15597438964095582814/input tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_rescore1 tmp/15597438964095582814/clu_tmp/11654376807347694794/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 256 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover

Sort entries
Find missing connections
Found 3532 new connections.
Reconstruct initial order

Add missing connections

Time for read in: 0h 0m 17s 5ms
Total time: 0h 0m 23s 840ms

Size of the sequence database: 985
Size of the alignment database: 985
Number of clusters: 442

Writing results 0h 0m 0s 0ms
Time for merging to pre_clust: 0h 0m 0s 970ms
Time for processing: 0h 0m 24s 861ms
createsubdb tmp/15597438964095582814/clu_tmp/11654376807347694794/order_redundancy tmp/15597438964095582814/clu_tmp/11654376807347694794/pref tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 6ms
Time for processing: 0h 0m 0s 123ms
filterdb tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_filter1 tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_filter2 --filter-file tmp/15597438964095582814/clu_tmp/11654376807347694794/order_redundancy --threads 256 --compressed 0 -v 3

Filtering using file(s)

Time for merging to pref_filter2: 0h 0m 0s 148ms
Time for processing: 0h 0m 8s 119ms
structurealign tmp/15597438964095582814/input tmp/15597438964095582814/input tmp/15597438964095582814/clu_tmp/11654376807347694794/pref_filter2 tmp/15597438964095582814/clu_tmp/11654376807347694794/aln.linclust --tmscore-threshold 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 256 --compressed 0 -v 3

tmp/15597438964095582814/clu_tmp/11654376807347694794/clustering.sh: line 123: 2986178 Killed $RUNNER "$MMSEQS" $ALIGNMENT_ALGO "${INPUT}${ALN_EXTENSION}" "${INPUT}${ALN_EXTENSION}" "${TMP_PATH}/pref_filter2" "${TMP_PATH}/aln.linclust" ${ALIGNMENT_PAR}
Error: Alignment step died
Error: Search died

I looked up the clustering.sh line:

# 4. Clustering using greedy set cover.

if notExists "${TMP_PATH}/clust.linclust.dbtype"; then

# shellcheck disable=SC2086,SC2153

  "$MMSEQS" clust "${TMP_PATH}/pre_clustered_seqs" "${TMP_PATH}/aln.linclust" "${TMP_PATH}/clust.linclust" ${CLUSTER_PAR} \
      || fail "Clustering step died"

fi

if notExists "${TMP_PATH}/clu_redundancy.dbtype"; then

# shellcheck disable=SC2086

  if [ "${RUN_ITERATIVE}" = "1" ]; then
     "$MMSEQS" mergeclusters "$SOURCE" "${TMP_PATH}/clu_redundancy" "${TMP_PATH}/pre_clust" "${TMP_PATH}/clust.linclust" $MERGECLU_PAR \
        || fail "mergeclusters died"
  else
     "$MMSEQS" mergeclusters "$SOURCE" "$2" "${TMP_PATH}/pre_clust" "${TMP_PATH}/clust.linclust" $MERGECLU_PAR \
        || fail "mergeclusters died"
  fi

fi fi <----- line 123

I installed foldseek with mamba into its own environment on a Linux x64 system. After this problem appeared, I ran foldseek on a PDB set that had clustered successfully before, and it also failed. I uninstalled and reinstalled foldseek on the off chance that something odd had happened in my environment, but that did not fix the problem. Any idea what is happening? Thanks so much!

milot-mirdita commented 4 months ago

"Killed" sounds like the operating system's out-of-memory killer terminated the process for using too much RAM. What's odd is that the input set is really small (985 structures) and that the alignment is usually not the step that causes issues.
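One way to check the out-of-memory hypothesis is to search the kernel log for OOM-killer entries around the time of the failure. The sketch below is only a rough filter: the function name `find_oom_kills` is made up for illustration, and the search patterns assume typical Linux kernel wording, which can differ between kernel versions. Reading the kernel log may also require elevated privileges, and on a shared cluster the scheduler's own job accounting may be the better place to look.

```shell
#!/usr/bin/env bash
# find_oom_kills (hypothetical helper): filter log text on stdin for lines
# that look like Linux OOM-killer messages. The patterns are an assumption
# about typical kernel wording, not an exhaustive list.
find_oom_kills() {
    grep -i -E 'out of memory|oom-kill|killed process'
}

# Typical usage (left commented out because it may need root):
#   dmesg -T | find_oom_kills
#   journalctl -k | find_oom_kills
```

If a matching line names the foldseek process ID from the error message (2986178 in the log above), that would support the out-of-memory explanation.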

Are there any absurdly long proteins in the set?
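A quick way to answer this is to count C-alpha atoms per PDB file as a proxy for protein length and list the largest entries. This is a minimal sketch, not part of foldseek: the helper names `count_ca` and `list_longest` are invented here, the default directory is the input folder from the command above, and the parsing assumes standard fixed-column PDB `ATOM` records with files ending in `.pdb`. Multi-chain files are counted as a whole, and alternate locations would be counted more than once.

```shell
#!/usr/bin/env bash
# count_ca (hypothetical helper): number of C-alpha ATOM records in one PDB file.
# Assumes fixed-column PDB format, with the atom name in columns 13-16.
count_ca() {
    awk 'substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " { n++ }
         END { print n + 0 }' "$1"
}

# list_longest (hypothetical helper): print "<length><TAB><file>" for the
# largest structures in a directory, longest first.
list_longest() {
    local dir="${1:-Bins_Geo_Shew_ranked_0_pdbs}" top="${2:-10}"
    for f in "$dir"/*.pdb; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        printf '%s\t%s\n' "$(count_ca "$f")" "$f"
    done | sort -k1,1nr | head -n "$top"
}
```

Running `list_longest Bins_Geo_Shew_ranked_0_pdbs 5` would then show the five longest structures, which makes outliers easy to spot.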

ruthalee commented 4 months ago

Thank you! That was the problem. I did have a couple of proteins ~ 2300 aa long. Fortunately I am on an HPC cluster and can just increase the RAM usage. I appreciate your help!

martin-steinegger commented 4 months ago

@ruthalee is this resolved?

ruthalee commented 4 months ago

@martin-steinegger yes, thank you

steineggerlab / foldseek

Error: Alignment step died - new problem after previous successful use #241
