Open silvainp opened 1 year ago
Hi @silvainp @milot-mirdita @martin-steinegger !
I came across this old issue while googling because I ran into pretty much the same problem, and it doesn't seem to have been resolved yet. I can confirm that I got very similar output to @silvainp in a clustering problem I've been working on. The issue persisted with the most recent release (15) of MMseqs2.
After some digging, I found a (silent) problem in my input FAA files: they were corrupted by a few binary characters that an upstream step had accidentally introduced. Once I manually removed all binary characters, clustering worked as intended. This may have been the issue here as well, although of course that is hard to tell from a distance.
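In case it helps anyone who wants to do the same cleanup without editing by hand, here is a minimal Python sketch. It is not part of MMseqs2. The function name `strip_binary_bytes` is made up for illustration, and it assumes that "binary characters" means any byte outside printable ASCII, tab, LF and CR. Keep the original file, since dropping bytes blindly can hide a deeper problem upstream.

```python
from pathlib import Path

# Assumption: anything outside printable ASCII, tab, LF and CR counts as a
# "binary character". Adjust ALLOWED if your headers contain other bytes.
ALLOWED = bytes(range(0x20, 0x7F)) + b"\t\n\r"
DISALLOWED = bytes(b for b in range(256) if b not in ALLOWED)


def strip_binary_bytes(src, dst):
    """Copy src to dst without disallowed bytes.

    Returns the number of bytes that were removed.
    Hypothetical helper for illustration, not an MMseqs2 function.
    """
    data = Path(src).read_bytes()
    # bytes.translate(None, delete) removes every byte listed in `delete`.
    cleaned = data.translate(None, DISALLOWED)
    Path(dst).write_bytes(cleaned)
    return len(data) - len(cleaned)
```

A non-zero return value means the input contained stray bytes, and the cleaned copy at `dst` is the file to retry clustering with.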
What's important, in my opinion, is that the createdb step did not catch these characters and threw no error or warning. @milot-mirdita I don't want to tinker with the code for a pull request, but I'd suggest building a check into the db-building step that validates the input FASTA files.
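To make the suggestion a bit more concrete, this is roughly the kind of check I have in mind, written as a standalone Python sketch that can be run on a file before createdb. It is not how MMseqs2 works internally. The name `find_fasta_problems` and the allowed alphabet (letters plus `*` and `-` on sequence lines, printable ASCII and tab on header lines) are my own assumptions.

```python
# Assumption: sequence lines may only contain letters, '*' (stop) and '-' (gap).
SEQ_CHARS = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz*-"
)


def find_fasta_problems(path, max_reports=20):
    """Return a list of (line_number, message) for suspicious FASTA lines.

    Hypothetical helper for illustration, not an MMseqs2 function.
    """
    problems = []
    seen_header = False
    # Read as bytes so stray binary data cannot raise a decoding error.
    with open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            if not line:
                continue
            if line.startswith(b">"):
                seen_header = True
                # Headers: accept tab and printable ASCII only.
                bad = [b for b in line if b != 0x09 and not 0x20 <= b <= 0x7E]
            else:
                if not seen_header:
                    problems.append((line_no, "sequence data before first header"))
                bad = [b for b in line if b not in SEQ_CHARS]
            if bad:
                codes = " ".join(f"0x{b:02x}" for b in sorted(set(bad)))
                problems.append((line_no, "unexpected bytes: " + codes))
            if len(problems) >= max_reports:
                break
    return problems
```

An empty list means nothing suspicious was found under these assumptions. Anything else gives the line numbers to inspect before building the database.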
Best regards,
Sebastian
Hi all, thanks for MMseqs2, which seems very efficient.
Unfortunately, it will not run on my machine:
mmseqs easy-cluster /Users/s/Documents/Albatros/protein//short_name-Group/true_plus300_proteins-13-strains_shortname-group.fa /Users/s/Documents/short_name-Group/clusterRes /Volumes/s/tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1
Create directory /Volumes/s/tmp
easy-cluster /Users/s/Documents/Albatros/short_name-Group/true_plus300_proteins-13-strains_shortname-group.fa /Users/s/Documents/Albatros/protein/short_name-Group/clusterRes /Volumes/s/tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1
MMseqs Version: 14-7e284
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 4
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 65535
Max results per query 20
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 1
Compositional bias 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 16
Compressed 0
Verbosity 3
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.5
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Single step clustering false
Cascaded clustering steps 3
Cluster reassign false
Remove temporary files true
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false
Database type 0
Shuffle input database true
Createdb mode 1
Write lookup file 0
Offset of numeric ids 0
createdb /Users/s/Documents/protein/true_plus300_proteins-13-strains_shortname-group.fa /Volumes/s/tmp/3581369344000996149/input --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
[1718] 0s 24ms
Time for merging to input_h: 0h 0m 0s 10ms
Time for merging to input: 0h 0m 0s 10ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 81ms
Create directory /Volumes/s/tmp/3581369344000996149/clu_tmp
cluster /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu /Volumes/s/tmp/3581369344000996149/clu_tmp --max-seqs 20 -c 0.8 --cov-mode 1 --spaced-kmer-mode 1 --alignment-mode 3 -e 0.001 --min-seq-id 0.5 --remove-tmp-files 1
Set cluster sensitivity to -s 3.000000
Set cluster mode GREEDY MEM
Set cluster iterations to 3
linclust /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/clu_redundancy /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust --cluster-mode 3 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 1 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 1 --force-reuse 0
kmermatcher /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3
kmermatcher /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3
Database size: 1807 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
Generate k-mers list for 1 split
[=================================================================] 100.00% 1.81K 0s 10ms
Sort kmer 0h 0m 0s 3ms
Sort by rep. sequence 0h 0m 0s 2ms
Time for fill: 0h 0m 0s 1ms
Time for merging to pref: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 63ms
rescorediagonal /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3
[=================================================================] 100.00% 1.81K 0s 7ms
Time for merging to pref_rescore1: 0h 0m 0s 64ms
Time for processing: 0h 0m 0s 310ms
clust /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_rescore1 /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pre_clust --cluster-mode 3 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3
Clustering mode: Greedy Low Mem
Total time: 0h 0m 0s 13ms
Size of the sequence database: 1807
Size of the alignment database: 1807
Number of clusters: 757
Writing results 0h 0m 0s 0ms
Time for merging to pre_clust: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 49ms
createsubdb /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/order_redundancy /Volumes/s/tmp/3581369344000996149/input /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/input_step_redundancy -v 3 --subdb-mode 1
Time for merging to input_step_redundancy: 0h 0m 0s 10ms
Time for processing: 0h 0m 0s 70ms
createsubdb /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/order_redundancy /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_filter1 -v 3 --subdb-mode 1
Time for merging to pref_filter1: 0h 0m 0s 10ms
Time for processing: 0h 0m 0s 41ms
filterdb /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_filter1 /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_filter2 --filter-file /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/order_redundancy --threads 16 --compressed 0 -v 3
Filtering using file(s)
[=================================================================] 100.00% 757 0s 4ms
Time for merging to pref_filter2: 0h 0m 0s 62ms
Time for processing: 0h 0m 0s 270ms
rescorediagonal /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/input_step_redundancy /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/input_step_redundancy /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_filter2 /Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/pref_rescore2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3
/Volumes/s/tmp/3581369344000996149/clu_tmp/10544097544295592317/linclust/18419612973359567408/linclust.sh: line 68: 88046 Segmentation fault: 11 $RUNNER "$MMSEQS" rescorediagonal "$INPUT" "$INPUT" "$RESULTDB" "${TMP_PATH}/pref_rescore2" ${UNGAPPED_ALN_PAR}
Error: Ungapped alignment step died
Error: linclust died
Error: Search died
I'm running MMseqs2 14-7e284 on a Mac with an Intel i9.
Thank you all for your help.