[Question] How do I cluster Uniref90 after it has been translated into a 5 letter alphabet?

Ieremie commented 1 year ago

Expected Behavior

Cluster Uniref90 after it has been translated into a 5-letter alphabet. Cluster it to 90% identity and 80% overlap (similar to how Uniref90 was initially generated).

Current Behavior

mmseq linclust runs fast, but returns a large number of clusters mmseq cluster takes a long time at the prefiltering stage

Steps to Reproduce (for bugs)

# $1 = alphabet name
mmseqs createdb "uniref90-$1.fasta" "uniref90-DB-$1"
mmseqs linclust "uniref90-DB-$1"  "uniref90-DB-$1_clu" tmp/ --cov-mode 2 -c 0.8 --min-seq-id 0.9
mmseqs createsubdb "uniref90-DB-$1_clu"  "uniref90-DB-$1"  "uniref90-DB-$1_clu_rep"
mmseqs convert2fasta "uniref90-DB-$1_clu_rep"  "uniref90-DB-$1_clu_rep.fasta"

MMseqs Output (for bugs)

Mseqs Version: 15.6f452 Database type 0 Shuffle input database true Createdb mode 0 Write lookup file 1 Offset of numeric ids 0 Compressed 0 Verbosity 3

Converting sequences !!! cut the part here to make it smaller !!! 76 Mio. sequences processed Time for merging to uniref90-DB-wwmj_h: 0h 0m 28s 629ms Time for merging to uniref90-DB-wwmj: 0h 0m 52s 227ms Database type: Aminoacid Time for processing: 0h 8m 25s 481ms

cluster uniref90-DB-wwmj uniref90-DB-wwmj_clu tmp/ --cov-mode 2 -c 0.8 --min-seq-id 0.9

MMseqs Version: 15.6f452 Substitution matrix aa:blosum62.out,nucl:nucleotide.out Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out Sensitivity 4 k-mer length 0 Target search mode 0 k-score seq:2147483647,prof:2147483647 Alphabet size aa:21,nucl:5 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 2 Compositional bias 1 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask residues probability 0.9 Mask lower case residues 0 Minimum diagonal score 15 Selected taxa
Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a substitution:1.100,context:1.400 Pseudo count b substitution:4.100,context:5.800 Spaced k-mer pattern
Local temporary path
Threads 64 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Alignment mode 0 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0.9 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Correlation score weight 0 Gap open cost aa:11,nucl:5 Gap extension cost aa:1,nucl:2 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Weight file name
Cluster Weight threshold 0.9 Single step clustering false Cascaded clustering steps 3 Cluster reassign false Remove temporary files false Force restart with latest tmp false MPI runner
k-mers per sequence 21 Scale k-mers per sequence aa:0.000,nucl:0.200 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 1.000000 Set cluster mode GREEDY MEM Set cluster iterations to 1 linclust uniref90-DB-wwmj tmp//16518381706844811497/clu_redundancy tmp//16518381706844811497/linclust --cluster-mode 3 --max-iterations 1000 --similarity -type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignme nt-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp -bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,co ntext:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0. 000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 - -ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher uniref90-DB-wwmj tmp//16518381706844811497/linclust/1189823685496030198/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:1 3,nucl:5 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 -- mask-lower-case 0 --cov-mode 2 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 - -threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 76215872 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 76.22M 6m 32s 415ms Sort kmer 0h 0m 14s 675ms Sort by rep. sequence 0h 0m 0s 974ms Time for fill: 0h 0m 13s 496ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 7m 46s 594ms rescorediagonal uniref90-DB-wwmj uniref90-DB-wwmj tmp//16518381706844811497/linclust/1189823685496030198/pref tmp//16518381706844811497/linclust/11898236 85496030198/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov -mode 2 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 64 --compressed 0 -v 3

[================================================================] =76.22M 19m 41s 19ms Time for merging to pref_rescore1: 0h 0m 50s 277ms Time for processing: 0h 21m 16s 644ms clust uniref90-DB-wwmj tmp//16518381706844811497/linclust/1189823685496030198/pref_rescore1 tmp//16518381706844811497/linclust/1189823685496030198/pre_cl ust --cluster-mode 3 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Greedy Low Mem Total time: 0h 0m 28s 473ms

Size of the sequence database: 76215872 Size of the alignment database: 76215872 Number of clusters: 74926526

Writing results 0h 0m 17s 970ms Time for merging to pre_clust: 0h 0m 0s 1ms Time for processing: 0h 1m 0s 559ms createsubdb tmp//16518381706844811497/linclust/1189823685496030198/order_redundancy uniref90-DB-wwmj tmp//16518381706844811497/linclust/11898236854960301 98/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 28s 72ms createsubdb tmp//16518381706844811497/linclust/1189823685496030198/order_redundancy tmp//16518381706844811497/linclust/1189823685496030198/pref tmp//1651 8381706844811497/linclust/1189823685496030198/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 39s 536ms filterdb tmp//16518381706844811497/linclust/1189823685496030198/pref_filter1 tmp//16518381706844811497/linclust/1189823685496030198/pref_filter2 --filter -file tmp//16518381706844811497/linclust/1189823685496030198/order_redundancy --threads 64 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 74.93M 16s 227ms Time for merging to pref_filter2: 0h 0m 46s 160ms Time for processing: 0h 1m 35s 246ms rescorediagonal tmp//16518381706844811497/linclust/1189823685496030198/input_step_redundancy tmp//16518381706844811497/linclust/1189823685496030198/input _step_redundancy tmp//16518381706844811497/linclust/1189823685496030198/pref_filter2 tmp//16518381706844811497/linclust/1189823685496030198/pref_rescore2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 2 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 64 --compressed 0 -v 3

[================================================================] =74.93M 28m 19s 496ms Time for merging to pref_rescore2: 0h 0m 51s 111ms Time for processing: 0h 29m 44s 106ms align tmp//16518381706844811497/linclust/1189823685496030198/input_step_redundancy tmp//16518381706844811497/linclust/1189823685496030198/input_step_redu ndancy tmp//16518381706844811497/linclust/1189823685496030198/pref_rescore2 tmp//16518381706844811497/linclust/1189823685496030198/aln --sub-mat 'aa:blos um62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mo de 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 -- add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-sc ore-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 64 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 74926526 type: Aminoacid Target database size: 74926526 type: Aminoacid Calculation of alignments [=================================================================] 74.93M 3m 46s 500ms Time for merging to aln: 0h 0m 54s 46ms 92521862 alignments calculated 75488967 sequence pairs passed the thresholds (0.815904 of overall calculated) 1.007507 hits per query sequence Time for processing: 0h 5m 20s 32ms clust tmp//16518381706844811497/linclust/1189823685496030198/input_step_redundancy tmp//16518381706844811497/linclust/1189823685496030198/aln tmp//165183 81706844811497/linclust/1189823685496030198/clust --cluster-mode 3 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-w eight-threshold 0.9

Clustering mode: Greedy Low Mem Total time: 0h 0m 10s 117ms

Size of the sequence database: 74926526 Size of the alignment database: 74926526 Number of clusters: 74390789

Writing results 0h 0m 17s 406ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 44s 227ms mergeclusters uniref90-DB-wwmj tmp//16518381706844811497/clu_redundancy tmp//16518381706844811497/linclust/1189823685496030198/pre_clust tmp//16518381706 844811497/linclust/1189823685496030198/clust --threads 64 --compressed 0 -v 3

Clustering step 1 [=================================================================] 74.93M 20s 981ms Clustering step 2 [=================================================================] 74.39M 1m 0s 461ms Write merged clustering [=================================================================] 76.22M 1m 11s 703ms Time for merging to clu_redundancy: 0h 0m 49s 731ms Time for processing: 0h 2m 22s 84ms createsubdb tmp//16518381706844811497/clu_redundancy uniref90-DB-wwmj tmp//16518381706844811497/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 29s 367ms prefilter tmp//16518381706844811497/input_step_redundancy tmp//16518381706844811497/input_step_redundancy tmp//16518381706844811497/pref_step0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 1 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2 147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 74390789 type: Aminoacid Estimated memory consumption: 400G Target database size: 74390789 type: Aminoacid Index table k-mer threshold: 174 at k-mer size 7 Index table: counting k-mers [=================================================================] 74.39M 1m 27s 612ms Index table: Masked residues: 17937595794 Index table: fill [=================================================================] 74.39M 13s 623ms Index statistics Entries: 495972652 DB size: 12603 MB Avg k-mer size: 0.387479 Top 10 k-mers CSSCCCS 40539 CSSCSCS 36980 CSCSSCS 36180 CSCSCSS 35290 CSSCDSS 34425 CSCCSSS 33905 CSSCSSS 32740 CSCDSSS 31596 SCSCDSS 30280 CSCSDSS 29715 Time for index table init: 0h 2m 0s 290ms Process prefiltering step 1 of 1

k-mer similarity threshold: 174 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 74390789 Target db start 1 to 74390789 [==

Context

I have translated all the sequences in Uniref90 based on a reduced amino acid alphabet that has only 5 letters (C, A, D, G, S). I have tried clustering it again using linclust to 90% identity and 80% overlap, however, the number of clusters is larger than I expected (75M from 76M initial Uniref90 size). I thought that might be due to the reduced sensitivity as a further reduced amino acid alphabet is applied before computing the k-mers.

I've tried to use mmseq cluster instead, however, this seems to take a very long time. After 13h the progress bar has only two '=' signs.

I am not sure if I am approaching the right way. Is linclust capable to cluster a protein sequence database that has been translated to a reduced amino acid alphabet or does it interfere with the mmseq alphabet used during k-mer comparison?

I was expecting a much smaller number of clusters as sequences only use 5 letters now so they are more likely to be more similar.

For mmseq cluster, what is the expected computation time for 76M sequences?

Your Environment

Running on a 64-core and 500GB RAM Linux machine.

milot-mirdita commented 1 year ago

That's an interesting application of MMseqs2's clustering. It should be possible to do what you want however it will require much more parameter tweaking. Also did you generate your own substitution matrix?

I think the main point is to increase the k-mer size as you will generate tons of spurious matches with 6/7-mers and a alphabet size of 5. Increasing the k-mer size of cluster will be tricky since it will start to use enormous amounts of memory if you don't also reduce the alphabet size, however reducing the alphabet size will generate it's own reduced alphabet that will likely conflict with yours.

So linclust is probably the way to go, try a k-mer size of 11 and see if that works.

Ieremie commented 1 year ago

Hi @milot-mirdita, thanks for the quick reply!

I am not creating a new substitution matrix. I am using the default parameters. Increasing the k-mer size seems intuitive, I'll try that.

I am doing this for multiple alphabet reduction schemes to see the amount of reduction in dataset size. Should I keep the k-mer size of 11 across the alphabets or scale it based on the alphabet size? (e.g. longer k-mers for smaller alphabets).

milot-mirdita commented 1 year ago

No guarantees but that's why I would first try. Try k-mer sizes from 6 to 15. More things might go wrong though, as you are breaking some pretty fundamental assumptions.

You could also try to hack your reduction schemes directly into MMseqs2: https://github.com/soedinglab/MMseqs2/blob/master/src/prefiltering/ReducedMatrix.cpp

However, reduced alphabets are not being used for alignment, only for prefiltering/kmermatching.

Ieremie commented 1 year ago

Hi @milot-mirdita. I've done a parameter search over the kmer length and number of kmers. It seems that for small alphabets there is a need for a higher number of kmers. I am assuming it is hard to find kmer diagonal matches if there are many spurious examples.

What is the effect of specifying the alphabet size? For example, setting it to 21, does that mean that the mmseqs alphabet is not used anymore? I am not sure what other parameters might help clustering these Frankenstein looking fasta files.

linclust

milot-mirdita commented 1 year ago

Changing the alphabet size will cause it to use MMseqs2's built-in alphabet reduction. Since you seem to be trying various reduced alphabets I assume that you don't want to it to interfere with your procedure. I didn't think about this last time, but by default linclust does a reduction to a 13 letter alphabet. You should set the alphabetsize to 21 (20+X) so it does not interfere.

One more thing for linclust: You should try to also increase --kmer-per-seq to about 80 to generate more k-mers per sequence. This does increase sensitivity.

Ieremie commented 1 year ago

Thanks for the response @milot-mirdita.

I've explored that parameter too. It seems that Kmer length: 25 | Nr kmers: 160 works the best for small alphabets. I'll see if using the full alphabet improves the clustering.

soedinglab / MMseqs2