soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License
1.48k stars 200 forks source link

Profile based clustering not responding to --coverage parameter #698

Closed VictorTobiasson closed 1 year ago

VictorTobiasson commented 1 year ago

Main Issue

Dear mmseqs team,

I am a fairly new user of your software and have run into some unexpected behavior which I am unable to resolve. I am attempting to cluster a database of eukaryotic protein sequences (~1.8*10e6 sequences) using profile search based clustering. I am attempting to iterate or cascade the workflow described in "How to cluster using profiles". The issue I am encountering is that sequences of very different length are merged in clustering despite providing -c 0.8 --covmode 0 during the searches. This causes issues during cascaded clustering as single domain proteins are merged with multi-domain proteins.

Example output after one round of the protocol (described below, -c 0.8, --covmode 0)

poor_MSA

For basic cascaded clustering "mmseqs cluster" or single round clustering using "search" followed by "clust" the behavior appear to function as intended. Perhaps something in the profile generation or implementation of profile against consensus searches affects the interpretation of the -c parameter? Investigating the alignment data of the attached MSA with mmseqs convertalis (attached below) shows that all hits indeed passes the -c 0.8 cutoff? As such perhaps my understanding of what constitutes alignment coverage is lacking and in that case how would one go about retricting the "coverage" to only query-target pairs with lengths within 80% of each other? I have tried --covmode 5 with similar results.

Expected Behavior

Restriction of clustered protein sequence lengths by imposing cover limits via the -c parameter in conjunction with any --covmode

Current Behavior

Clusters containing sequences of vastly different lengths far outside the range of the -c cutoff.

Context

My protocol can be summarized roughly as:

1) Collapse paralogs and create cluster representatives in order to reduce database redundancy using; mmseqs cluster initial-database clusters -s 5 -c 0.8 --min-seq-id 0.9 --cluster_mode 0 --max-iterations 3 --max-seqs 100 --covmode 0

2) Iterate profile generation and searches of profiles against consensus sequences mmseqs search cluster-representatives cluster-representatives representative-search -s 7 -c 0.8 --covmode 0 --maxseqs 300 -e 0.003 mmseqs result2profile cluster-representatives cluster-representatives representative-search profiles mmseqs profile2consensus profiles initial-database consensus mmseqs search profiles consensus profile-search -s 7 -c 0.8 --covmode 0 --maxseqs 300 -e 0.003 mmseqs clust --clustermode 0 consensus profile-search profile-clusters mmseqs createsubdb profile-clusters initial-database new-cluster-representatives

Here new-cluster-representatives are used as input to round two of searches.

Alignment data

query target fident qcov tcov XP_645403.1 XP_640770.1 0.369 0.989 0.967 XP_635472.2 XP_640770.1 0.524 0.987 0.813 XP_638978.1 XP_640770.1 0.311 0.988 0.978 XP_640255.1 XP_640770.1 0.479 0.989 0.978 XP_629182.1 XP_640770.1 0.468 0.989 0.978 XP_645001.1 XP_640770.1 0.441 0.989 0.978 XP_640768.1 XP_640770.1 0.547 1.000 1.000 XP_629181.1 XP_640770.1 0.485 0.989 0.978 XP_643412.2 XP_640770.1 0.523 0.979 0.989 XP_635471.1 XP_640770.1 0.542 0.987 0.813 XP_629090.1 XP_640770.1 0.487 1.000 0.967 XP_640770.1 XP_640770.1 0.605 1.000 1.000 XP_640770.1 XP_640989.2 0.541 0.989 0.940 XP_640770.1 XP_640994.1 0.537 0.989 0.939 XP_640770.1 XP_640768.1 0.513 1.000 1.000 XP_640770.1 XP_640766.1 0.520 0.989 0.990 XP_640770.1 XP_640764.1 0.531 0.989 0.938 XP_640770.1 XP_640763.1 0.532 0.989 0.935 XP_640770.1 XP_640792.1 0.504 0.989 0.989 ... ...

MMseqs Output (for bugs)

Omitting the initial prefilering and providing one round of iterated profile based clustering output.

[-] Unloading mmseqs 2-13-45111-219-gaabc78c [+] Loading mmseqs 2-13-45111-219-gaabc78c result2profile euk72-90.profile euk72-90.consensus euk72-90.profile_search euk72-90.profile2 --threads 48

MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Substitution matrix aa:blosum62.out,nucl:nucleotide.out

Profile E-value threshold 0.001 Compositional bias 1 Compositional bias 1 Global sequence weighting false Allow deletions false Filter MSA 1 Use filter only at N seqs 0 Maximum seq. id. threshold 0.9 Minimum seq. id. 0.0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 1000 Pseudo count mode 0 Pseudo count a substitution:1.100,context:1.400 Pseudo count b substitution:4.100,context:5.800 Preload mode 0 Gap open cost aa:11,nucl:5 Gap extension cost aa:1,nucl:2 Gap pseudo count 10 Threads 48 Compressed 0 Verbosity 3

Query database size: 1270615 type: Profile Target database size: 1270615 type: Aminoacid [=================================================================] 1.27M 21m 19s 868ms Time for merging to euk72-90.profile2: 0h 0m 40s 828ms Time for processing: 0h 23m 5s 420ms profile2consensus euk72-90.profile2 euk72-90.consensus2 --threads 48

MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Substitution matrix aa:blosum62.out,nucl:nucleotide.out Max sequence length 65535 Threads 48 Compressed 0 Verbosity 3

[=================================================================] 1.27M 2m 13s 834ms Time for merging to euk72-90.consensus2: 0h 0m 0s 850ms Time for processing: 0h 2m 15s 974ms search euk72-90.profile2 euk72-90.consensus2 euk72-90.profile_search2 .mmseqs_tmp -s 7 -c 0.8 --threads 48

MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Substitution matrix aa:blosum62.out,nucl:nucleotide.out Add backtrace false Alignment mode 2 Alignment mode 0 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Coverage threshold 0.8 Coverage mode 0 Max sequence length 65535 Compositional bias 1 Compositional bias 1 Max reject 2147483647 Max accept 2147483647 Include identical seq. id. false Preload mode 0 Pseudo count a substitution:1.100,context:1.400 Pseudo count b substitution:4.100,context:5.800 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Correlation score weight 0 Gap open cost aa:11,nucl:5 Gap extension cost aa:1,nucl:2 Zdrop 40 Threads 48 Compressed 0 Verbosity 3 Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out Sensitivity 7 k-mer length 0 k-score seq:2147483647,prof:2147483647 Alphabet size aa:21,nucl:5 Max results per query 300 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask residues probability 0.9 Mask lower case residues 0 Minimum diagonal score 15 Selected taxa
Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Mask profile 1 Profile E-value threshold 0.1 Global sequence weighting false Allow deletions false Filter MSA 1 Use filter only at N seqs 0 Maximum seq. id. threshold 0.9 Minimum seq. id. 0.0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 1000 Pseudo count mode 0 Gap pseudo count 10 Min codons in orf 30 Max codons in length 32734 Max orf gaps 2147483647 Contig start mode 2 Contig end mode 2 Orf start mode 1 Forward frames 1,2,3 Reverse frames 1,2,3 Translation table 1 Translate orf 0 Use all table starts false Offset of numeric ids 0 Create lookup 0 Add orf stop false Overlap between sequences 0 Sequence split mode 1 Header split mode 0 Chain overlapping alignments 0 Merge query 1 Search type 0 Search iterations 1 Start sensitivity 4 Search steps 1 Exhaustive search mode false Filter results during exhaustive search 0 Strand selection 1 LCA search mode false Disk space limit 0 MPI runner
Force restart with latest tmp false Remove temporary files false

prefilter euk72-90.profile2 euk72-90.consensus2 .mmseqs_tmp/10144503605536445033/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 48 --compressed 0 -v 3 -s 7.0

Query database size: 1270615 type: Profile Estimated memory consumption: 7G Target database size: 1270615 type: Aminoacid Index table k-mer threshold: 0 at k-mer size 6 Index table: counting k-mers [=================================================================] 1.27M 6s 285ms Index table: Masked residues: 39476984 Index table: fill [=================================================================] 1.27M 5s 428ms Index statistics Entries: 522806522 DB size: 3479 MB Avg k-mer size: 8.168852 Top 10 k-mers IHDKNI 9354 DVSGLL 6729 LGGFVY 6623 SSSSSS 5575 IGGFVY 4538 LSCHLV 4001 EKDYGV 3708 LIMAGS 3629 FGVKLP 3541 KSVSVV 3538 Time for index table init: 0h 0m 12s 974ms Process prefiltering step 1 of 1

k-mer similarity threshold: 91 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 1270615 Target db start 1 to 1270615 [=================================================================] 1.27M 2h 31m 44s 357ms

1266.108246 k-mers per position 3373560 DB matches per sequence 626307 overflows 21 queries produce too many hits (truncated result) 299 sequences passed prefiltering per query sequence 300 median result list length 601 sequences with 0 size result lists Time for merging to pref_0: 0h 0m 0s 657ms Time for processing: 2h 32m 1s 811ms align euk72-90.profile2 euk72-90.consensus2 .mmseqs_tmp/10144503605536445033/pref_0 euk72-90.profile_search2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 48 --compressed 0 -v 3

Compute score and coverage Query database size: 1270615 type: Profile Target database size: 1270615 type: Aminoacid Calculation of alignments [=================================================================] 1.27M 33m 52s 295ms Time for merging to euk72-90.profile_search2: 0h 0m 0s 650ms 109004199 alignments calculated 56769447 sequence pairs passed the thresholds (0.520801 of overall calculated) 44.678715 hits per query sequence Time for processing: 0h 33m 55s 605ms clust euk72-90.profile2 euk72-90.profile_search2 euk72-90.clust2 --threads 48

MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Cluster mode 0 Max connected component depth 1000 Similarity type 2 Threads 48 Compressed 0 Verbosity 3

Clustering mode: Set Cover [=================================================================] 1.27M 0s 909ms Sort entries Find missing connections Found 12659367 new connections. Reconstruct initial order [=================================================================] 1.27M 1s 538ms Add missing connections [=================================================================] 1.27M 13s 672ms

Your Environment

MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f

VictorTobiasson commented 1 year ago

Apologies but it appears that this is not a real issue. The errors in the alignment were artifacts of extracting alignment data from a cluster_DB not the underlying alignment_DB. Proper association of alignment data with the cluster sequences reveals that the coverage parameter works as intended.

This issue can be closed and deleted.

Best Victor