soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

--cov-mode -c in linclust and cluster #73

Closed kad-ecoli closed 6 years ago

kad-ecoli commented 6 years ago

In the MMseqs2 wiki https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster It explains

--cov-mode 0 --cov-mode 1 --cov-mode 2

are long-cov, target-cov, and query-cov, respectively. While the meaning of query/target is self-evident in many-against-many sequence search, I do not understand what does query and target mean in sequence clustering (linclust or cluster).

For example, https://bitbucket.org/martin_steinegger/linclust-analysis/src/281fa10016de9f206e3d98d89b52d8b14988c6ca/metaclust/clustering.sh It seems metaclust are clustered with target-cov (--cov-mode 1)

Also, I wonder if it is possible to set the following coverage requirement: For each member sequence in a cluster, it must be covered by greater than -c 0.9 by the cluster representative; whereas the coverage of representative by each of its member can be as low as -c 0.

For example:

cluster representative: MAVGTACRPA cluster member: -AVGTAC---

The member coverage would be 6/6=100% (OK)

cluster representative: -AVGTAC--- cluster member: MAVGTACRPA

The member coverage would be 6/10=60% (not OK)

martin-steinegger commented 6 years ago

The covergae mode you mentioned is not possible. Linclust picks at the kmermatcher the longest sequence as representative. So the only case possible is:

cluster representative: MAVGTACRPA
cluster member: -AVGTAC---
kad-ecoli commented 6 years ago

If the representative is always longer than any of its member, how should I set --cov-mode to ensure than each member is covered by greater than -c 0.9 by its representative?

martin-steinegger commented 6 years ago

Target coverage --cov-mode 1. All sequences are clustered that have a sequence length overlap greater than -c X% to the representative sequence.

kad-ecoli commented 6 years ago

I am still not sure what do you mean by "All sequences are clustered that have a sequence length overlap greater than -c X% to the representative sequence."

Let us say we have the following case when --cov-mode 1:

representative IIVGTAA- cluster member --VGTAAL

what is the coverage, 4/6 or 4/5? Is --cov-mode 1 the correct flag if I want mmseqs to consider the coverage as 4/5 (i.e. normalized by member length) for the above case?

kad-ecoli commented 6 years ago

In the context of cluster and linclust, which one is query/target? In other words, does "query" refer to cluster representative or "target" refer to cluster representative.

ksteczk commented 6 years ago

Let me join the discussion. For I understand how all the above work when using --clustering-mode 2 (the cd-hit style), I totally don't understand whats happening when --clustering-mode 1 is chosen. The wiki says:

In connected component clustering starting at the mostly connected vertex, all vertices that are reachable in a breadth-first search are members of the cluster.

So what if the spoken vertex is kind of short and way not representative how do particular --cov-modes work then?

martin-steinegger commented 6 years ago

I updated the documentation.

MMseqs2/Linclust and Linclust has three main criteria to link two sequences by an edge:

(1) a maximum E-value threshold (option -e )

(2) a minimum sequence identity (--min-seq-id [0,1]) with option --alignment-mode 3 defined as the number of identical aligned residues divided by the number of aligned columns including internal gap columns, or, by default, defined by a highly correlated measure, the equivalent similarity score of the local alignment (including gap penalties) divided by the maximum of the lengths of the two locally aligned sequence segments. The score per residue equivalent to a certain sequence identity is obtained by a linear regression using thousands of local alignments as training set.

(3) a minimum coverage (option -c [0,1], which is defined by the number of aligned residue pairs divided by either the minimum of the length of query/centre and target/non-centre sequences (default mode, --cov-mode 0), or by the length of the target/non-centre sequence (--cov-mode 1), or by the length of the query/centre (--cov-mode 2);

https://github.com/soedinglab/MMseqs2/wiki#clustering-criteria

@ksteczk --clustering-mode 1 is similar to BLASTclust. Each connected component is a seperate cluster. See https://en.wikipedia.org/wiki/Connected_component_(graph_theory)

kad-ecoli commented 6 years ago

Thank you very much. This is exactly the answer I am looking for.

kad-ecoli commented 6 years ago

It seems the document has two conflicting pieces of statement. In https://github.com/soedinglab/MMseqs2/wiki#clustering-criteria it is stated that "--cov-mode 0" is "the number of aligned residue pairs divided by either the minimum of the length of query/centre and target/non-centre sequences"

However, in https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster it is stated that "--cov-mode 0" is "a sequence length overlap greater than X% of the longer of the two sequences."

Does "--cov-mode 0" normalized the coverage by the longer sequence or by the shorter sequence?

martin-steinegger commented 6 years ago

Thank you for the feedback! I have rewrote the cluster criteria section

(2) a minimum coverage (option -c [0,1], which is defined by the number of aligned residue pairs divided by either the maximum of the length of query/centre and target/non-centre sequences alnRes/max(qLen,tLen) (default mode, --cov-mode 0), or by the length of the target/non-centre sequence alnRes/tLen (--cov-mode 1), or by the length of the query/centre alnRes/qLen (--cov-mode 2);

ksteczk commented 6 years ago

And what about if I use clust independently when clustering using profiles? The following part of my workflow: making profiles

mmseqs search nr30_db nr70_db nr30_res tmp --num-iterations 3 --e-profile 0.001 -e 0.001 --threads 116
mmseqs result2profile nr30_db nr70_db nr30_res nr30_prof_db tmp --threads 110

and then searching

mmseqs search nr30_prof_db nr30_prof_db_consensus nr30_pp_res_0_05_cov tmp --threads 110 -e 0.05 --add-self-matches -c 0.8 --cov-mode 1
mmseqs clust nr30_prof_db nr30_pp_res_0_05_cov nr30_pp_clu_0_05_cov --threads 110 --cluster-mode 1 --similarity-type 1

produces erroneous huge cluster along with many others. But I suspect it connects too many entries into that one big cluster (encompassing 4.5M sequences out of 10M). So therefore my question is how to control cov-mode within clust as it has not explicit option. Would that be through the search - I provided respective options to the above search but that doesn't solve the big cluster problem...

Of course if the covering options applied to the search itself apply also to subsequent clustering it will be clear to me then that my data (in this case nr database) clusters transitively in too broad manner.

martin-steinegger commented 6 years ago

@ksteczk Using --cluster-mode 1 can result in this huge clusters since it adds all connected component into a single cluster.

You can not control the cov-mode in clust. Two ways are possible to filter results (1) you could filter single columns of the alignment result using filterdb but coverage is no explicit field in the alignment result therefor it is not possible. (2) You could write an awk script which computes coverage and use the apply tool to filter multiple columns. Apply is similar to map from MapReduce. It applies a function on each result entry and writes the results to a new database.