soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

Feature request: ordinal cluster names #405

Open joelb123 opened 3 years ago

joelb123 commented 3 years ago

Currently clusters are named for the gene ID selected as representative of the set. I suspect that for the overwhelming majority of uses, these gene IDs are pretty meaningless, with maybe a code of the organism, chromosome, and position number.

A more meaningful naming scheme is using an ordinal on cluster size with the largest cluster always being number 0. I would suggest the ordering be secondarily on total sequence length in the cluster. Using this scheme saves a lookup of what the biggest clusters are.

I further suggest that the cluster file be sorted by this ordering with the biggest cluster first.

This change may break some existing downstream scripts, so it should probably be optional.

milot-mirdita commented 3 years ago

Could you make a small example with like three clusters and a few members each? I am not sure I really understand this. How would you then identify which ordinal identifier would belong to which sequence?

acpguedes commented 3 years ago

Hi, I do these ordinal names in Python when I run a pipeline that uses MMSeqs2 as a backend. You need to rank the clusters by size, and top-down number it.

This output:

a   a
a   b
a   c
a   d
e   e
e   f
e   h
i   i
i   j

might become something like this:

0   a
0   b
0   c
0   d
1   e
1   f
1   h
2   i
2   j

I hope I help someway.

joelb123 commented 3 years ago

Sorry for the delay in responding. OK, say the cluster file looks is this:

cluster_id gene_id seq_len gene_10234 gene_10234 328 gene_10234 gene_91962 328 gene_00128 gene_00128 146 gene_00128 gene_21573 150 gene_00128 gene_46293 152 gene_37823 gene_37823 180 gene_37823 gene_24298 181 gene_28462 gene_47247 212 gene_47527 gene_36728 142 This file specifies 5 clusters. The largest cluster is named "gene_00128", and as the only cluster of size 3, it will get numbered as cluster 0. There are two clusters of size 2 named "gene_10234" containing 656 characters, and "gene_37823" containing 361 characters, so by the secondary sort key of total characters, cluster "gene_10234" gets numbered as cluster 1 and cluster "gene_37823" is cluster 2. For the remaining singleton clusters, "gene_28462" will become cluster 3 and "gene_47527" becomes cluster 4.

As a table-driven algorithm, do a groupby on cluster_id, calculate cluster size and total sequence length, then use them as primary and secondary sort keys, both in descending order. A groupby on the sorted array becomes the ordinal cluster number.