Open joelb123 opened 3 years ago
Could you make a small example with like three clusters and a few members each? I am not sure I really understand this. How would you then identify which ordinal identifier would belong to which sequence?
Hi, I do these ordinal names in Python when I run a pipeline that uses MMSeqs2 as a backend. You need to rank the clusters by size, and top-down number it.
This output:
a a
a b
a c
a d
e e
e f
e h
i i
i j
might become something like this:
0 a
0 b
0 c
0 d
1 e
1 f
1 h
2 i
2 j
I hope I help someway.
Sorry for the delay in responding. OK, say the cluster file looks is this:
cluster_id gene_id seq_len gene_10234 gene_10234 328 gene_10234 gene_91962 328 gene_00128 gene_00128 146 gene_00128 gene_21573 150 gene_00128 gene_46293 152 gene_37823 gene_37823 180 gene_37823 gene_24298 181 gene_28462 gene_47247 212 gene_47527 gene_36728 142
This file specifies 5 clusters. The largest cluster is named "gene_00128", and as the only cluster of size 3, it will get numbered as cluster 0. There are two clusters of size 2 named "gene_10234" containing 656 characters, and "gene_37823" containing 361 characters, so by the secondary sort key of total characters, cluster "gene_10234" gets numbered as cluster 1 and cluster "gene_37823" is cluster 2. For the remaining singleton clusters, "gene_28462" will become cluster 3 and "gene_47527" becomes cluster 4.
As a table-driven algorithm, do a groupby on cluster_id, calculate cluster size and total sequence length, then use them as primary and secondary sort keys, both in descending order. A groupby on the sorted array becomes the ordinal cluster number.
Currently clusters are named for the gene ID selected as representative of the set. I suspect that for the overwhelming majority of uses, these gene IDs are pretty meaningless, with maybe a code of the organism, chromosome, and position number.
A more meaningful naming scheme is using an ordinal on cluster size with the largest cluster always being number 0. I would suggest the ordering be secondarily on total sequence length in the cluster. Using this scheme saves a lookup of what the biggest clusters are.
I further suggest that the cluster file be sorted by this ordering with the biggest cluster first.
This change may break some existing downstream scripts, so it should probably be optional.