Closed dirkjanvw closed 6 months ago
Hi, your observation can be explained. Perhaps it points to something that requires better diagnostics from mcl.
▷ mcl A.txt --abc -o - -I 8.4
[mcl] new tab created
[mcl] pid 21685
ite ---------- chaos time hom(avg,lo,hi) m-ie m-ex i-ex fmv
1 .......... 0.49 0.00 0.97/0.78/1.05 0.99 0.99 0.99 100
2 .......... 0.24 0.00 0.79/0.73/0.95 0.99 0.89 0.89 100
3 .......... 0.02 0.00 1.00/1.00/1.02 0.99 0.22 0.20 100
4 .......... 0.02 0.00 1.00/1.00/1.02 0.95 0.95 0.20 100
5 .......... 0.03 0.00 1.01/1.00/1.03 0.95 0.95 0.20 100
6 .......... 0.07 0.00 1.01/1.00/1.05 0.95 0.95 0.20 100
7 .......... 0.22 0.00 1.00/0.98/1.00 0.95 0.95 0.20 100
8 .......... 0.01 0.00 1.00/0.99/1.00 0.95 0.95 0.20 100
9 .......... 0.00 0.00 1.00/1.00/1.00 0.95 0.86 0.18 100
[mcl] cut <8> instances of overlap
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
1236 84764 14234 47276 155554 30514 68751 137873 52557
147494
Note the line that says: [mcl] cut <8> instances of overlap. It is possible to specify keeping overlap:
mcl A.txt --abc -o - -I 8.4 -overlap keep
[mcl] new tab created
[mcl] pid 23556
ite ---------- chaos time hom(avg,lo,hi) m-ie m-ex i-ex fmv
1 .......... 0.49 0.00 0.97/0.78/1.05 0.99 0.99 0.99 100
2 .......... 0.24 0.00 0.79/0.73/0.95 0.99 0.89 0.89 100
3 .......... 0.02 0.00 1.00/1.00/1.02 0.99 0.22 0.20 100
4 .......... 0.02 0.00 1.00/1.00/1.02 0.95 0.95 0.20 100
5 .......... 0.03 0.00 1.01/1.00/1.03 0.95 0.95 0.20 100
6 .......... 0.07 0.00 1.01/1.00/1.05 0.95 0.95 0.20 100
7 .......... 0.22 0.00 1.00/0.98/1.00 0.95 0.95 0.20 100
8 .......... 0.01 0.00 1.00/0.99/1.00 0.95 0.95 0.20 100
9 .......... 0.00 0.00 1.00/1.00/1.00 0.95 0.86 0.18 100
[mcl] kept <8> instances of overlap
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
1236 84764 14234 47276 155554 30514 68751 137873 52557
1236 84764 14234 47276 147494 30514 68751 137873 52557
The difference you see is due to the fact that labels are handed indexes in the order they come in; if overlap is cut then the first cluster gets all the overlap, and the second cluster has all the overlap removed. The order dependency comes in here. You can see that the nodes 155554 and 147494 are exactly the two nodes that are not in overlap.
Overlap is quite rare. I did not create good diagnostics for it - your experience indicates this should perhaps be done. I am curious, what is your application area?
This is a good place to add that mcl clusterings may also induce sub-graphs that are not connected in graphs that have (locally) bi-partite characteristics. By default this is left in place should it happen, but mcl provides options
-force-connected y/n analyze clustering, make sure it induces connected components
-check-connected y/n analyze clustering, see whether it induces connected components
See also here -- this example is from a STRING database network; I have also (long ago) seen networks with locally bipartite characteristics in orthology networks where the best reciprocal hit relationship can have surprising characteristics.
Thanks for the quick reply! We use MCL in our software PanTools for homology grouping of proteins (https://doi.org/10.1186/s12859-018-2362-4) and only recently discovered this issue (but it only occurs in a very small number of cases). I think I understand the issue here, we will dive into our code to see how this can occur. The suggested parameters might be very helpful! I will let you know if we have more questions about it.
I forgot to add the following: Overlap is conjectured to happen only if the input graph has a connected component that is highly symmetric (I strongly believe in this conjecture). This is the reason for stating above that overlap is rare - in most data sciences the input is slightly noisy and/or unevenly sampled and does not have nice symmetries to begin with. The input graph above is fully connected, with a similarity distribution that makes nodes 155554 and 147494 indistinguishable:
155554 edges | 147494 edges
--------------------------------+------------------------------
1236 155554 590941952.65 | 1236 147494 590941952.65
14234 155554 543087837.45 | 14234 147494 543087837.45
30514 155554 590941952.65 | 30514 147494 590941952.65
47276 155554 580015194.63 | 47276 147494 580015194.63
52557 155554 580015194.63 | 52557 147494 580015194.63
68751 155554 580015194.63 | 68751 147494 580015194.63
84764 155554 580015194.63 | 84764 147494 580015194.63
137873 155554 5698635.25 | 137873 147494 5698635.25
147494 155554 244140625.00 | 147494 155554 244140625.00
In the above table you you can see that they have the exact same connection weight for all other nodes. This in itself is not sufficient to create a case of overlap, as generally nodes like this will end up in the same cluster. In this case the weight structure of the edges is such that the mcl process separates (for that particular inflation parameter) the two nodes into different (overlapping) components. This is not necessarily a meaningful separation; inflation 8.4 is very high, so it is a tiny graph exposed to violent forces and it may be the result of a quite specific constellation of similarities.
Just comment if there are further considerations.
Hi, thanks for MCL, we have been using it for years! However, recently I discovered something weird: when I change the order of the input (leaving the similarity and connected nodes identical; thus, an identical graph), I get different output.
An example graph A:
And an example graph B:
I checked that both are identical using
awk '$1<$2{print $1,$2,$3; next;} {print $2,$1,$3;}' ${graph} | sort
. However, these are the results when running MCL (mcl ${graph} --abc -I 8.4 -o ${output}
):Output for graph A:
Whereas the output for graph B is:
Is this expected? I was under the impression that the ABC format specified an undirected graph and thus should have identical output when changing the order of the input.
Edit: I forgot to mention the
sort
in theawk
command above.