micans / mcl

MCL, the Markov Cluster algorithm, also known as Markov Clustering, is a method and program for clustering weighted or simple networks, a.k.a. graphs.
https://micans.org/mcl
Other
86 stars 12 forks source link

Inconsistent result with identical graphs #28

Closed dirkjanvw closed 6 months ago

dirkjanvw commented 6 months ago

Hi, thanks for MCL, we have been using it for years! However, recently I discovered something weird: when I change the order of the input (leaving the similarity and connected nodes identical; thus, an identical graph), I get different output.

An example graph A:

1236 84764 252155497.24
1236 14234 221358383.20
1236 47276 252155497.24
1236 155554 590941952.65
1236 147494 590941952.65
1236 30514 244140625.00
1236 68751 252155497.24
1236 137873 27118504.64
1236 52557 252155497.24
14234 52557 228746602.32
14234 155554 543087837.45
14234 137873 27118504.64
14234 68751 228746602.32
14234 147494 543087837.45
14234 47276 228746602.32
14234 30514 221358383.20
14234 84764 228746602.32
30514 84764 252155497.24
30514 155554 590941952.65
30514 47276 252155497.24
30514 137873 27118504.64
30514 147494 590941952.65
30514 52557 252155497.24
30514 68751 252155497.24
47276 155554 580015194.63
47276 84764 244140625.00
47276 147494 580015194.63
47276 52557 244140625.00
47276 137873 26284829.57
47276 68751 244140625.00
52557 68751 244140625.00
52557 147494 580015194.63
52557 137873 26284829.57
52557 84764 244140625.00
52557 155554 580015194.63
68751 137873 26284829.57
68751 147494 580015194.63
68751 155554 580015194.63
68751 84764 244140625.00
84764 147494 580015194.63
84764 137873 26284829.57
84764 155554 580015194.63
137873 147494 5698635.25
137873 155554 5698635.25
147494 155554 244140625.00

And an example graph B:

1236 137873 27118504.64
1236 14234 221358383.20
1236 147494 590941952.65
1236 155554 590941952.65
1236 30514 244140625.00
1236 47276 252155497.24
1236 52557 252155497.24
1236 68751 252155497.24
1236 84764 252155497.24
137873 147494 5698635.25
137873 155554 5698635.25
14234 137873 27118504.64
14234 147494 543087837.45
14234 155554 543087837.45
14234 30514 221358383.20
14234 47276 228746602.32
14234 52557 228746602.32
14234 68751 228746602.32
14234 84764 228746602.32
147494 155554 244140625.00
30514 137873 27118504.64
30514 147494 590941952.65
30514 155554 590941952.65
30514 47276 252155497.24
30514 52557 252155497.24
30514 68751 252155497.24
30514 84764 252155497.24
47276 137873 26284829.57
47276 147494 580015194.63
47276 155554 580015194.63
47276 52557 244140625.00
47276 68751 244140625.00
47276 84764 244140625.00
52557 137873 26284829.57
52557 147494 580015194.63
52557 155554 580015194.63
52557 68751 244140625.00
52557 84764 244140625.00
68751 137873 26284829.57
68751 147494 580015194.63
68751 155554 580015194.63
68751 84764 244140625.00
84764 137873 26284829.57
84764 147494 580015194.63
84764 155554 580015194.63

I checked that both are identical using awk '$1<$2{print $1,$2,$3; next;} {print $2,$1,$3;}' ${graph} | sort. However, these are the results when running MCL (mcl ${graph} --abc -I 8.4 -o ${output}):

Output for graph A:

1236    84764   14234   47276   155554  30514   68751   137873  52557
147494

Whereas the output for graph B is:

1236    137873  14234   147494  30514   47276   52557   68751   84764
155554

Is this expected? I was under the impression that the ABC format specified an undirected graph and thus should have identical output when changing the order of the input.

Edit: I forgot to mention the sort in the awk command above.

micans commented 6 months ago

Hi, your observation can be explained. Perhaps it points to something that requires better diagnostics from mcl.

 ▷  mcl A.txt --abc -o - -I 8.4
[mcl] new tab created
[mcl] pid 21685
 ite ----------  chaos  time hom(avg,lo,hi) m-ie m-ex i-ex fmv
  1  ..........   0.49  0.00 0.97/0.78/1.05 0.99 0.99 0.99 100
  2  ..........   0.24  0.00 0.79/0.73/0.95 0.99 0.89 0.89 100
  3  ..........   0.02  0.00 1.00/1.00/1.02 0.99 0.22 0.20 100
  4  ..........   0.02  0.00 1.00/1.00/1.02 0.95 0.95 0.20 100
  5  ..........   0.03  0.00 1.01/1.00/1.03 0.95 0.95 0.20 100
  6  ..........   0.07  0.00 1.01/1.00/1.05 0.95 0.95 0.20 100
  7  ..........   0.22  0.00 1.00/0.98/1.00 0.95 0.95 0.20 100
  8  ..........   0.01  0.00 1.00/0.99/1.00 0.95 0.95 0.20 100
  9  ..........   0.00  0.00 1.00/1.00/1.00 0.95 0.86 0.18 100
[mcl] cut <8> instances of overlap
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
1236    84764   14234   47276   155554  30514   68751   137873  52557
147494

Note the line that says: [mcl] cut <8> instances of overlap. It is possible to specify keeping overlap:

mcl A.txt --abc -o - -I 8.4 -overlap keep
[mcl] new tab created
[mcl] pid 23556
 ite ----------  chaos  time hom(avg,lo,hi) m-ie m-ex i-ex fmv
  1  ..........   0.49  0.00 0.97/0.78/1.05 0.99 0.99 0.99 100
  2  ..........   0.24  0.00 0.79/0.73/0.95 0.99 0.89 0.89 100
  3  ..........   0.02  0.00 1.00/1.00/1.02 0.99 0.22 0.20 100
  4  ..........   0.02  0.00 1.00/1.00/1.02 0.95 0.95 0.20 100
  5  ..........   0.03  0.00 1.01/1.00/1.03 0.95 0.95 0.20 100
  6  ..........   0.07  0.00 1.01/1.00/1.05 0.95 0.95 0.20 100
  7  ..........   0.22  0.00 1.00/0.98/1.00 0.95 0.95 0.20 100
  8  ..........   0.01  0.00 1.00/0.99/1.00 0.95 0.95 0.20 100
  9  ..........   0.00  0.00 1.00/1.00/1.00 0.95 0.86 0.18 100
[mcl] kept <8> instances of overlap
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
1236    84764   14234   47276   155554  30514   68751   137873  52557
1236    84764   14234   47276   147494  30514   68751   137873  52557

The difference you see is due to the fact that labels are handed indexes in the order they come in; if overlap is cut then the first cluster gets all the overlap, and the second cluster has all the overlap removed. The order dependency comes in here. You can see that the nodes 155554 and 147494 are exactly the two nodes that are not in overlap.

Overlap is quite rare. I did not create good diagnostics for it - your experience indicates this should perhaps be done. I am curious, what is your application area?

This is a good place to add that mcl clusterings may also induce sub-graphs that are not connected in graphs that have (locally) bi-partite characteristics. By default this is left in place should it happen, but mcl provides options

-force-connected y/n    analyze clustering, make sure it induces connected components
-check-connected y/n    analyze clustering, see whether it induces connected components

See also here -- this example is from a STRING database network; I have also (long ago) seen networks with locally bipartite characteristics in orthology networks where the best reciprocal hit relationship can have surprising characteristics.

dirkjanvw commented 6 months ago

Thanks for the quick reply! We use MCL in our software PanTools for homology grouping of proteins (https://doi.org/10.1186/s12859-018-2362-4) and only recently discovered this issue (but it only occurs in a very small number of cases). I think I understand the issue here, we will dive into our code to see how this can occur. The suggested parameters might be very helpful! I will let you know if we have more questions about it.

micans commented 6 months ago

I forgot to add the following: Overlap is conjectured to happen only if the input graph has a connected component that is highly symmetric (I strongly believe in this conjecture). This is the reason for stating above that overlap is rare - in most data sciences the input is slightly noisy and/or unevenly sampled and does not have nice symmetries to begin with. The input graph above is fully connected, with a similarity distribution that makes nodes 155554 and 147494 indistinguishable:

155554 edges                    |   147494 edges
--------------------------------+------------------------------
1236     155554   590941952.65  |  1236    147494  590941952.65
14234    155554   543087837.45  |  14234   147494  543087837.45
30514    155554   590941952.65  |  30514   147494  590941952.65
47276    155554   580015194.63  |  47276   147494  580015194.63
52557    155554   580015194.63  |  52557   147494  580015194.63
68751    155554   580015194.63  |  68751   147494  580015194.63
84764    155554   580015194.63  |  84764   147494  580015194.63
137873   155554     5698635.25  |  137873  147494    5698635.25
147494   155554   244140625.00  |  147494  155554  244140625.00

In the above table you you can see that they have the exact same connection weight for all other nodes. This in itself is not sufficient to create a case of overlap, as generally nodes like this will end up in the same cluster. In this case the weight structure of the edges is such that the mcl process separates (for that particular inflation parameter) the two nodes into different (overlapping) components. This is not necessarily a meaningful separation; inflation 8.4 is very high, so it is a tiny graph exposed to violent forces and it may be the result of a quite specific constellation of similarities.

micans commented 6 months ago

Just comment if there are further considerations.