rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

More repeatability questions for cugraph.leiden #4373

Closed ChuckHastings closed 6 months ago

ChuckHastings commented 7 months ago

Follow-up from issue #4072.

          @ChuckHastings, thank you for working on this! I've just tested with version `24.06.00a43`. The issue is resolved completely when the adjacency matrix is `float64`, but there is still some variability, although much less, when it's `float32` and the graph is large (75,000 nodes). I'm attaching the zipped [npz file](https://github.com/rapidsai/cugraph/files/15093902/adjacency_75000.zip) of the adjacency matrix. Here are the results I'm getting from 10 runs, as before:
Min. modularity (float32): 0.8989160060882568
Mean modularity (float32): 0.8994227409362793
Max. modularity (float32): 0.9014488458633423

Min. partition count (float32): 67
Max. partition count (float32): 68

[[1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan 0.8200713715 0.8200713715 0.8200713715 1.0000000000 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan          nan 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan          nan          nan          nan          nan 1.0000000000]]

Min. adjusted Rand index (float32): 0.8200713715385887
Mean adjusted Rand index (float32): 0.9360253765470538
Max. adjusted Rand index (float32): 1.0

Since this is probably a different problem from the dendrogram flattening that you fixed, should I open a new issue? Or can you reopen this one?

Originally posted by @jpintar in https://github.com/rapidsai/cugraph/issues/4072#issuecomment-2074822898
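
For reference, the repeat-run comparison described above can be reproduced along these lines. This is a minimal sketch, not the original poster's script: the file name comes from the attachment above, and converting the scipy sparse adjacency matrix to an edge list is just one reasonable way to build the graph.

```python
import cudf
import cugraph
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import adjusted_rand_score

# Load the attached adjacency matrix and convert it to an edge list.
adj = sp.load_npz("adjacency_75000.npz").tocoo()
edges = cudf.DataFrame({
    "src": adj.row,
    "dst": adj.col,
    "weight": adj.data.astype(np.float32),  # float32 exhibits the variability
})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="weight")

modularities, labelings = [], []
for _ in range(10):
    parts, modularity = cugraph.leiden(G)
    modularities.append(modularity)
    # Sort by vertex id so the label vectors from different runs align.
    labelings.append(parts.sort_values("vertex")["partition"].to_numpy())

print(min(modularities), np.mean(modularities), max(modularities))
# Pairwise adjusted Rand index between two runs, as in the matrix above.
print(adjusted_rand_score(labelings[0], labelings[1]))
```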

ChuckHastings commented 7 months ago

A quick first observation: I have run this several times. With 64-bit weights I always see everything align (as you indicate in your post). With 32-bit weights I see some variability: sometimes I get matching answers, sometimes I get differences. So the 32-bit behavior is still non-deterministic.

I wonder if floating-point rounding errors are causing different choices in the clustering decisions. I will attempt to isolate why we are seeing these differences.

ChuckHastings commented 7 months ago

I have confirmed that numerical instability is the cause of the repeatability issues with 32-bit weights.

I would note that with a large enough graph you would probably see the same effect with 64-bit weights as well, although the graph would need to be quite large, I would think.

Our implementation of Leiden/Louvain executes its computations in parallel, so there is no guarantee that the order in which floating-point numbers are added will be consistent from call to call. If a cluster weight is on the order of 1e-5 and an edge weight is on the order of 1e-11, the order in which the additions occur can subtly change the result. Here is an example I saw in this data:

  1. We computed the delta modularity for moving vertex v into cluster c1 as 5.08771e-5, and the delta modularity for moving it into cluster c2 as 5.08771e-5 (with c1 < c2): an exact tie. Our tie-breaking procedure picks the smaller cluster id, so we put vertex v into cluster c1.
  2. In a separate run, the computation of the delta modularity for c2 still gives 5.08771e-5, but for c1 we get 5.08770e-5. The additions must have included one of the larger values before some of the much smaller ones, and because we were working with a 32-bit float the contribution of the smaller values was lost. Now the delta modularity for c2 is larger than for c1, so we move vertex v into cluster c2 instead.

A single flipped decision like this changes the entire subsequent sequence of decisions in the greedy Louvain/Leiden algorithm, so we end up with a different (sometimes radically different) clustering... but still a pretty good modularity score.
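
To see the mechanism in isolation, here is a minimal NumPy sketch of order-dependent float32 summation. The magnitudes are illustrative, chosen so the loss shows up after just a few additions; they are not taken from the cuGraph internals.

```python
import numpy as np

# The large term plays the role of a cluster weight, the small terms the
# role of tiny edge-weight contributions. 4e-13 is below half an ULP of
# 1e-5 in float32, so a single small term vanishes when added to the large one.
big = np.float32(1e-5)
small = np.float32(4e-13)

# Order 1: large value first. Each small addend rounds away, one at a time.
s1 = big
for _ in range(8):
    s1 += small          # float32 + float32 stays float32

# Order 2: small values first. They accumulate to about 3.2e-12, which is
# several ULPs of 1e-5 and survives the final addition.
s2 = np.float32(0.0)
for _ in range(8):
    s2 += small
s2 += big

print(s1 == s2)  # False: different summation orders, different results
```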

We can look into changes to address this, but it would definitely be a longer-term research problem.

If repeatability is a priority for you, I would suggest using 64-bit doubles, which should dramatically reduce the chance of instability in the computation.
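
For example, a minimal sketch of that workaround, with illustrative column names and data; the `float64` cast before graph construction is the only essential part:

```python
import cudf
import cugraph

# Illustrative edge list; the column names and values are placeholders.
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2],
    "dst": [1, 2, 2, 3],
    "weight": [0.5, 0.25, 1.0, 0.75],
})

# Cast the weights to 64-bit before constructing the graph so that Leiden
# accumulates its modularity deltas in double precision.
edges["weight"] = edges["weight"].astype("float64")

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="weight")

parts, modularity = cugraph.leiden(G)
```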

jpintar commented 7 months ago

Thank you for looking into this!