neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j
https://github.com/neo4j/graph-data-science/
GNU General Public License v3.0
767 stars 196 forks source link

`0 based indexing` for community labeling? #922

Closed ghost closed 4 years ago

ghost commented 4 years ago

Is there a way to force the community #'s that get used in the results to be more consistent? See how I have the same communities below with different #'s assigned for different runs of the algorithm?

I'm scared to use seed labels param because I don't want to bias anything, especially with a max of 10 iterations.

Maybe there is already some meaning to these #'s I am ignorant of. But I would like them to be 0 indexed [0, 1, 2, 3] so I can iterate over them easily.

Different community #'s in different LPA runs on very similar graphs: image

AliciaFrame commented 4 years ago

Hi @HashRocketSyntax -- label propagation and Louvain modularity aren't deterministic, so you shouldn't expect the same results when you run them multiple times on the same (or especially if they're different graphs). If you're running a community detection algorithm on the same graph over and over, maybe with additional data, we've added seeding specifically so you can retain the same community IDs across different runs. For LPA, the only difference is that we initialize the algorithm with the seed property instead of randomly selected community IDs.

If your primary concern is determinism, we do support deterministic tie breaks in more recent releases of LPA, such that if the votes are equal, we default to the lower numbered community (instead of randomly selecting) - just make sure you're running on the most recent version of the library (currently 3.5.14). Hopefully this addresses your concerns about reproducibility.

We currently only support consecutive integer IDs as an option for weakly connected components (union find). Adding consecutive IDs for our other community detection algorithms is on our roadmap, but we don't have timelines just yet.

Note that on different graphs (even if they're similar) you shouldn't expect the same community assignments unless you're using seeds, and even then the seeds are used to influence the results, but the algorithms still identify the communities that exist in your data.

ghost commented 4 years ago

My primary concern is not determinism. I understand that, just like when running an ML/NN algorithm, I can get different results each time. But at least those algorithms predict classes 0, 1, 2.

My concern is running a graph alg on thousands of identical graphs where my communities are expected to be extremely (95%) similar, and then having to figure out which community was predicted after the fact.

Sometimes it's 4,5,6 or 81,82,83 or 18,19,20. Why can't the first community assigned be 0 then next be 1 and the next be 2?

ghost commented 4 years ago

In doing it myself, I can now recommend an easy fix. Let's say you have 88,89,90.

You just find the minimum community number assigned. Then you subtract it from every community assigned. 88-88 = 0, 89-88 = 1, 90-88 = 2.