taynaud / python-louvain

Louvain Community Detection
BSD 3-Clause "New" or "Revised" License

taking into account the node removal cost #17

Closed. altsoph closed this issue 6 years ago.

altsoph commented 7 years ago

In the function __one_level, for each node we iterate over the possible communities. best_com and best_increase are computed to decide the best community for this node. However, best_increase is initialized to 0, which seems incorrect, since removing a node from its community may have a positive or negative effect on modularity. We corrected this according to the original paper by Blondel et al., "Fast unfolding of communities in large networks". In our tests it makes a relatively small difference, but sometimes it is important.
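To make this concrete, here is a minimal sketch of the corrected selection loop. The helpers removal_cost and gain are hypothetical stand-ins for the degree/weight arithmetic the real __one_level performs inline; the point is only where the removal cost enters the comparison:

```python
def best_move(node, com_node, neigh_communities, removal_cost, gain):
    # removal_cost is the modularity change from taking `node` out of
    # its current community com_node (may be positive or negative);
    # gain(node, com, dnc) is the change from inserting it into `com`.
    # Both are illustrative helpers, not the package's actual API.
    best_com, best_increase = com_node, 0   # 0 = net change of staying put
    for com, dnc in neigh_communities.items():
        incr = removal_cost + gain(node, com, dnc)  # total modularity change
        if incr > best_increase:
            best_com, best_increase = com, incr
    return best_com
```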

taynaud commented 7 years ago

Hello,

Thanks for your contribution. I will take a look, but since it changes an important part of the computation, I will need to read the original paper again, and it may take some time before merging.

Best,

altsoph commented 7 years ago

Sure. Feel free to ask anything that could help.

Best regards

regstrtn commented 6 years ago

Hi, this had bothered me too, but I arrived at the conclusion that the decision taken will still be correct. Do correct me if I am wrong. While looking for the best community to move the current node to, the __one_level function examines all the neighbouring communities, and the node is moved to whichever of them yields the largest positive increase in modularity. Two points to note here:

I'd be grateful to anyone who can clear this confusion up.

altsoph commented 6 years ago

At first I thought about it the same way you do. But there are two points that make the difference, I believe:

1) While checking all the neighbouring communities as candidates for addition, we do not necessarily check the original community of our node: there is an `if neighbor != node:` condition inside the __neighcom function, so if the community contains only our node, it will not appear in the neigh_communities dictionary (see the sketch below). This does not seem to be a big problem right now, however.

2) We have the default values best_com = com_node and best_increase = 0, and we look only for increases above zero.
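For reference, a simplified sketch of the __neighcom logic mentioned in point 1 (unweighted case; the real function accumulates edge weights from the graph data rather than counting edges):

```python
# Simplified sketch of __neighcom: collect, for each neighbouring
# community, the number of edges linking `node` to it.
def neighcom(node, graph, node2com):
    weights = {}
    for neighbor in graph[node]:
        if neighbor != node:                  # self-loops are skipped
            com = node2com[neighbor]
            weights[com] = weights.get(com, 0) + 1
    # If `node` is alone in its community, no neighbour belongs to that
    # community, so it never appears in the returned dictionary.
    return weights
```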

So I still think it is possible to have an unexpected situation. The increase in modularity can be seen as the sum of a removal cost and an addition cost; since the removal cost is the same for all candidate communities, the current code compares only the addition costs. But as it stands, it looks only for the best positive addition cost.

Let's assume we have two candidate communities for our node: our current community C1, and C2 (which is better). Now say removing our node from C1 gives us +10 modularity, putting it back into C1 gives -10, and putting it into C2 gives -5. Altogether it is optimal to move our node from C1 to C2 and take +10 - 5 = +5 modularity (since staying in C1 gives a change of 0). But because the current code considers only positive addition costs, and there are none here, we end up back in C1.
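A worked version of that example (illustrative numbers only, not package code):

```python
# The C1/C2 example in code: comparing addition costs alone hides the
# best move once no addition cost is positive.
removal_cost = 10                        # gain from removing the node from C1
addition_cost = {"C1": -10, "C2": -5}    # gain from inserting it into each

# Current logic: compare addition costs against best_increase = 0.
best_com, best_increase = "C1", 0
for com, cost in addition_cost.items():
    if cost > best_increase:
        best_com, best_increase = com, cost
print(best_com)  # "C1": neither -10 nor -5 beats 0, so the node stays put

# With the removal cost included, the net changes are 0 for C1 and +5 for C2.
best_com = max(addition_cost, key=lambda c: removal_cost + addition_cost[c])
print(best_com)  # "C2"
```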

One possible solution to this potential problem would be to initialize best_increase not with 0 but with the actual addition cost for the current cluster (which can be negative when the removal cost is left out). I even think such a solution would be more elegant than the one proposed in this pull request, but I believe the proposed one is clearer and more useful for future debugging of such situations :)
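A sketch of that alternative, reusing the hypothetical gain() helper from the earlier sketch; because the removal cost is constant across candidates, comparing addition costs on this common baseline selects the same community as comparing net changes:

```python
def best_move_alt(node, com_node, neigh_communities, gain):
    # Seed best_increase with the (possibly negative) cost of re-adding
    # the node to its own community, instead of with 0.
    best_com = com_node
    best_increase = gain(node, com_node, neigh_communities.get(com_node, 0))
    for com, dnc in neigh_communities.items():
        incr = gain(node, com, dnc)
        if incr > best_increase:
            best_com, best_increase = com, incr
    return best_com
```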

regstrtn commented 6 years ago

Oh! Got what you are saying. This change does make the code more general and probably closer to what the original algorithm intended.

taynaud commented 6 years ago

Hi, your arguments are sound. You say that sometimes the change is important; do you have some concrete examples (a bit more concrete than the C1/C2 example)?

altsoph commented 6 years ago

Well, I didn't have a concrete example yet. But since you asked, I made a simple random brute-force search for differences, and it found, for example, this one:

graph = [(0, 9), (0, 2), (1, 9), (1, 10), (1, 12), (2, 8), (2, 3), (2, 7), (3, 5), (4, 11), (5, 16), (5, 6), (5, 7), (7, 8), (7, 14), (8, 12), (8, 10), (9, 14), (10, 16), (11, 13), (13, 16), (14, 16), (15, 16)]

The current code gives this solution: [0, 1, 0, 0, 2, 3, 3, 0, 1, 0, 1, 2, 1, 2, 0, 3, 3] with modularity 0.362003780718. The patched code gives this solution: [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 2, 0, 2, 2] with modularity 0.363894139887.
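For anyone who wants to reproduce this, here is a minimal sketch using the package's public API (best_partition and modularity) together with networkx; note that the resulting partition can also vary with the random node ordering, so runs may differ:

```python
# Minimal reproduction sketch for the graph above, using python-louvain
# (imported as `community`) and networkx.
import networkx as nx
import community

edges = [(0, 9), (0, 2), (1, 9), (1, 10), (1, 12), (2, 8), (2, 3),
         (2, 7), (3, 5), (4, 11), (5, 16), (5, 6), (5, 7), (7, 8),
         (7, 14), (8, 12), (8, 10), (9, 14), (10, 16), (11, 13),
         (13, 16), (14, 16), (15, 16)]
graph = nx.Graph(edges)

partition = community.best_partition(graph)    # maps node -> community id
print(community.modularity(partition, graph))  # compare before/after patch
```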

To be honest, there are situations where the current code gives a better solution than the patched code, and situations where it gives a worse one. But the point is that (1) they differ, and (2) the provided patch is closer to the paper :)

taynaud commented 6 years ago

Hello,

Thank you very much. I merged your code. I will do a release with your change in one week, after my holidays.

Best

altsoph commented 6 years ago

Great, thanks!