vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0
596 stars 78 forks source link

order of nodes affect the Leiden results? #71

Closed geowcz closed 2 years ago

geowcz commented 3 years ago

Recently I used the Leiden algorithm on a very large weighted network, when I used the "from_networkx", it could generate an expected result but this function was very slow to convert the network format, so I switched to "TupleList", it was faster but the Leiden algorithm can't generate the same result as I expected. I debugged the code and found the order of nodes affects the results. Can I ask how could I deal with this issue? I was trying to reorder the node of the graph but have not found the solutions, any thoughts or comments here?

vtraag commented 3 years ago

Yes, the order of the nodes affects the results. This is because the algorithm itself is stochastic in nature, and in fact, it randomizes the node ordering. Hence, if the graph has differently ordered nodes, the results might be different. If the partition is relatively clear, most of the changes are likely to happen on the boundaries between different clusters, although some splits/mergers of clusters can also happen. This is something inherent to the approach, which cannot be changed. Obviously, if you use the same node ordering, and the same random seed, you will get the same result, but that does not "solve" the issue.

I am not sure if I can provide any more useful feedback. If you have a specific question, I might provide more detailed answer. Perhaps you are constructing the graph in a different way than in using from_networkx? If you find the from_networkx to be too slow, the python-igraph developers (of which I am also one) would be happy to look into this. Perhaps you can open an issue there and report speed differences?

geowcz commented 3 years ago

Thank you for the comments. For the node order, Can I ask how could I ensure I can generate a reasonable result? When I used the algorithm after I posted the question, I had one or two times to get a very random partition result that can't reveal the "real" community structure of my network.

In terms of the network construction, when I used "from_networkx", "TupleList", and "creating a graph by adding vertices and edges" to load a network that is composed of around 33,000 nodes and 11.3 million edges, they took around 14.7 minutes, 0.82 seconds, and 0.14 seconds acoordingly. Also I found the "from_networkx" can load all my nodes no matter the nodes have edges or not, but "TupleList" can only load the nodes that have edges, and the third way can load all nodes in ascending order. I want to ask can the Leidenalg handle the nodes without any edges? My testing is yes but I just want to confirm it again.

Here are my codes: "from_networkx":

G1 = ig.Graph.from_networkx(G)
prepartition = la.find_partition(G1, la.RBConfigurationVertexPartition, None, G1.es["weight"], 20, 0, 1, resolution_parameter = inputResolution)
partition_dict = {}
for name, membership in zip(G1.vs["_nx_name"], prepartition.membership):
    partition_dict[int(name)] = int(membership)

"TupeList":

G1 = ig.Graph.TupleList(edges, directed=False, vertex_name_attr="name", edge_attrs= ["weight", "estTime"], weights= False)
prepartition = la.find_partition(G1, la.RBConfigurationVertexPartition, None, G1.es["weight"], 20, 0, 1, resolution_parameter = inputResolution)
for name, membership in zip(G1.vs["name"], prepartition.membership):
    partition_dict[int(name)] = int(membership)

"the third method":

G1 = ig.Graph(directed = False)
G1.add_vertices(list(set(G.nodes)))
G1.vs["name"] = list(set(G.nodes))
G1.add_edges([(x, y) for (x, y, z, w) in edges])
G1.es['weight'] = [z for (x, y, z, w) in edges]
G1.es['estTime'] = [w for (x, y, z, w) in edges]
prepartition = la.find_partition(G1, la.RBConfigurationVertexPartition, None, G1.es["weight"], 20, 0, 1, resolution_parameter = inputResolution)
for name, membership in zip(G1.vs["name"], prepartition.membership):
    partition_dict[int(name)] = int(membership)
vtraag commented 3 years ago

For the node order, Can I ask how could I ensure I can generate a reasonable result? When I used the algorithm after I posted the question, I had one or two times to get a very random partition result that can't reveal the "real" community structure of my network.

This is difficult to answer without any more details. It sounds a bit strange if you got completely bogus results in one run and quite clear results in another run. Perhaps something else went wrong with node indexing or something similar?

when I used "from_networkx", "TupleList", and "creating a graph by adding vertices and edges" to load a network that is composed of around 33,000 nodes and 11.3 million edges, they took around 14.7 minutes, 0.82 seconds, and 0.14 seconds acoordingly.

OK, this is a very substantial difference that should be addressed. Could you please open an issue for this upstream in python-igraph

can the Leidenalg handle the nodes without any edges?

Yes, it can, but of course nodes without any edges do not affect the result in any way

vtraag commented 3 years ago

when I used "from_networkx", "TupleList", and "creating a graph by adding vertices and edges" to load a network that is composed of around 33,000 nodes and 11.3 million edges, they took around 14.7 minutes, 0.82 seconds, and 0.14 seconds acoordingly.

OK, this is a very substantial difference that should be addressed. Could you please open an issue for this upstream in python-igraph

Actually, now that I am looking at your code in more detail, I see that you are using the edges variable directly when creating the graph based on TupleList. Perhaps most of the inefficiency actually stems from networkx, since it does not need to be accessed when you provide the edges variable directly.

geowcz commented 3 years ago

Great, thank you so much for your feedback, I will test it to see whether I will get some random results and then report it in this thread or a new one.

Can I ask if the nodes without any edges are added for partition by the Leiden algorithm, will the nodes be automatically assigned a partitioned index based on its index in the network? Say I have a network of nodes 1, 2, 3, 4,..., and 10. The nodes 4 and 6 do not have any edges, when I use the Leiden algorithm, it can detect four communities 0, 1, 2, and 3. Communities 0 and 1 contain nodes with edges while communities 2 and 3 contain nodes without edges. Will the partition look-alike this way? 1 -> 0 2 ->1 3 ->1 4 ->2 5 ->0 6 ->3 7 ->0 8 ->0 9 ->1 10 ->0

I agree with you. The edges variable was used after I found the "from_networkx" was very slow to handle such a large network and the "TupeList" seemed to work in a similar way. But the TupleList can't contain nodes without edges, so I used the third way. The slow efficiency perhaps comes from the mechanism of defining the "from_networkx" function. I will go to open the issue in python-igraph.

vtraag commented 2 years ago

Can I ask if the nodes without any edges are added for partition by the Leiden algorithm, will the nodes be automatically assigned a partitioned index based on its index in the network? Say I have a network of nodes 1, 2, 3, 4,..., and 10. The nodes 4 and 6 do not have any edges, when I use the Leiden algorithm, it can detect four communities 0, 1, 2, and 3. Communities 0 and 1 contain nodes with edges while communities 2 and 3 contain nodes without edges. Will the partition look-alike this way? 1 -> 0 2 ->1 3 ->1 4 ->2 5 ->0 6 ->3 7 ->0 8 ->0 9 ->1 10 ->0

I'm not 100% if I understand the question correctly. If the question is whether nodes without any edges will retain their initial membership, then no, that will not necessarily be the case. In general, the community label is based on the community size, where the largest community is labeled 0, the second-largest community is labeled 1, et cetera. Singleton communities will hence be the last, but the index among the various singleton communities may vary. Whether node 4 is labelled 2 or 3, or node 6 is labelled 2 or 3, is hence not sure.

I'll close this issue for now. If you have any other specific question, feel free to open a new issue.

geowcz commented 2 years ago

I see. Thank you for the explanation.