vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0
575 stars 77 forks source link

Incremental partitioning #127

Closed aigora-de closed 1 year ago

aigora-de commented 1 year ago

Hi Vincent, not so much an issue as a request for help or advice about the correct use of leidenalg.

The example python code pasted below shows what I'm trying to do; I have a graph g which I do an initial partitioning from scratch. From time to time, new nodes and edges are added (show here as g_1). I want to update the clustering using the new nodes and edges, but I'd like to preserve (most of) the original partitioning. I'm using leidenalg.ModularityVertexPartition.

In the first place, I'm setting the RNG seed to 42 and you can see in my example that I get deterministic outputs for the same input (graph g).

I then update g to g_1 with some new nodes and edges and try three cases:

  1. Partition g_1 without providing an initial_membership parameter.
  2. Partition g_1 with an initial_membership from the original partitioning on g.
  3. Partition g_1 using a fixed membership for the nodes/clusters from the original nodes / clustering on g

For my use case, I'd like the community ID to persist across updates and I only want to update the communities / nodes that have changed, or been added. Otherwise, I may as well do a completely new partitioning of the graph after each update.

In my example code, the function check_communities() prints a mapping of common and changed ocmmunities. I could use this to update the changed communities with their new IDs, but this is inefficient for my real world graphs which have hundreds of thousands of nodes.

There isn't much difference between case 1 and case 2 above (without and with initial_membership respectively), which begs the question, 'why provide initial_membership if it's mostly ignored?'

Case 3 does much better at preserving the community IDs, but at the expense of also preserving the singletons. I suppose I could fix communities of a certain minimum soze, but what arbitrary size?

In conclusion, I think I must be approaching this in the wrong way, as I can't believe others haven't had this problem and solved it. Should I be using a layered graph?

Many thanks for any help or advice you can offer.

Best wishes

Dave


import igraph
import random
import leidenalg as la

# Set the RNG seed to a fixed value
rng_seed = 42
random.seed(rng_seed)

def check_communities(initial_membership, new_membership):
    common_communities = {}
    c1 = set(initial_membership)
    for c in c1:
        cm_1 = [i for i, x in enumerate(initial_membership) if x == c]
        cm_2 = [new_membership[x] for x in cm_1]
        if len(set(cm_2)) == 1:
            l3 = [i for i, x in enumerate(new_membership) if x == cm_2[0]]
            if cm_1 == l3:
                common_communities[c] = list(set(cm_2))[0]

    print(f"Common communities between `g` and `g_1`: {common_communities}")
    changed_communities = c1.difference(common_communities.keys())
    print(f"Changed communities between `g` and `g_1`: {changed_communities}")

def fixed_membership_partition(g, initial_membership):
    _communities = la.ModularityVertexPartition(
        graph=g,
        initial_membership=initial_membership,
    )
    # Get the max community_id, which is the temporary ID assigned to new nodes
    max_community_id = max(initial_membership)
    # Fix community_id for nodes in initial_membership
    fixed_membership = [x for x in initial_membership if x < max_community_id]
    # Build a list of community IDs that should be fixed (next two lines)
    len_fixed_membership = len(fixed_membership)
    is_membership_fixed = [i < len_fixed_membership for i in range(g_1.vcount())]
    # Keeping old communities fixed, allocate new nodes to communities (next 11 lines)
    optimiser = la.Optimiser()
    optimiser.set_rng_seed(rng_seed)
    # diff measures the improvement in the quality function. When diff == 0, no further improvement can be made.
    diff = 1
    while diff > 0:
        # iterate the optimiser until no improvements can be made
        diff = optimiser.optimise_partition(
            _communities,
            is_membership_fixed=is_membership_fixed,
        )
    return _communities

# Create an empty graph with 257 vertices
g = igraph.Graph(n=257, directed=False)
# Add random edges to the graph
for i in range(257):
    for j in range(i + 1, 257):
        if random.random() < 0.01:
            g.add_edge(i, j)
# Print the graph summary
print(f"graph `g`: {g.summary()}")

communities_1 = la.find_partition(g, partition_type=la.ModularityVertexPartition, seed=rng_seed)
communities_2 = la.find_partition(g, partition_type=la.ModularityVertexPartition, seed=rng_seed)
print(communities_1)
print(f"original graph, g:")
print(f"communities_1 == communities_2: {communities_1.membership == communities_2.membership}")

# Make a copy of the original graph
g_1 = g.copy()
# Add 5 new nodes to the end of the vertex list
for i in range(257, 262):
    g_1.add_vertex()
# Connect the new nodes randomly across the whole extended graph
for i in range(257, 262):
    for j in range(i):
        if j < 257:
            p = 0.1  # Probability of connecting to an old node
        else:
            p = 0.3  # Probability of connecting to another new node
        if random.random() < p:
            g_1.add_edge(i, j)
# Print the extended graph summary
print(f"graph `g_1`: {g_1.summary()}")

next_comm_id = max(communities_1.membership)
extent = g_1.vcount() - g.vcount()
initial_membership = communities_1.membership
initial_membership.extend([next_comm_id] * extent)

# First, partition with no hint from initial_membership
communities_3 = la.find_partition(g_1, partition_type=la.ModularityVertexPartition, seed=rng_seed)
communities_4 = la.find_partition(g_1, partition_type=la.ModularityVertexPartition, seed=rng_seed)
print(communities_3)
print(f"updated graph, g_1 (without `initial_membership`):")
print(f"communities_3 == communities_4: {communities_3.membership == communities_4.membership}")

new_membership = communities_3.membership
check_communities(initial_membership, new_membership)

# Now partition with initial_membership
communities_5 = la.find_partition(g_1, partition_type=la.ModularityVertexPartition,
                                  initial_membership=initial_membership, seed=rng_seed)
communities_6 = la.find_partition(g_1, partition_type=la.ModularityVertexPartition,
                                  initial_membership=initial_membership, seed=rng_seed)
print(communities_5)
print(f"updated graph, g_1 (with `initial_membership`):")
print(f"communities_5 == communities_6: {communities_5.membership == communities_6.membership}")

new_membership = communities_5.membership
check_communities(initial_membership, new_membership)

# Partition with fixed membership
communities_7 = fixed_membership_partition(g_1, initial_membership=initial_membership)
communities_8 = fixed_membership_partition(g_1, initial_membership=initial_membership)
print(communities_7)
print(f"updated graph, g_1 (with fixed membership):")
print(f"communities_7 == communities_8: {communities_7.membership == communities_8.membership}")
new_membership = communities_7.membership
check_communities(initial_membership, new_membership)
aigora-de commented 1 year ago

Hi Vincent,

Sorry for not responding here sooner, but following our chat the solution became clear!

In my original post, case 3 fixed nodes already allocated to a community in the first run, leaving new nodes free to be partitioned as appropriate. (Obvious in hindsight, but) the solution was to leave new nodes free (as before), and additionally "unfix" old nodes iff they now have a relationship with one or more new nodes.

Since new nodes and their relationships are typically a very small proportion of the total graph, this solution is working well.

Again, many thanks for taking the time to consider my request for help.

vtraag commented 1 year ago

Great to see that the proposed solution works well for your case!