nicolasdugue / DirectedLouvain

GNU General Public License v2.0
38 stars 17 forks source link

Self-loops not considered in d^in (absolute modularity is off) #11

Open Splines opened 1 year ago

Splines commented 1 year ago

Hello Nicolas Dugué and Anthony Perez,

first of all thanks for making this great project open source, it really helps to understand the inner workings of the Louvain algorithm.


I just tried to put in one undirected graph into the C++ Louvain implementation by Blondel et al. (see the "newversion" tab on the left side of that website) and then the same graph into your directed Louvain algorithm. However, the initial modularity value $Q$ differs between the original Louvain and your directed Louvain implementation. While this may not be a big problem as the greedy optimization only needs to compare relative modularity values, it still might have some unwanted side effects.

In order explain what I mean exactly, let's consider the following graph I randomly came up with where the blue numbers denote the weights for the undirected edges.

Reproducing the issue

With the original Louvain code We can represent the graph using this `graph.txt` file for the original Louvain algorithm: ``` 0 0 3.0 0 1 1.0 1 2 5.0 2 3 2.5 3 1 7.0 3 3 1.0 ``` Download the [source code](https://sites.google.com/site/findcommunities/newversion) and put the `graph.txt` file in the root of the project. Open the file `community.cpp` and inside the `double Community::modularity()` method, add a printout before the `return q` statement like so: ```c++ std::cout << q << std::endl; ``` Then run: ``` make ./convert -i ./graph.txt -o ./graph.bin -w graph.weights ./community ./graph.bin -w graph.weights -v ``` You should observe this modularity value in the console: **`-0.172653`**.
With the DirectedLouvain code Next, we will do the same thing with your directed Louvain algorithm. ``` git clone https://github.com/nicolasdugue/DirectedLouvain.git cd ./DirectedLouvain ``` I manually converted the undirected graph to a directed graph so that DirectedLouvain can handle it like so: ``` 0 0 3.0 0 1 1.0 1 0 1.0 1 2 5.0 2 1 5.0 2 3 2.5 3 2 2.5 3 1 7.0 1 3 7.0 3 3 1.0 ``` Put this into the file `graph/graph.txt`. Then go to the file `community.cpp` and inside the `double Community::modularity()` method, add a printout before the `return q` statement like so: ```c++ cerr << "Modularity is: " << q << std::endl; ``` Then run: ``` make ./bin/community -f graph/graph.txt -w -v ``` You should see (among others) this line on the console ``` Modularity is: -0.154286 ```

We get a different modularity value (-0.154286 vs. -0.172653) when every node is in its own singleton community at the very beginning.

Verifying the correct solution by hand

Let's see which of these solutions is correct by carrying out the math by hand.

See details We have our adjacency matrix for this graph given by: ```math A = \begin{pmatrix} 3 & 1 & 0 & 0\\ 1 & 0 & 5 & 7\\ 0 & 5 & 0 & 2.5\\ 0 & 7 & 2.5 & 1 \end{pmatrix} ``` with the initial partition being: ```math \mathcal{C} = \{ c_1, c_2, c_3, c_4 \} = \Bigl\{ \{0\}, \{1\}, \{2\}, \{3\}, \{4\} \Bigr\} ``` For our total weighted number of edges, we get: ```math m = \frac{1}{2} \sum_{u,v} A_{uv} = \frac{1}{2} (3 + 2 + 10 + 14 + 5 + 1) = 17.5 ``` Now, we can calculate modularity using the definition for the vertex degree from [here](https://arxiv.org/pdf/cond-mat/0407503.pdf) (*equation 5*): ```math k_u = \sum_{v} A_{uv} ``` ```math \begin{align} Q(\mathcal{C}) &\coloneqq \frac{1}{2m} \sum_{c\in C} \sum_{u\in c} \sum_{v\in c} \biggl( \underbrace{A_{uv} - \frac{k_u k_v}{2m}}_{\eqqcolon \phi} \biggr)\\ &= \frac{1}{2m} \biggl[ \, \sum_{u,v \in c_1} \phi + \sum_{u,v \in c_2} \phi + \sum_{u,v \in c_3} \phi + \sum_{u,v \in c_4} \phi \biggr]\\ % &= \frac{1}{2m} \biggl[ \Bigl( A_{00} - \frac{k_0 k_0}{2m} \Bigr) + \Bigl( A_{11} - \frac{k_1 k_1}{2m} \Bigr) + \Bigl( A_{22} - \frac{k_2 k_2}{2m} \Bigr) + \Bigl( A_{33} - \frac{k_3 k_3}{2m} \Bigr) \biggr]\\ % &= \frac{1}{35} \cdot \biggl[ \Bigl( 3 - \frac{4^2}{35} \Bigr) + \Bigl( 0 - \frac{13^2}{35} \Bigr) + \Bigl( 0 - \frac{7.5^2}{35} \Bigr) + \Bigl( 1 - \frac{10.5^2}{35} \Bigr) \biggr]\\ % &= - \frac{423}{2450} \approx -0.17265 \end{align} ```

This shows that indeed -0.17265 is the correct modularity value in the beginning.

So what's the "problem" with DirectedLouvain?

In your paper Direction matters..., you give this definition for the directed modularity:

Q_d = \frac{1}{m} \sum_{u,v} \biggl(A_{uv} - \frac{d_u^{in} d_v^\text{out}}{m}\biggr) \delta(C_u, C_v)

with

d_u^{in} = |\{ v\in V: (v,u)\in A\}|, \quad d_u^{out} = |\{ v\in V: (u,v) \in A\}| 

This definition can be broadened for weighted graphs by replacing the size of the set with the sum of the weights of the elements, here their respective edge weights.

In this definition, you did not exclude the node itself, e.g. for node $u$ when we calculate $d_u^{in}$ we also consider $v = u\in V$ in the set. This means, self-loops should be considered in both $d_u^{in}$ as well as in $d_u^{out}$. However, in the code it seems that it is only considered for outgoing degrees.

For this, see the method double Community::modularity(). Inside the for loop, add a display() call and change cout to cerr in display():

Show display-method ```c++ void Community::display() { for (unsigned int i = 0; i < size; ++i) { cerr << " " << (this->g)->correspondance[i] << "/" << node_to_community[i] << "/" << this->communities_arcs[i].total_arcs_inside << "/" << this->communities_arcs[i].total_outcoming_arcs << "/" << this->communities_arcs[i].total_incoming_arcs; cerr << endl; } } ```

You will see this output in the first call to Community::modularity():

0/0/3.000000/4.000000/1.000000
1/1/0.000000/13.000000/13.000000
2/2/0.000000/7.500000/7.500000
3/3/1.000000/10.500000/9.500000

So, then in modularity(), we calculate (with the simplified formula of modularity):

double Community::modularity() {
    double q = 0.;
    double m = g->get_total_weight();
    for (unsigned int i = 0; i < size; ++i) {
        if (this->communities_arcs[i].total_incoming_arcs > 0 || this->communities_arcs[i].total_outcoming_arcs > 0) {
            double total_outcoming_arcs_var     = this->communities_arcs[i].total_outcoming_arcs / m;
            double total_incoming_arcs_var      = this->communities_arcs[i].total_incoming_arcs / m;
            q                                   += this->communities_arcs[i].total_arcs_inside / m - (total_outcoming_arcs_var * total_incoming_arcs_var);
        }
    }
    return q;
}

But as we can see in the output, total_incoming_arcs is just 1 and not the expected 3+1=4 like the total_outcoming_arcs. This is where the different modularity values come from.

Fix

We can easily fix this by explicitly considering the self-loops for $d_u^{in}$:

double Community::modularity() {
  double q = 0.;
  double m = g->get_total_weight();
  for (unsigned int i = 0; i < size; ++i) {
    if (this->communities_arcs[i].total_incoming_arcs > 0 ||
        this->communities_arcs[i].total_outcoming_arcs > 0) {
      double total_outcoming_arcs_var =
          (this->communities_arcs[i].total_outcoming_arcs) / m;
      auto selfloops = g->count_selfloops(i);
      double total_incoming_arcs_var =
          (this->communities_arcs[i].total_incoming_arcs + selfloops) / m;
      q += this->communities_arcs[i].total_arcs_inside / m -
           (total_outcoming_arcs_var * total_incoming_arcs_var);
    }
  }
  return q;
}

Note these two lines

auto selfloops = g->count_selfloops(i);
double total_incoming_arcs_var = (this->communities_arcs[i].total_incoming_arcs + selfloops) / m;

Now we also initially get:

Modularity is: -0.172653 ✅

If this is really a "bug", I am happy to create a pull request with this change. If this is intended behavior, I'm curious to know the reasoning behind it. I'd also be happy if you corrected some errors in thought I may have made during the writing of this.

Thanks in advance, Dominic

anthonimes commented 1 year ago

Hello Dominic,

Thank you so much for this issue. The way you presented it is impressive and we are deeply grateful for that! 

We indeed think that you found a bug here. Hopefully this will not have had too much impact on the communities computed with our algorithm so far (even if there might be some side effects as you mention, this may not have affected the greedy choice made by the algorithm too significantly...)

We would hence really appreciate your pull request with this change. Please do not hesitate to open other issues if you find some other strange behaviors here, we probably made some mistakes at some other places!

And once again thank you for your contribution. Anthony

Splines commented 1 year ago

Hey Anthony, thanks for your answer. I will open a PR soon, but am a bit busy right now, so I might take one more week ;)