sbromberger / LightGraphs.jl

An optimized graphs package for the Julia programming language

[BUG] Wrong values of Betweenness centrality #1405

Closed sergio-gomez closed 4 years ago

sergio-gomez commented 4 years ago

Description of bug

Betweenness centrality reports meaningless values when the network is a bit large.

How to reproduce

julia> G = SimpleGraphs.grid([50, 50])
julia> btw = betweenness_centrality(G, normalize=false, endpoints=false)
julia> maximum(btw)
6.360647079664976e10

Expected behavior

I've calculated the maximum betweenness with different software (Radatools), and the result should be 90107.6985.

Actual behavior

The maximum betweenness of a node cannot be larger than the number of pairs of other nodes:

julia> pairs = div((nv(G)-1) * (nv(G)-2), 2)
3121251

In fact, there are many nodes whose reported betweenness is larger than the number of pairs:

julia> btw_large = btw[btw .> pairs]
julia> length(btw_large)
2264

As a reference, the network (a 50x50 grid) has 2500 nodes.

Code demonstrating bug

See above.

Version information

Additional context

Betweenness is calculated correctly for smaller networks (in particular, smaller grids). For example, a 21x21 grid has maximum betweenness 6369.052660310278, which I've replicated with Radatools. There is a factor of 2 between the Radatools and LightGraphs values, which is not relevant because Radatools counts the pairs (i, j) and (j, i) as distinct.

sbromberger commented 4 years ago

Yup, confirmed.

sergio-gomez commented 4 years ago

Looking at the code, it does not seem like a good idea to first find all the paths (Dijkstra or BFS) and then use them all to calculate the betweenness. This means storing a lot of information that is not needed, which may hamper the analysis of large networks. It would be better to implement Brandes' algorithm directly. Moreover, the same algorithm could be used to calculate the betweenness of all the edges.
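For reference, something along these lines is what I have in mind: an untested Julia sketch of Brandes' algorithm for unweighted graphs (the function and variable names are mine, not the package's internals).

using LightGraphs

# Untested sketch of Brandes' algorithm for unweighted graphs; illustrative
# only, not the package's current code. Path counts (sigma) are accumulated
# in Float64 to avoid integer overflow on large grids.
function brandes_betweenness(g::AbstractGraph)
    n = nv(g)
    bc = zeros(Float64, n)
    for s in vertices(g)
        S = Int[]                      # vertices in non-decreasing distance from s
        P = [Int[] for _ in 1:n]       # shortest-path predecessors
        sigma = zeros(Float64, n); sigma[s] = 1.0
        dist = fill(-1, n); dist[s] = 0
        Q = [s]
        while !isempty(Q)              # BFS forward pass
            v = popfirst!(Q)
            push!(S, v)
            for w in outneighbors(g, v)
                if dist[w] < 0                # w discovered for the first time
                    dist[w] = dist[v] + 1
                    push!(Q, w)
                end
                if dist[w] == dist[v] + 1     # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    push!(P[w], v)
                end
            end
        end
        delta = zeros(Float64, n)
        while !isempty(S)              # dependency accumulation, reverse order
            w = pop!(S)
            for v in P[w]
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            end
            w != s && (bc[w] += delta[w])
        end
    end
    return bc ./ 2                     # each unordered pair is counted from both endpoints
end

Edge betweenness falls out of the same back-propagation by adding sigma[v] / sigma[w] * (1.0 + delta[w]) to the edge (v, w) instead of to the vertices.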

sbromberger commented 4 years ago

it does not seem like a good idea to first find all the paths (Dijkstra or BFS) and then use them all to calculate the betweenness.

We don't do that. We accumulate after each path calculation and then throw away the shortest path result, keeping only one in memory (and the accumulator) at any time.

It looks like there are two problems here, and they manifest particularly easily with grid graphs. The path counts for vertices in the bottom-right of a grid grow exponentially large. This is to be expected (and confirmed with NetworkX), but it overflows eltype(g), which is Int64 for these graphs. Making the path counts Float64 instead reduces precision but raises the maximum representable count to more than 1e308.
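For a sense of scale (a quick REPL check, not package code):

julia> typemax(Int64)                            # ceiling for Int64 path counts
9223372036854775807

julia> 1.1218627781666285e20 > typemax(Int64)    # a pathcount of the magnitude reported below
true

julia> floatmax(Float64)
1.7976931348623157e308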

The second problem looks to be somewhere in the accumulate_basic function. I'm still working on tracking it down.

sbromberger commented 4 years ago

Some notes:

divergence happens sometime around grid(35, 35) and increases from there. Changing the variables in _accumulate_basic! to BigFloat with default (256-bit) precision doesn't help, so if it's a FP accuracy error it's happening elsewhere.

I still think this is a problem when the pathcount gets too high. With grid(36,36) the maximum pathcount is at the last vertex (1296) with 1.1218627781666285e20 paths.

I have confirmed that the pathcount values returned by the shortest_paths algorithms are correct. This suggests that the problem is in the _accumulate_basic! or rescale! functions within the centrality code. Since they're essentially drop-in translations of the NetworkX Python code, this shouldn't be an issue... but it is.

sergio-gomez commented 4 years ago

I can confirm that the four corners of the grid (not only node 1296) generate 1.12e20 shortest paths. Each shortest path from a corner to the node i rows and j columns away is a monotone lattice path, so there are binomial(i+j, i) = multinomial(i, j) of them:

julia> sum([multinomial(BigInt(i), BigInt(j)) for i in 0:34 for j in 0:34]) - 1
112186277816662845430

sergio-gomez commented 4 years ago

Oops, the value 112186277816662845430 is for the 35x35 grid, not 36x36, as you can see from the loop ranges, 0 to 34.

My working code (Radalib/Radatools), written in Ada, uses Doubles instead of Ints to accumulate the paths, and implements Brandes' algorithm directly. Anyway, a solution based on BigInt seems much more appropriate.

sbromberger commented 4 years ago

Anyway, a solution based on BigInt seems much more appropriate.

It's not, because it would kill performance. Even with Float64 where the upper bound is ~1e308, we're seeing accumulation errors because the denominator is so large. BigFloat with 256-bit precision fares no better.

The implementation of Brandes' algorithm is a red herring, I think. Both algorithms calculate the total number of shortest paths to each vertex, and both produce the same large number. The difference between Brandes' queue-based BFS and the frontier-based BFS is one of nearest-vertex sort stability; the ordering of neighbors of identical distance is different but that shouldn't matter at all. The frontier-based BFS takes advantage of cache locality and is up to 3x faster than Brandes' queue-based approach.

My focus now is on the accumulator. I think what's happening is that we're seeing float drift or something during the division of two very large floating point numbers, and that's getting exacerbated by the summation of the result. More testing later today/tomorrow should confirm the issue.
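One more data point for the drift theory: Float64 represents consecutive integers exactly only up to 2^53, so path counts in the 1e20 range are already rounded before any division or summation happens.

julia> 2.0^53                                    # integers are exact only up to here
9.007199254740992e15

julia> 9.007199254740992e15 + 1 == 9.007199254740992e15
true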

sbromberger commented 4 years ago

As it turns out, the troubleshooting problem was developer error: I was testing with a deprecated version of the centrality measure that defaulted to Dijkstra, to which I hadn't yet applied the fix. Once we moved both shortest-path algorithms from their Integer path counts to Float64, things started working. Waiting for tests to finish, and then I will merge #1406 and close this out.

For a Grid([50, 50]):

julia> maximum(bc)
90107.69863748763

sbromberger commented 4 years ago

Version 1.3.2 was released with this fix.