Yup, confirmed.
Looking at the code, it does not seem a good idea to find first all the paths (Dijkstra or BFS) and then use them all to calculate the betweenness. This means storing a lot of information, which is not needed, and may harm the analysis of large networks. It would be better to directly implement Brandes' algorithm. Moreover, the same algorithm could be used to calculate the betweenness of all the edges.
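For reference, Brandes' algorithm in this setting would look roughly like the sketch below. This is a minimal illustration in Julia against the LightGraphs API (`nv`, `vertices`, `neighbors`); `brandes_betweenness` is a hypothetical name, not the library's implementation:

```julia
using LightGraphs

# A minimal sketch of Brandes' algorithm (Brandes, 2001) for unweighted
# graphs. Illustrative only; not LightGraphs' actual code.
function brandes_betweenness(g::AbstractGraph)
    n = nv(g)
    bc = zeros(Float64, n)
    for s in vertices(g)
        S = Int[]                        # vertices in order of non-decreasing distance from s
        P = [Int[] for _ in 1:n]         # shortest-path predecessors
        σ = zeros(Float64, n); σ[s] = 1  # path counts (Float64 to delay overflow)
        d = fill(-1, n); d[s] = 0        # BFS distances; -1 means unvisited
        Q = [s]
        while !isempty(Q)
            v = popfirst!(Q)             # FIFO queue (O(n) pop; fine for a sketch)
            push!(S, v)
            for w in neighbors(g, v)
                if d[w] < 0              # w discovered for the first time
                    d[w] = d[v] + 1
                    push!(Q, w)
                end
                if d[w] == d[v] + 1      # a shortest path to w passes through v
                    σ[w] += σ[v]
                    push!(P[w], v)
                end
            end
        end
        δ = zeros(Float64, n)
        while !isempty(S)                # back-propagate dependencies
            w = pop!(S)
            for v in P[w]
                δ[v] += σ[v] / σ[w] * (1 + δ[w])
            end
            w == s || (bc[w] += δ[w])
        end
    end
    return bc ./ 2                       # undirected: each (s, t) pair is counted twice
end
```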
> it does not seem a good idea to find first all the paths (Dijkstra or BFS) and then use them all to calculate the betweenness.
We don't do that. We accumulate after each path calculation and then throw away the shortest path result, keeping only one in memory (and the accumulator) at any time.
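In other words, the loop has roughly this shape (names like `single_source_paths` and `accumulate_paths!` are placeholders for this sketch, not the actual LightGraphs internals):

```julia
bc = zeros(Float64, nv(g))
for s in vertices(g)
    state = single_source_paths(g, s)  # one shortest-path solve at a time
    accumulate_paths!(bc, state, s)    # fold its contribution into the accumulator
    # `state` goes out of scope here; only `bc` survives across iterations
end
```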
It looks like there are two problems here, and they manifest particularly easily with grid graphs. The path counts for vertices in the bottom right of a grid grow exponentially large. This is to be expected (and confirmed with NetworkX), but it overflows `eltype(g)`. Making this a `Float64` instead reduces precision but gives us a maximum path count greater than 1e308.
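To put numbers on that trade-off (a hypothetical REPL check, nothing library-specific): `Float64` raises the ceiling enormously, but spaces out the representable integers at these magnitudes:

```julia
julia> floatmax(Float64)   # ceiling once pathcounts are stored as Float64
1.7976931348623157e308

julia> eps(1.0e20)         # gap between adjacent Float64 values near 1e20
16384.0
```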
The second problem looks to be somewhere in the `accumulate_basic` function. I'm still working on tracking it down.
Some notes:
- Divergence happens sometime around `grid(35, 35)` and increases from there.
- Changing the variables in `_accumulate_basic!` to `BigFloat` with default (256-bit) precision doesn't help, so if it's a floating-point accuracy error, it's happening elsewhere.
I still think this is a problem when the pathcount gets too high. With `grid(36, 36)`, the maximum pathcount is at the last vertex (1296), with 1.1218627781666285e20 paths.
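For scale, that count is already beyond what a signed 64-bit integer can hold (a quick hypothetical check):

```julia
julia> typemax(Int64)
9223372036854775807

julia> 1.1218627781666285e20 > typemax(Int64)
true
```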
I have confirmed that the pathcount values returned by the shortest_paths algorithms are correct. This suggests that the problem is in the `_accumulate_basic!` or `rescale!` functions within the centrality code. Since they're essentially drop-in translations of the NetworkX Python code, this shouldn't be an issue... but it is.
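For context, the rescaling step in the NetworkX code being translated looks roughly like the sketch below (a Julia rendering of NetworkX's `_rescale`, not LightGraphs' actual `rescale!`):

```julia
# Mirrors the logic of NetworkX's _rescale for betweenness centrality.
function rescale_sketch!(bc::Vector{Float64}, n::Integer;
                         normalized::Bool=true, directed::Bool=false)
    if normalized && n > 2
        bc .*= 1 / ((n - 1) * (n - 2))  # ordered (s, t) pairs with s ≠ t ≠ v
    elseif !directed
        bc .*= 0.5                      # undirected: each pair is visited twice
    end
    return bc
end
```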
I can confirm that the four corners of the grid (not only node 1296) generate 1.12e20 shortest paths:

```julia
julia> sum([multinomial(BigInt(i), BigInt(j)) for i in 0:34 for j in 0:34]) - 1
112186277816662845430
```
Oops, the value 112186277816662845430 is for the 35x35 grid, not 36x36, as you can see from the loops, which range from 0 to 34.
My working code (Radalib/Radatools), written in Ada, uses Doubles instead of Ints to accumulate the paths, and directly implements Brandes' algorithm. Anyway, a solution based on BigInt seems much more appropriate.
> Anyway, a solution based on BigInt seems much more appropriate.
It's not, because it would kill performance. Even with `Float64`, where the upper bound is ~1e308, we're seeing accumulation errors because the denominator is so large. `BigFloat` with 256-bit precision fares no better.
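That is consistent with representation not being the bottleneck: 256-bit `BigFloat` distinguishes unit increments at these magnitudes without trouble (hypothetical REPL check):

```julia
julia> precision(BigFloat)
256

julia> big"1.0e20" + 1 == big"1.0e20"
false
```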
The implementation of Brandes' algorithm is a red herring, I think. Both algorithms calculate the total number of shortest paths to each vertex, and both produce the same large number. The difference between Brandes' queue-based BFS and the frontier-based BFS is one of nearest-vertex sort stability; the ordering of neighbors of identical distance is different but that shouldn't matter at all. The frontier-based BFS takes advantage of cache locality and is up to 3x faster than Brandes' queue-based approach.
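To make that concrete, a frontier-based count looks like the sketch below (illustrative Julia against the same LightGraphs API as above; `frontier_counts` is a hypothetical name). Both it and a FIFO queue process vertices level by level, so the per-vertex path counts come out identical:

```julia
# Level-synchronous ("frontier-based") BFS path counting, for contrast
# with the FIFO queue in Brandes' formulation. Illustrative only.
function frontier_counts(g::AbstractGraph, s::Integer)
    n = nv(g)
    σ = zeros(Float64, n); σ[s] = 1
    d = fill(-1, n); d[s] = 0
    frontier = [s]
    while !isempty(frontier)
        next = Int[]
        for v in frontier, w in neighbors(g, v)
            if d[w] == -1                # first visit: w joins the next level
                d[w] = d[v] + 1
                push!(next, w)
            end
            d[w] == d[v] + 1 && (σ[w] += σ[v])
        end
        frontier = next                  # process the whole next level at once
    end
    return σ, d
end
```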
My focus now is on the accumulator. I think what's happening is that we're seeing float drift or something during the division of two very large floating point numbers, and that's getting exacerbated by the summation of the result. More testing later today/tomorrow should confirm the issue.
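A toy example of the kind of drift being suspected (hypothetical REPL check): near 1e20, `Float64` silently absorbs small addends, so per-pair contributions can vanish from a running sum:

```julia
julia> 1.0e20 + 1.0 == 1.0e20   # the addend is below eps(1.0e20) = 16384.0
true

julia> (1.0e20 + 1.0) - 1.0e20
0.0
```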
As it turns out, the troubleshooting problem was developer error: I was using a deprecated version of the centrality measure that defaulted to Dijkstra, to which I hadn't yet applied the fix. Once we moved both shortest paths algorithms to `Float64` pathcounts from whatever Integer type they were using, things started working. Waiting for tests to finish, and then I will merge #1406 and close this out.
For a `Grid([50, 50])`:

```julia
julia> maximum(bc)
90107.69863748763
```
Version 1.3.2 was released with this fix.
**Description of bug**
Betweenness centrality reports meaningless values when the network is a bit large.
**How to reproduce**

```julia
julia> G = SimpleGraphs.grid([50, 50])
julia> btw = betweenness_centrality(G, normalize=false, endpoints=false)
julia> maximum(btw)
6.360647079664976e10
```
**Expected behavior**
I've calculated the maximum betweenness with different software (Radatools), and the result should be: 90107.6985
**Actual behavior**
The maximum betweenness cannot be larger than the number of pairs, since each pair (s, t) with s ≠ v ≠ t contributes at most 1 to the betweenness of v:

```julia
julia> pairs = div((nv(G)-1) * (nv(G)-2), 2)
3121251
```
In fact, there are many nodes with betweenness reported larger than the number of pairs:

```julia
julia> btw_large = btw[btw .> pairs]
julia> length(btw_large)
2264
```
As a reference, the network (a 50x50 grid) has 2500 nodes.
**Code demonstrating bug**
See above.
**Version information**

Output from `versioninfo()`:

```
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
```

Output from `] status LightGraphs`:

```
[093fc24a] LightGraphs v1.3.1
```

**Additional context**
Betweenness is calculated correctly for smaller networks (in particular, smaller grids). For example, a 21x21 grid has maximum betweenness 6369.052660310278, which I've replicated with Radatools. There is a factor of 2 between the Radatools and LightGraphs values, which is not relevant, because Radatools considers the pairs (i, j) and (j, i) as different.