The configuration model is implemented incorrectly

szhorvat commented 5 years ago

Note: New Julia user here. I am not at all comfortable with the language, just exploring Julia for use with graphs.

The configuration model appears to be implemented incorrectly. A giveaway is that something like

random_configuration_model(100, fill(50,100))

finishes fast. This should never finish in reasonable time with a correct implementation due to a very high rejection rate.

Looking at the implementation,

https://github.com/JuliaGraphs/LightGraphs.jl/blob/216f5ffa77860d4c39b8d05fe2197d0af75a4241/src/SimpleGraphs/generators/randgraphs.jl#L263

it appears to me that when a double edge or a self-loop would be created, the algorithm tries to recover and generate a different edge. However, this would make the sampling non-uniform.

To ensure uniform sampling, the algorithm should re-start from the beginning (throw away all edges that were created so far) whenever it would create a non-simple graph.

simonschoelly commented 5 years ago

I think you are right on how the algorithm works, it creates stubs for all vertices, and then randomly tries to match them. Whenever a self-loop or multiedge would be created, we skip that edge. Then process is repeated with all the remaining stubs. When a point is reached, where the graph with the desired sequence cannot be created, the whole process starts anew.

So we either should find an argument, why this process does create uniformly distributed graphs or note in the documentation, that the process is not uniform. Then we might implement some algorithms that are indeed uniform.

szhorvat commented 5 years ago

Some notes:

This process does not sample uniformly (this you could check numerically on small examples).
To sample uniformly, the process must be restarted as soon as a multi-edge or self-loop was created. This is called the configuration model.
The function's name is random_configuration_model. Since it does not implement the configuration model, this is very misleading. It would still be misleading if the documentation is updated to note that the sampling is not uniform.

simonschoelly commented 5 years ago

You are right, the process is not uniform. It is possible though, that the process described here has the same distribution. Unfortunately, the author did not provide any sources and I haven't found a description of this process somewhere else. Do you have an intuitive argument at hand, why the process here should not result in the same distribution?

sbromberger commented 5 years ago

it appears to me that when a double edge or a self-loop would be created, the algorithm tries to recover and generate a different edge. However, this would make the sampling non-uniform.

Can you give a proof of bias in edge selection? This assertion is not intuitive to me.

Regardless, from the referenced paper:

Instead, self-loops and multi-edges are allowed. But, these represent a tiny fraction of all edges, and we typically lose little by just discarding or collapsing them.

lafond1 commented 5 years ago

I think the confusion here comes from the name "configuration model" and exactly what people mean when they say "sample uniformly." This is not a configuration model (which produces a graph with an exact degree sequence), it is an Erdos-Renyi model w/ no collisions or self-loops. The degree distribution is not uniform, but the algorithm is sampling possible edges uniformly at random as an Erdos-Renyi generator should.

The algorithm should indeed be significantly more expensive to run for denser graphs compared to sparse ones due to the re-sampling, so if this is not the case there might be some bug in the sampling process. Benchmarking would show if this is the case or not.

simonschoelly commented 5 years ago

Here is a bit of an experiment, that shows, that the two models do not behave differently. Although connectivity might not be the best example, as the two models might behave behave asymptotically similar.

I will have to have a look at the algorithm again, but if if truly samples from Erdos Renyi, conditioned on the degree sequence, then that might be even better. We could rename it then, and implement a slow configuration model using rejection sampling, until someone takes the time to implement something better.

using LightGraphs, Random, Base.Iterators, Statistics

function rand_reg_shouldbe(n, k)
   stubs = collect(flatten(repeated(1:n, k)))
   m = div(n * k, 2) 
   edges = Vector{Tuple{Int, Int}}(undef, m)
   @inbounds while true
      shuffle!(stubs)
      no_loop = true
      for i in 1:m
     src = stubs[2*i - 1]
     dst = stubs[2*i]
     if src == dst
        no_loop = false
        break
     end
     edges[i] = minmax(src, dst) 
      end
      no_loop || continue
      sort!(edges)
      allunique(edges) || continue
      return SimpleGraphFromIterator(Edge.(edges))
   end
end

function rand_reg_is(n, k)
   return random_configuration_model(n, collect(repeated(k, n)))
end 

# Probability that k-regular graph with n vertices is connected 
# in the random configuration model.
p_shouldbe(n, k, nsamples) =  mean(is_connected(rand_reg_shouldbe(n, k)) for _ in 1:nsamples)

# Probability that k-regular graph with n vertices is connected 
# using our version of the random configuration model
p_is(n, k, nsamples) =  mean(is_connected(rand_reg_is(n, k)) for _ in 1:nsamples)

julia> p_shouldbe(1000, 2, 100_000)
0.05971

julia> p_is(1000, 2, 100_000)
0.04722

sbromberger commented 5 years ago

I think we're getting a bit far afield here. I am reluctant to change this function given that we're worried about very small differences in probability (that might be eclipsed by the nonuniformity of the PRNG used) for large graphs.

One thing that I did note is that we should be asserting isgraphical on this before we start; otherwise we can have a nonterminating condition:

a = [8, 8, 3, 3, 3, 3, 0, 0, 8]
g = random_configuration_model(9, a)

will not converge, because

julia> isgraphical(a)
false

szhorvat commented 5 years ago

I am not sure how to respond ... there seem to be too much confusion about concepts and misuse of terminology above to respond point by point. I would suggest familiarizing with the topic before proceeding. E.g., M.E.J. Newman: Networks, 2nd ed, has a good description of the configuration model.

I'll just note that the current situation severely diminishes my trust in this package.

sbromberger commented 5 years ago

I'll just note that the current situation severely diminishes my trust in this package.

Sorry to hear that. I hope you find a package that works better for your needs.

sbromberger / LightGraphs.jl

The configuration model is implemented incorrectly #1208