sbromberger / LightGraphs.jl

An optimized graphs package for the Julia programming language

Memory management #1444

Closed erlebach closed 4 years ago

erlebach commented 4 years ago

Suppose I have an empty MetaGraph G of size N = 200_000. I also wish to create much smaller meta_graphs g[i] of different types to be embedded into G. These MetaGraphs have node labels n[i], whose values might be greater than nv(g[i]). These values n[i] are therefore stored as graph metadata of g[i]. I loop through the smaller graphs and update G with their edges and weights, using the node metadata to specify edge sources and destinations. Of course, each MetaGraph is instantiated from a SimpleGraph, so there is associated memory allocation, which is then garbage collected.

I would like to find a method that would allow me to preallocate the largest sparse array I expect to need for my MetaGraphs and reuse it. That would require a clearing operation for the sparse matrix, and a way to maintain the required size (the number of nodes of the current graph), which would differ from the maximum size, i.e. the size of the largest MetaGraph.

Am I making any sense?

I have created a minimal working code. Perhaps I have a serious misunderstanding. Very likely.

using LightGraphs, SimpleWeightedGraphs, MetaGraphs, CSV
using DataFrames, Distributions, Random, StatsBase   # StatsBase provides sample/sample!

# ----------------------------------------------------------------------
# Copy the edges of meta_graph into master_graph, translating the local vertex
# indices of meta_graph to the global node ids stored in its :nodes property.
function myMerge!(master_graph::MetaGraph, meta_graph::MetaGraph, weight::Float64)
    n = get_prop(meta_graph, :nodes)   # global node ids, one per local vertex
    for e in edges(meta_graph)
        i = src(e)
        j = dst(e)
        add_edge!(master_graph, n[i], n[j], :weight, weight)
    end
end
# ----------------------------------------------------
function createWorkGraph(master_size, groups, β_strogatz, weight::Float64)
    elapsed_setup = zeros(15000)
    elapsed_metagraph = zeros(15000)
    elapsed_merge = zeros(15000)

    master_graph = MetaGraph(master_size)

    for (i,grp) in enumerate(groups)
        elapsed_setup[i] = @elapsed begin
            person_ids = grp
            sz = length(person_ids)
            if i % 100 == 0
                println("i,sz: $i, $sz")
            end
            if sz > 5000
                println("###################################")
                println("sz > 5000 (= $sz), SKIP OVER")
                println("###################################")
                continue
            end
        end
        elapsed_metagraph[i] = @elapsed begin
            if sz == 1
                continue
            elseif sz < 5
                mgh = createDenseGraph(person_ids) #p::Float64)
            elseif sz < 25
                mgh = createErdosRenyiGraph(person_ids, 10) #p::Float64)
            elseif sz < 100
                mgh = watts_strogatz(sz, 15, β_strogatz) # erdos-redyi
            else
                mgh = watts_strogatz(sz, 10, β_strogatz)
            end
        end

        if sz > 1
            elapsed_merge[i] = @elapsed begin
                mg  = MetaGraph(mgh, weight)
                set_prop!(mg, :nodes, person_ids)
                myMerge!(master_graph, mg, weight)
            end
        end
    end
    println("setup time elapsed: ", sum(elapsed_setup))
    println("metagraph time elapsed: ", sum(elapsed_metagraph))
    println("merge time elapsed: ", sum(elapsed_merge))
    return master_graph
end
# ----------------------------------------------------
function createErdosRenyiGraph(group, m::Int) #p::Float64)
    n = length(group)
    p = 2*m / (n-1)
    graph = erdos_renyi(n, p)
end
# ----------------------------------------------------
# 1) create a dense graph, given a group
function createDenseGraph(group)
    n = length(group)
    # add Metadata to the nodes and edges with the person_id, or a unique id in 1:population_size
    graph = SimpleGraph(n, div(n*(n-1),2))
end
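# Sketch (not part of the original code): div(n*(n-1),2) edges on n vertices is every
# possible edge, so the SimpleGraph constructor above always returns a complete graph.
# A deterministic equivalent would be:
#     createDenseGraph(group) = complete_graph(length(group))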
# ----------------------------------------------------
function generateDemographicGraphs()

    # Create a fake dataframe with columns :person_id, :work_id, :school_id
    # I want 60 schools, 10,000 workplaces, 250,000 people
    nb_people = 250000
    nb_businesses = 10000
    work_sizes = [3, 70, 300, 5000]
    work_prob = [.7, .12, .09, .05]
    work_prob ./=  sum(work_prob)
    categ = Distributions.Categorical(work_prob)

    work_ids   = collect(1:nb_businesses)
    work_classes = rand(categ, nb_businesses)

    work_id = rand(work_classes, nb_people)
    employees = randperm(nb_people)

    persons = collect(1:nb_people)
    # Create groups
    work_groups = []
    for i in 1:nb_businesses
        work_class = work_classes[i]
        business_size = work_sizes[work_class]
        employees = sample(persons, business_size, replace=true)
        push!(work_groups, employees)
    end
    println("nb_businesses: $nb_businesses")

    # There are actually 10,000 groups
    work_groups = work_groups[1:1000]

    tot_people = sum(map(length,work_groups))

    println("Start work")
    cwgraph = @elapsed work_graph = createWorkGraph(tot_people, work_groups, 0.31, 0.8)
    println("Finished work")

    println("total people: ", tot_people)
    println(" time to execute createWorkGraph: $cwgraph")

    return work_graph
end
# ----------------------------------------------------------------------
@time work_graph = generateDemographicGraphs()
setup time elapsed: 0.0018112059999999999
metagraph time elapsed: 0.29343853900000005
merge time elapsed: 11.430247642999998
Finished work
total people: 344235
 time to execute createWorkGraph: 11.770783486
 11.858278 seconds (27.95 M allocations: 2.653 GiB, 67.12% gc time)

Some notes. 1) I am using @elapsed so that I can print running times the way I want. 2) Here are my timings the second time I make the run. Notice the amount of time spent in myMerge!. It seems that the merging becomes increasingly expensive as I add more edges. Why is this? How do you check whether a vertex is within bounds? I would argue that a version should exist where we can assume the edge vertices are in bounds, since you go through the trouble of using @inbounds.

One approach would be to allocate a large static graph, insert the edges into the static graph and when complete, transfer the static graph to a Metagraph.

Here is what I will try: accumulate the edges into my own list, and create my own add_edges routines without the checks. I believe you could add a routine called add_edges(), which would add them in bulk. Then you only have to check whether the min and max edge vertex is within bounds as opposed to each vertex. I will let you know how it goes.
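
A rough sketch of the bulk-insertion interface proposed above (a hypothetical helper, not part of LightGraphs or MetaGraphs; as written it still calls add_edge!, which does its own per-edge bookkeeping):

# Hypothetical bulk add_edges! sketch: hoist the bounds check out of the loop.
function add_edges!(mg::MetaGraph, srcs::AbstractVector{<:Integer},
                    dsts::AbstractVector{<:Integer}, weight::Float64)
    lo = min(minimum(srcs), minimum(dsts))
    hi = max(maximum(srcs), maximum(dsts))
    (1 <= lo && hi <= nv(mg)) || throw(ArgumentError("edge vertex out of bounds"))
    for (s, d) in zip(srcs, dsts)
        add_edge!(mg, s, d, :weight, weight)
    end
    return mg
end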

sbromberger commented 4 years ago

I also wish to create much smaller meta_graphs g[i] of different types to be embedded into G

I don't know what this means, but

These MetaGraphs have node labels n[i], whose values might be greater than nv(g[i]).

this is almost certainly the wrong way to go about doing it.

I would like to find a method that would allow me to preallocate the largest sparse array I expect to need for my MetaGraphs and reuse it.

MetaGraphs don't use sparse arrays. They use SimpleGraphs under the hood, which use adjacency lists.

PS: I don't have the cycles right now to evaluate that code - it's far from minimal. I might have some time in the next couple of weeks, or someone else might chime in here. Sorry - crunch time at work.

PPS: also, @elapsed isn't sufficient here. Part of providing a MWE is being able to run real benchmarks on it.

sbromberger commented 4 years ago

Also, quickly:

It seems that the merging becomes increasingly expensive as I add more edges. Why is this?

It shouldn't be too much more expensive, but it should generally grow as the log of the max degree of the graph. You might also be allocating eprops for each new edge, but I can't imagine that it's a significant time sink. Can you show me some minimal code that demonstrates clearly that the size of the graph impacts the performance of add_edge!?
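
For reference, one possible shape of such a measurement (a sketch, not from this thread): time a single add_edge! against MetaGraphs that already hold different numbers of edges.

using LightGraphs, MetaGraphs, BenchmarkTools

# If the per-edge cost grew strongly with the size of the existing graph, it would show up here.
@btime add_edge!(mg, 1, 2, :weight, 0.5) setup=(mg = MetaGraph(erdos_renyi(200_000, 10_000)))    evals=1
@btime add_edge!(mg, 1, 2, :weight, 0.5) setup=(mg = MetaGraph(erdos_renyi(200_000, 100_000)))   evals=1
@btime add_edge!(mg, 1, 2, :weight, 0.5) setup=(mg = MetaGraph(erdos_renyi(200_000, 1_000_000))) evals=1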

How do you check whether a vertex is within bounds?

For SimpleGraphs and MetaGraphs, this is an O(1) check: vertices(g) is a unit range, and membership within it is two simple comparisons: 1 <= v <= range[end].

I would argue that a version should exist where we can assume the edge vertices are in bounds, since you go through the trouble of using @inbounds.

We use @inbounds precisely BECAUSE we've done the check and can guarantee that the subsequent index will not produce illegal memory accesses. The check is O(1) and is likely inlined to boot. It's not the source of your performance problems:

foo(g::AbstractGraph, n::Integer) = n in vertices(g)

julia> @code_native foo(g, 4)
        .text
; ┌ @ REPL[11]:1 within `foo'
; │┌ @ SimpleTraits.jl:338 within `vertices' @ SimpleTraits.jl:331
; ││┌ @ interface.jl:206 within `macro expansion'
; │││┌ @ SimpleGraphsCore.jl:39 within `nv'
; ││││┌ @ SimpleGraphsCore.jl:59 within `fadj'
; │││││┌ @ REPL[11]:1 within `getproperty'
        movq    8(%rdi), %rax
; ││││└└
; ││││┌ @ array.jl:221 within `length'
        movq    8(%rax), %rax
; │││└└
; │││┌ @ range.jl:320 within `OneTo' @ range.jl:311
; ││││┌ @ promotion.jl:409 within `max'
        movq    %rax, %rcx
        sarq    $63, %rcx
        andnq   %rax, %rcx, %rax
; │└└└└
; │┌ @ range.jl:997 within `in'
; ││┌ @ int.jl:410 within `<='
        testq   %rsi, %rsi
        setg    %cl
        cmpq    %rsi, %rax
        setge   %al
; ││└
; ││┌ @ bool.jl:40 within `&'
        andb    %cl, %al
; │└└
        retq
        nopw    %cs:(%rax,%rax)
; └
erlebach commented 4 years ago

Thanks. But quickly, when I run small examples with my code, I do not see the large discrepancies I am showing here. I fully understand that I did not provide an MWE according to normal standards, but it was the best I could do at this time.

Ok, I am obviously doing something with my code that creates time anomalies, and for all I know it has nothing to do with any of your libraries. I am new to Julia, after all, but not to coding.

erlebach commented 4 years ago

Here is a minimal example. See if you can run this without changing any parameters. Notice the loop 1:100. If I change this to 1:10, all is fine; change it to 1:2, and things are better still.

function test()
    master_size = 200_000
    master_graph = MetaGraph(master_size)
    sz = 10000

    @time for m in 1:100
        println(m)
        mgh = watts_strogatz(sz, 10, 0.3)
        nodes = sample(1:master_size, sz, replace=false)
        weight = 0.3

        for (i,e) in enumerate(edges(mgh))
            s, d = src(e), dst(e)
            add_edge!(master_graph, nodes[s], nodes[d], :weights, weight)
        end
    end
    println(ne(master_graph))
    return master_graph, mgh
end

mg, mgh = test()
sbromberger commented 4 years ago

My concern here is several-fold:

1) Why are you timing a random process? There's too much variability in watts_strogatz to infer anything about the results.

2) you're timing things that don't matter: the println, the sample

3) why are you enumerating the edges when you don't do anything with i?

4) I'll see if I can work on this tonight.

erlebach commented 4 years ago

Here is a better MWE, written before I saw your comments (I thought you had no time :-) ). Yes, I also have other work to do. :-) The test below shows what I wish to do.

function test_2()
    master_size = 200_000
    master_graph = MetaGraph(master_size)
    sz = 1000
    mgh = watts_strogatz(sz, 10, 0.3)
    weight = 0.3
    nodes = zeros(Int64, sz)

    for m in 1:1000
        sample!(1:master_size, nodes, replace=false)
        @time for (i,e) in enumerate(edges(mgh))
            s, d = src(e), dst(e)
            add_edge!(master_graph, nodes[s], nodes[d], :weights, weight)
        end
    end
    println("--------------------")
    println("nb edges in master_graph: ", ne(master_graph))
    return master_graph, mgh
end

mg, mgh = test_2()

Look at the garbage collection. This should go extremely fast, but add_edge! slows down. Most Julia programmers do not like to use @time, but I do not agree with that reticence. The second time I run this test, the module is compiled, so the time is reliable. What determines the point at which GC is invoked? (Yes, that is too low level, and I really do not care.)

Here is a sample of the times:

  0.008210 seconds (51.21 k allocations: 6.258 MiB)
  0.007726 seconds (51.23 k allocations: 6.264 MiB)
  0.007855 seconds (51.31 k allocations: 6.266 MiB)
  0.007544 seconds (51.25 k allocations: 6.274 MiB)
  0.008105 seconds (51.22 k allocations: 6.258 MiB)
  0.008023 seconds (51.17 k allocations: 6.272 MiB)
  0.007906 seconds (51.22 k allocations: 6.262 MiB)
  0.008032 seconds (51.17 k allocations: 6.267 MiB)
  0.007932 seconds (51.20 k allocations: 6.266 MiB)
  0.008175 seconds (51.25 k allocations: 6.279 MiB)
erlebach commented 4 years ago

Here is a third test, pared down some more:

function test_3()
    master_graph = MetaGraph(master_size)
    weight = 0.3

    #@time for i in 1:sz
    @time for i in 1:1000000
        s, d = nodes_src[i], nodes_dst[i]
        add_edge!(master_graph, s, d, :weights, weight)
    end
    println("--------------------")
    println("nb edges in master_graph: ", ne(master_graph))
    return master_graph
end

test_3()

It takes 20 sec to add a million edges. Is that reasonable to you? It takes 0.014 sec to add 10_000 edges. If the code scaled linearly, it should take 1.4 seconds to add a million edges. What happens when you run this code?

sbromberger commented 4 years ago

I can't run this. Where are you defining nodes_src and nodes_dst?

sbromberger commented 4 years ago

I'm not getting your results:

julia> function doit!(mg, ns, nd)
           for i = 1:length(ns)
              s, d = ns[i], nd[i]
              add_edge!(mg, s, d, :weights, 0.3)
           end
           return mg
       end
doit! (generic function with 1 method)

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 1_000_000); nd=rand(1:200_000, 1_000_000))
BenchmarkTools.Trial: 
  memory estimate:  1.32 GiB
  allocs estimate:  10546722
  --------------
  minimum time:     1.838 s (0.00% GC)
  median time:      2.582 s (31.07% GC)
  mean time:        2.582 s (31.07% GC)
  maximum time:     3.326 s (48.24% GC)
  --------------
  samples:          2
  evals/sample:     1 
sbromberger commented 4 years ago

I think part of your problem is that you're using global variables. Don't do that; the compiler can't guarantee that they won't change type in the middle of the loop: https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Avoid-global-variables-1

erlebach commented 4 years ago

My apologies


const master_size = 200_000
const sz = 1_000_000
const nodes_src = zeros(Int64, sz)
const nodes_dst = zeros(Int64, sz)
sample!(1:master_size, nodes_src, replace=true)
sample!(1:master_size, nodes_dst, replace=true)

function test_3()
    master_graph = MetaGraph(master_size)
    weight = 0.3

    #@time for i in 1:sz
    @time for i in 1:1000000
        s, d = nodes_src[i], nodes_dst[i]
        add_edge!(master_graph, s, d, :weights, weight)
    end
    println("--------------------")
    println("nb edges in master_graph: ", ne(master_graph))
    return master_graph
end

test_3()
erlebach commented 4 years ago

My original code was not using global variables. My latest tests were, because I was continuously simplifying.

I do not understand the construction of your last benchmark. How does setup work? You are using it to set the values of arguments. I will look it up.

I can duplicate your results on your example, written in a form I understand:

function doit!(mg, ns, nd)
   for i = 1:length(ns)
      s, d = ns[i], nd[i]
      add_edge!(mg, s, d, :weights, 0.3)
   end
   return mg
end

function test_code()
    mg = MetaGraph(200_000)
    ns=rand(1:200_000, 1_000_000)
    nd=rand(1:200_000, 1_000_000)
    @time doit!(mg, ns, nd)
end
3.701726 seconds (10.55 M allocations: 1.316 GiB, 56.05% gc time)
3.021390 seconds (10.55 M allocations: 1.316 GiB, 46.68% gc time)
2.758614 seconds (10.55 M allocations: 1.316 GiB, 42.58% gc time)

You might be right about the global variables, although I thought I was paying attention. I really do not like that aspect of Julia, although I am coming to appreciate its other qualities.

erlebach commented 4 years ago

Now let us answer your previous questions:

My concern here is several-fold:

Why are you timing a random process? There's too much variability in watts_strogatz to infer anything about the results. ANSW: my use case will have that random element in it. I will be creating about 10,000 small graphs, whose edges will be incorporated into the larger MetaGraph. Am I making sense? My examples were only illustrative.

you're timing things that don't matter: the println, the sample. ANSW: I agree. But that will not affect the timings. One should be able to distinguish O(epsilon) versus O(1) effects. I understand what you mean though. But it is real world situations I am interested in, not ideal cases.

why are you enumerating the edges when you don't do anything with i? I do not do anything with it because I removed the parts of the code that were not considered part of the MWE; you already have the more complete version. I enumerate the edges of the smaller graph and associate them with node numbers in the larger graph, after which I fill the larger graph.

I think you have helped me enough unless you have personal curiosity. I do not wish to abuse your time!

sbromberger commented 4 years ago

I'm happy to continue helping, though I am still frustrated by the lack of rigor in your benchmarks. Please - don't use @time or @elapsed - make proper benchmarks that are statistically sound and reproducible. That's what I did with doit!() above; and yes - the setup=() does what you think - it initializes these (local) variables prior to every call of the function, and does not count the setup in the reported times.
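
For what it's worth, a minimal illustration of the setup keyword (a sketch, assuming LightGraphs and MetaGraphs are loaded): the setup expression runs before each sample, is excluded from the reported time, and its variables are visible inside the benchmarked expression; evals=1 keeps one evaluation per sample so the mutated graph is rebuilt every time.

using BenchmarkTools, LightGraphs, MetaGraphs

# setup= builds a fresh local graph for every sample; only the add_edge! call is timed.
@benchmark add_edge!(mg, 1, 2, :weight, 0.5) setup=(mg = MetaGraph(1_000)) evals=1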

You are currently the beneficiary of two cancelled meetings :)

erlebach commented 4 years ago

Ok. I will use @benchmark. In your case, it only ran twice. Is that really different than running @time twice? If a benchmark is very fast and slows down in a real world case, is the benchmark still useful? It was a rhetorical question.

sbromberger commented 4 years ago

Is that really different than running @time twice?

Yes. It's equivalent, I think, to running @time a few times to get any caching warmed up, and then running twice, collecting the results each time while minimizing the impact of global variables and other things that @time is not aware of.

There's a reason that BenchmarkTools exists, and for those of us who were around in the dark times before it, there's no way we'd go back. (Yes, argument from authority, but indulge me.)

(PS: I make this mistake occasionally too, but if you use @foo in a github comment, you should enclose it in backticks, otherwise the github user whose name is "foo" gets pinged.)

Also, I know you asked this rhetorically, but I feel compelled to respond:

If a benchmark is very fast and slows down in a real world case, is the benchmark still useful?

When someone can demonstrate a real-world example of this inconsistency happening, the BenchmarkTools folks would love to hear about it.

erlebach commented 4 years ago

Try your doit! with a million, 100_000, and 10_000 entries for nd and ns. Do the results make sense to you? Yes, the cost might be logarithmic with respect to the total edge count. A factor of 10 increase in the number of edges should then correspond to an increase in execution time by a factor of about 23 (~10 log(10)). That seems right.

I would like to provide an MWE that is closer to the code I am executing but much simpler than what I provided before. But honestly, it does seem that you made your point. I do not know why I get the huge times, except that the excessive memory allocations due to the graphs I generate trigger garbage collection, slowing things down.

Let me ask: I have a loop of 10,000 iterations, within which I create a watts_strogatz graph of a different size each time. I would like to collect the edges of this graph. How can I do this without incurring the memory allocations? I'd like to preallocate memory, generate smaller Strogatz graphs within this memory, collect all the edges, and then blast them into the master_graph. The fewer the memory allocations, the faster the code. I could of course do that by bypassing LightGraphs, but I would rather not.

sbromberger commented 4 years ago

Looks linear to me:


julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 1_000_000); nd=rand(1:200_000, 1_000_000))
BenchmarkTools.Trial: 
  memory estimate:  1.32 GiB
  allocs estimate:  10546889
  --------------
  minimum time:     1.809 s (0.00% GC)
  median time:      2.634 s (32.47% GC)
  mean time:        2.634 s (32.47% GC)
  maximum time:     3.460 s (49.45% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 100_000); nd=rand(1:200_000, 100_000))
BenchmarkTools.Trial: 
  memory estimate:  134.71 MiB
  allocs estimate:  1126968
  --------------
  minimum time:     137.399 ms (0.00% GC)
  median time:      203.844 ms (38.07% GC)
  mean time:        223.100 ms (42.36% GC)
  maximum time:     491.767 ms (70.37% GC)
  --------------
  samples:          22
  evals/sample:     1

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 10_000); nd=rand(1:200_000, 10_000))
BenchmarkTools.Trial: 
  memory estimate:  13.44 MiB
  allocs estimate:  118975
  --------------
  minimum time:     9.659 ms (0.00% GC)
  median time:      10.916 ms (0.00% GC)
  mean time:        14.703 ms (26.59% GC)
  maximum time:     122.891 ms (89.76% GC)
  --------------
  samples:          223
  evals/sample:     1

I have a loop of 10,000 iterations, within which I create a watts_strogatz graph of a different size each time. I would like to collect the edges of this graph. How can I do this without incurring the memory allocations?

What memory allocations? You will be allocating memory when you create the graph, and whenever you add an edge to the graph. You should not allocate memory when iterating over the edges of the graph, but you will allocate memory if you collect them.

I'd like to preallocate memory, generate smaller Strogatz graphs within this memory, collect all the edges, and then blast them into the master_graph.

This won't help as much as you think. collect allocates. But in any case, there's no way to preallocate memory for a graph structure. Think about what you know about how SimpleGraphs are stored: that is, as a vector of vectors of neighbors. How do you preallocate those vectors unless you know exactly how many elements each has? (And if you have that, then just do it manually and pass the vector of vectors into SimpleGraphs as the fadjlist. But this is unsupported.)
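
For completeness, a heavily hedged sketch of that unsupported route (the constructor signature and the exact invariants, sorted and symmetric adjacency lists, are LightGraphs implementation details and may differ across versions):

using LightGraphs

# Sketch only: build the adjacency lists yourself, then hand them to the internal
# SimpleGraph constructor. fadjlist must be symmetric (u in fadjlist[v] iff v in
# fadjlist[u]) and each neighbor list must be sorted.
function graph_from_fadjlist(fadjlist::Vector{Vector{Int}})
    foreach(sort!, fadjlist)
    ne = sum(length, fadjlist) ÷ 2        # assumes no self-loops
    return SimpleGraph{Int}(ne, fadjlist) # internal, unsupported constructor
end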

erlebach commented 4 years ago

Thank you Seth. I think I have all I need to solve my problems. Unless I can come up with a simple example that duplicates what I measured during my development, I do not think we can progress further.

Your timings are similar to mine, but do not go up quite as sharply when the number of edges increases. I really appreciate the time you took to help.

Here are my timings on the doit! benchmark (the times decrease by factors of 15 and 18; in your benchmarks, the respective factors are 13 and 14):

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 1_000_000); nd=rand(1:200_000, 1_000_000))
BenchmarkTools.Trial: 
  memory estimate:  1.32 GiB
  allocs estimate:  10546944
  --------------
  minimum time:     2.564 s (38.65% GC)
  median time:      2.587 s (39.04% GC)
  mean time:        2.587 s (39.04% GC)
  maximum time:     2.610 s (39.42% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 100_000); nd=rand(1:200_000, 100_000))
BenchmarkTools.Trial: 
  memory estimate:  134.71 MiB
  allocs estimate:  1126894
  --------------
  minimum time:     165.497 ms (24.21% GC)
  median time:      206.858 ms (28.45% GC)
  mean time:        216.699 ms (38.43% GC)
  maximum time:     353.596 ms (64.76% GC)
  --------------
  samples:          22
  evals/sample:     1

julia> @benchmark doit!(mg, ns, nd) setup=(mg = MetaGraph(200_000); ns=rand(1:200_000, 10_000); nd=rand(1:200_000, 10_000))
BenchmarkTools.Trial: 
  memory estimate:  13.44 MiB
  allocs estimate:  118991
  --------------
  minimum time:     9.894 ms (0.00% GC)
  median time:      11.511 ms (0.00% GC)
  mean time:        13.937 ms (18.32% GC)
  maximum time:     39.215 ms (68.34% GC)
  --------------
  samples:          195
  evals/sample:     1
erlebach commented 4 years ago

I saved time in the following way. Here is a code fragment which reduces memory allocations. This loop is entered thousands of times: a SimpleGraph mgh is created and embedded into master_graph, as you do in doit!(). I think I now have a handle on code performance, although I still do not understand the severe slowdown when I use the MetaGraph approach in the code fragment. My guess: the additional garbage collection is detrimental.

if sz > 1
    #println("Metagraph")

    # very efficient approach (mgh is a SimpleGraph)
    myMerge!(master_graph, mgh, person_ids, weight)

    # inefficient approach
    #mg  = MetaGraph(mgh, weight)
    #set_prop!(mg, :nodes, person_ids)
    #myMerge!(master_graph, mg, weight)  # mg is a MetaGraph
end
erlebach commented 4 years ago

I resolved the issue. The watts_strogatz method does not return for certain values of the parameters. I submitted a bug report. While an assertion error is emitted when the second argument is not smaller than the first, that is not sufficient: the second argument must be "sufficiently smaller", or the third argument must be "small enough". Why this problem occurs, I do not know. I have not checked the source code.

sbromberger commented 4 years ago

I left the explanation in the comments of the other issue. It's known behavior for iterative random graph generators.

erlebach commented 4 years ago

I no longer see the other issue. I figured the issue must be known. But why isn’t there some kind of assert error or some other notice? Can’t the conditions for this issue be detected? After all, you do have an assert when the second argument is larger than the first. Is this a bug or a feature of the algorithm?

sbromberger commented 4 years ago

At what point would you assert? The issue is that the edges are selected randomly but they're selected also based on the state of the graph at the time. Most random graph generators that select edges iteratively do not handle creation of dense graphs in any efficient manner. But you might get lucky and pick that last remaining edge right away, or you might be unlucky at the very beginning and keep picking edges that already exist.

The alternative, to sample without replacement from a set of unused edges, would require V^2 memory and wouldn't scale for any decently-sized graph. We tend to avoid such algorithms in LG since our focus is on efficient algorithms at scale.

The Erdos-Renyi (modified) generator that we use for SimpleGraph(m, n) does something different: if the density exceeds 0.66, it creates the complement graph with (m, m*(m-1)/2 - n) and then inverts it prior to returning it to the user. But that's only because with E-R, the next edge doesn't depend on understanding the state of the entire system.
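
A sketch of that complement trick (not the actual LightGraphs source; the 2/3 threshold and the (nv, ne) constructor come from the description above):

using LightGraphs

# For a requested (nv, ne) whose density exceeds ~2/3, generate the sparse complement
# and flip it, so the edge-by-edge sampling always works against a sparse target.
function dense_random_graph(nv::Int, ne::Int)
    maxe = div(nv * (nv - 1), 2)
    if ne > 2 * maxe ÷ 3
        return complement(SimpleGraph(nv, maxe - ne))
    else
        return SimpleGraph(nv, ne)
    end
end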

erlebach commented 4 years ago

Ok, got it. I really appreciate all the help. My code now runs quite efficiently, now that I understand the constraints. Using watts-strogatz always has a chance of stalling, correct? There are no parameter values below which it is guaranteed (in the mathematical sense) to perform? Once you reply to this last question, I think we can close this thread.

sbromberger commented 4 years ago

Using watts-strogatz always has a chance of stalling, correct?

Yes. The candidate edges are drawn uniformly at random, and draws that duplicate an existing edge are rejected and redrawn, so consider the (unlikely) case where you keep drawing an edge that already exists for your first selection. The construction will never terminate.

Now consider the case where you've asked for a large dense graph, and you're almost done. The probability that you will pick an edge you've already added approaches 1, yet you still haven't gotten all the edges you need. So you have to keep picking until you've gotten enough new random edges to satisfy your parameter selection. This could also never complete in a reasonable timeframe.

In general, unless otherwise specified, assume that any random graph generator is impacted by the parameters controlling its density.
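
A toy version of the rejection-sampling loop being described (illustrative only, not LightGraphs code; fill_random_edges! is a made-up name): as the edge set fills up, the expected number of redraws per new edge grows roughly like maxe / (maxe - ne), which is why dense targets can stall.

using LightGraphs

# Keep drawing uniform candidate edges and rely on add_edge! returning false for
# duplicates; near full density almost every draw is rejected.
function fill_random_edges!(g::SimpleGraph, target_ne::Int)
    n = nv(g)
    while ne(g) < target_ne
        s, d = rand(1:n), rand(1:n)
        s == d && continue      # reject self-loops
        add_edge!(g, s, d)      # no-op (returns false) if the edge already exists
    end
    return g
end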

There are no parameter values below which it is guaranteed (in the mathematical sense) to perform?

Yes, there is, but it's a pathological case and I include it only for completeness: if you use a random graph generator to create a zero-edge graph, it should be guaranteed to return (immediately, in most cases).

Actually, with Watts-Strogatz it's a little different, but the difference isn't really meaningful. The core issue - that adding edges iteratively in a random process is not guaranteed to terminate - still exists.

erlebach commented 4 years ago

Thanks. The explanation is very clear. I have to wonder if solutions can be found using permutations in some way. Nothing comes to mind, at least not if minimal cost is important, which it is.

sbromberger commented 4 years ago

Yup, that's exactly the dilemma. All the permutation functions I know of require an instantiated set, which in this case would be the edge set, which makes it memory-infeasible for large graphs.

I'm glad this is helpful. I've found that optimal graph algorithm efficiency is really tied to specific use cases; that's why we have so many different graph types with different backends (and different limitations).

sbromberger commented 4 years ago

In an effort to keep the issues list current and compact, I'll go ahead and close this out. @erlebach , if you have any other questions or feedback, please feel free to open up new issues as appropriate. Thanks.