sbromberger / LightGraphs.jl

An optimized graphs package for the Julia programming language
Other
672 stars 185 forks source link

proposal: Add LoopyGraph type and deprecate self-loops in SimpleGraph. #1080

Closed sbromberger closed 4 years ago

sbromberger commented 5 years ago

This has come up because #1079 results in a fix that changes a basic measurement, degree, from O(1) to O(log(deg)), just so we can return the correct results for undirected graphs with self-loops.

Self-loops also add inefficiency to adjacency_matrix.

The original impetus for adding them was a request from the QuantEcon folks, but I don't think they ever used this functionality.

It arguably goes against SimpleGraph philosophy to allow self-loops in any case.

Thoughts?

jpfairbanks commented 5 years ago

Well it would definitely be a 2.0 release if we got rid of self loops.

sbromberger commented 5 years ago

If we decide not to get rid of self-loops, then we need to make an effort to make them first-class citizens.

jpfairbanks commented 5 years ago

Yeah, I'm not sure that degree should be 2 for vertices with self loops. That would break the invariant

degree(v) == length(neighbors(g, v)). Maybe we have a new function loopdegree(g,v) which takes log(degv) time and gives the right answer.

sbromberger commented 5 years ago

so, adjacency_matrix treats self-loops with entry 2, and Wikipedia (yeah, I know, sue me) has this to say:

A special case is a loop, which adds two to the degree. This can be understood by letting each connection of the loop edge count as its own adjacent vertex. In other words, a vertex with a loop "sees" itself as an adjacent vertex from both ends of the edge thus adding two, not one, to the degree.

jpfairbanks commented 5 years ago

Does it make sense to have

struct LoopyGraph{G, L} <: AbstractGraph
    g::G
    loops::L
end

function LoopyGraph{G, Vector{Bool}}(g::G) where G <: AbstractGraph
   return LoopyGraph(g, zeros(Bool, nv(g))
end

function add_edge!(g, v,u)
    if v == u
        g.loops[v] = true
    else
        add_edge!(g.g, v,u)
    end
end

Then LoopyGraph will take the loops as first class citizens and get the degree and everything else right.

And SimpleGraph doesn't have to worry about it. We can still allow them in SimpleGraph, in order to avoid breaking the API, but the degree will be the +1 for a loop in a SimpleGraph and +2 in a LoopyGraph. Our API contract around loops in SimpleGraph is, "you can put loops in there if you need to, but don't be surprised if any algorithm is wrong for loops".

I think this gives us the best of both worlds, no need to break compatibility for SimpleGraph and a place for algorithms that take loops seriously to anchor on the type system. We also get the benefit that removing all loops from a graph is a constant time operation striploops(g::LoopyGraph) = g.g.

What do you think?

sbromberger commented 5 years ago

I was thinking about a separate graph type for self loops and came up with a similar data structure (I was thinking L could be a bitset or a standard set; we'd need O(1) membership check), but it doesn't solve the underlying issue that self-loops are tolerated but second-class citizens in SimpleGraphs.

Adding this type will help in some ways but will sow confusion in others. If we do this then my feeling is that we have to explicitly prohibit self-loops in SimpleGraphs. The other reason we'd want to do this is practical: if we allow pseudoloops in g.g, then we're setting ourselves up for some very strange behavior.

Let's keep the discussions separate for now, since I don't know that the presence or absence of LoopyGraph makes much of a difference in our decision whether or not to strengthen or abandon loops in SimpleGraph. (Until recently, there's been no real evidence of anyone actually using self-loops for critical work since the issues that have come up should've come up earlier otherwise.)

jpfairbanks commented 5 years ago

I think we can

  1. Deprecate Self Loops in SimpleGraph
  2. Add LoopyGraph{G,L}
  3. Update the docs to include the deprecation and new type
  4. Tag a new minor version

Then when we are ready to go for 2.0, we can make self loops in SimpleGraph an exception.

On the choice of L, I think it makes sense to leave that generic because some users will have graphs with mostly no loops, and some with have graphs with many loops, and sometimes you will want to iterate over all the vertices with loops in increasing order or something like that. I can see the use cases for Vector{Bool}, Set{Int}, Bitvector, and Vector{Int}.

sbromberger commented 5 years ago

Agree on all points.

jpfairbanks commented 5 years ago

I'll edit the title to reflect this consensus.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rodrigolece commented 5 years ago

I'm joining the conversation a bit late but this is something that interests a lot. I have raised a separate issue before that is related to self-edges (JuliaGraphs/SimpleWeightedGraphs.jl#41).

From what I understand (correct me if I'm wrong) the defining characteristic of SimpleGraphs.jl is that you want the package to be lightweight and this is why you leave 'features' like weight outside of it. IMHO allowing for self-edges comes at no extra cost and it adds the functionality that most people expect.

As correctly pointed out before, for undirected graphs the consensus is that self-edges are counted as add two separate stubs joined to the vertex, and apart from that they really don't have nothing special and I would include them in a package called SimpleGraphs because they come at no extra cost. This leads us to the convention that we store twice the number of edges in the diagonal entry of the adjacency matrix. If we use that convention all of the mathematical operations that involve taking sums of the adjacency matrix are consistent, and for example this includes the degree (which we would expect to be two #1079).

What I found really surprising is that the adjacency matrix already follows this convention.

julia> g = SimpleGraph(1)
{1, 0} undirected simple Int64 graph

julia> add_edge!(g, 1, 1)
true

julia> adjacency_matrix(g)
1×1 SparseArrays.SparseMatrixCSC{Int64,Int64} with 1 stored entry:
  [1, 1]  =  2

There was a discussion before about this braking an invariant, but that is based on a wrong definition of degree. The degree of a node is defined as the sum (column-wise or row-wise, the matrix is symmetric) of the adjacency matrix and therefore the degree of a node with a self-edge should be two. This is also what most practitioners expect and is the behaviour in igraph and NetworkX (although in igraph the adjacency matrix doesn't have a 2 in the diagonal entries).

In summary, I would be against deprecating self-edges and I would be for changing the definition of degree. Obviously, that is just me.

sbromberger commented 5 years ago

IMHO allowing for self-edges comes at no extra cost

But this is demonstrably untrue as just one example. Depending on how you measure cost, the multiple issues filed over the span of the years since we implemented them count as another real cost (to developers).

Some history may be in order: Simple graphs (at least as they're traditionally defined) don't allow for multiple edges or self loops. We made an exception to the latter a few years ago due to some work being done in the Bio.jl program. It turns out that 1) it's much harder to implement self-loops than it originally appears, since all sorts of corner cases now emerge, and 2) they wound up not needing the self-loop functionality after all.

Self-loops have always therefore been in this twilight zone where we've got structural support for them, but none of the algorithms are really designed to take advantage of them. It's this inconsistency, the difficulty of going back to verify that self-loops work in the hundreds of functions we have, and the fact that few people really use them, that makes it appealing to me just to declare that our SimpleGraph struct does indeed adhere to the traditional definition.

Use self loops right now at your own risk. Bugs reported in LightGraphs that only manifest with self loops will be deprioritized.

rodrigolece commented 5 years ago

Okay, that clarification was important. Yes, a simple graph is defined to be a graph with no multiedges and no self-edges. I thought the main distinction you were making in the ecosystem was about weighted (for which you would use SimpleWeightedGraphs) vs unweighted (LightGraphs), now I know this is not the case.

I understand that you have limited resources and yes, I imagine how difficult it would be to go back and verify that all of the algorithms work. However, I'm not sure I agree with

the fact that few people really use them

is true. They are important for the people in my field (hence why I'm here), community detection in networks, because you seek for a birds-eye view representation of networks and self-edges correspond to edges inside the community. They are important in mobility networks (commuters don't necessarily travel to a different city) and they are also crucially important inside the configuration models (in plural, and this is at the root of the the discussion of #1208 which is now sadly but perhaps correctly locked).

Regarding the lines of code that you refer to, I will now look into it but I don't think that an example of an algorithm that becomes inefficient because you have a 2 in the diagonal is a good reason for scrapping self-edges altogether. As I said before, the degree of a vertex with a self-edge (and nothing else) is equal to 2. There is no way around that, and I think that we should all strive together to make sure that self-edges are consistently implemented like they are in other packages (igraph and NetworkX).

I understand that you have limited resources and if I am able I would love to contribute and help you deal with this issue properly (my expertise is in the mathematics of networks more than software developing so you would need to be patient with me).

I think a good starting place would be to fix the degree?

rodrigolece commented 5 years ago

BTW I meant that it comes at no extra cost because the adjacency matrix already has a 2 in the diagonal. That allows you for a straightforward implementation of the degree (the sum of entries along a row or a column). I agree that running time is another way to measure cost and I didn't want to be dismissive about the effort you have put in as developers.

sbromberger commented 5 years ago

I think a good starting place would be to fix the degree?

In my mind, there's the "quick fix" and the "correct fix". What we've been doing is a series of "quick fixes" which may make one person's code work, but does nothing for others, and may introduce regressions throughout the codebase. "Fixing the degree" is an example of a "quick fix" - because degree is a fundamental operation, we need to make sure that any change to it won't impact anything else.

The "correct fix" is to decide to promote self-loopiness to first-class support, and ensure that tests are written to represent self looped graphs in such a way that errors show up for cases where the functions aren't yet prepared to deal with them. This is a lot of work for a small user base and, so far, nobody's come forth to say, "This is important to me so I'll do it."

If it's important enough to you, then let's sketch out a plan for making self-loops fully supported. I'm still not convinced that they shouldn't have their own graph type (per the original proposal here), but if you've got enough stamina to see this through, then we should definitely talk.

rodrigolece commented 5 years ago

I understand your point of view and I think I generally agree with you. The correct fix for me would also be to promote self-edges.

I think I'm interested enough in Julia and networks that I want to give it a go. It is true that I've never really contributed to a codebase and it might take me some time and effort to get up to speed, but I can definitely give it a try. Let's carry on with the conversation and let's try to sketch a plan together :)

sbromberger commented 4 years ago

@rodrigolece Looping back to this (pun intended): if you want to take a shot at this, the best thing to do (IMO) is to copy the SimpleGraph/SimpleDiGraph code into a new graph type (LoopyGraph seems like it was the preferred name but we can bikeshed later) and then try to get self-loops working on all the algorithms.

Reopening for further discussion. If you're no longer interested/available, let me know and I'll close it out. Thanks!

rodrigolece commented 4 years ago

@sbromberger I'm on it! Let's see how far along I can get over this coming Christmas break.

sbromberger commented 4 years ago

Awesome! Good luck and feel free to reach out on github or on Slack if you need help or want to share ideas.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.