snowleopard / alga

Algebraic graphs
MIT License
717 stars 68 forks

scc algorithm #235

Closed jitwit closed 4 years ago

jitwit commented 5 years ago

Implementation of scc algorithm for adjacency maps and adjacency int maps based on https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm.

Avoids using Data.Graph, and benchmarks indicate that this pays off: https://jitwit.github.io/criterion/scc-bench.html. In the report, old-alga corresponds to alga's KL implementation, new-alga to the intmap implementation, and ord-alga to the new implementation for AMs. The graphs used are the real-world ones from haskell-perf/graphs. The AM implementation seems to be around 10% faster and the AIM version around 65% faster.

There is no NonEmpty.AdjacencyIntMap module, so for now AdjacencyIntMap.Algorithm returns type AM.AdjacencyMap AdjacencyIntMap.
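For reference, the linked path-based algorithm can be sketched self-containedly over plain adjacency lists (hypothetical names, Data.Map only; the PR itself works on alga's AdjacencyMap/AdjacencyIntMap and additionally builds the condensation):

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

-- Search state, threaded by hand to keep the sketch dependency-free.
data St = St
  { preOf    :: Map Int Int   -- preorder number of each visited vertex
  , compOf   :: Map Int Int   -- component id of each assigned vertex
  , stackS   :: [Int]         -- visited but not yet assigned vertices
  , stackP   :: [(Int, Int)]  -- boundary stack of (vertex, preorder) pairs
  , nextPre  :: Int
  , nextComp :: Int
  }

-- | Assign every vertex of a graph (given as adjacency lists) an SCC id.
sccIds :: Map Int [Int] -> Map Int Int
sccIds g = compOf (foldr visit (St Map.empty Map.empty [] [] 0 0) (Map.keys g))
  where
    visit v st
      | Map.member v (preOf st) = st
      | otherwise               = dfs v st

    dfs v st =
      let p   = nextPre st
          st1 = st { preOf   = Map.insert v p (preOf st)
                   , stackS  = v : stackS st
                   , stackP  = (v, p) : stackP st
                   , nextPre = p + 1 }
          st2 = foldr step st1 (Map.findWithDefault [] v g)
      in case stackP st2 of
           (u, _) : rest | u == v ->            -- v is a component root
             let (above, _ : below) = span (/= v) (stackS st2)
                 c = nextComp st2
             in st2 { compOf   = foldr (\x -> Map.insert x c) (compOf st2) (v : above)
                    , stackS   = below
                    , stackP   = rest
                    , nextComp = c + 1 }
           _ -> st2

    step w st
      | not (Map.member w (preOf st)) = dfs w st   -- tree edge: recurse
      | Map.member w (compOf st)      = st         -- w already in a finished SCC
      | otherwise =                                -- back/cross edge: contract boundary
          st { stackP = dropWhile (\(_, q) -> q > preOf st Map.! w) (stackP st) }
```

This is only meant to show the two-stack mechanics (the boundary stack is popped on back edges until its top's preorder is at most that of the target), not the PR's actual implementation.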

snowleopard commented 5 years ago

@jitwit The results are great! You are unstoppable :-)

scc is quite an important algorithm with many users, so perhaps we should have it in our regression suite to keep an eye on its performance. @nobrakal Do you think you could add it?

snowleopard commented 5 years ago

P.S.: I'll need some time to review the implementation due to upcoming travel.

jitwit commented 5 years ago

I was actually going to ask about how to add things to regression suite, since it would be useful for bfs too (using a different queue structure might be an opportunity for slightly better performance).

snowleopard commented 5 years ago

Here is where the performance testing script comes from:

https://github.com/snowleopard/alga/blob/35465faff04f0f2a608c5761f62cc65dee122b71/.travis.yml#L105

I guess we could move it to this repository to make changes more convenient.

@nobrakal What do you think?

nobrakal commented 5 years ago

@snowleopard

I am actually ashamed of this script, and currently I think writing a very simple benchmarking suite and getting rid of https://github.com/haskell-perf/graphs would be the best option.

I would love to give haskell-perf/graphs the rewrite it needs, but I don't have any time to do it now.

Anyway, adding scc seems very feasible and I can give it a try if you think it is the best idea :)

snowleopard commented 5 years ago

@nobrakal It's always possible to improve things but it's not always necessary :-) The current implementation may not be pretty, but the script does work and is very useful!

If you could add scc (and perhaps bfs too) that would be great. And, of course, I'd be happy to have a prettier script when you find time for this :)

snowleopard commented 5 years ago

@jitwit There are some merge conflicts, please resolve.

jitwit commented 5 years ago

I re-ran the benchmarks and the results were less favorable in some cases for AdjacencyMaps. The graphs where the current version of the new implementation does poorly are those with a high number of small SCCs; the graphs where it does well have a small number of large SCCs. The AdjacencyIntMap version is much faster regardless.

The link to the criterion benchmarks from before has the current results.

snowleopard commented 5 years ago

@jitwit I'm looking at this link which corresponds to AdjacencyIntMap I think:

https://jitwit.github.io/criterion/scc-bench.html

This is better across all benchmarks, but can you show the results for AdjacencyMap? I'd like to have more performance info about the problematic cases you mention.

snowleopard commented 5 years ago

By the way, the performance regression suite reports scc: 1.10 (OK).

jitwit commented 5 years ago

Yeah, those names were unclear; I changed them to KL-alga, AM-alga, AIM-alga.

I've looked closer, and the scc algorithm does perform well; the issue seems to be converting to the AM (NonEmptyAM a) representation afterwards. When there are many sccs, using induce is slow. When there are few sccs, classifying edges (another approach I'm experimenting with) is slow.

The result of the scc search contains some additional information, so I was wondering if a hybrid approach might be feasible. For example, depending on the ratio of the number of sccs to the number of vertices, different functions could be used to convert to the final representation.

jitwit commented 5 years ago

The benches are updated with the "hybrid" approach.

P.S. Apologies for the messy code, I was aiming to see if the approach would even pay off! P.P.S. According to the regression suite, apparently not, but the benchmarks I ran on larger graphs look promising.

jitwit commented 5 years ago

I tried the searches on a twitter graph from SNAP: https://snap.stanford.edu/data/ego-Twitter.html (~81k vertices ~1.7m edges) and the results are good:

https://jitwit.github.io/criterion/twitter-scc.html

snowleopard commented 5 years ago

@jitwit Very impressive, especially the Twitter benchmark! I think it's worth adding it to the benchmark suite at https://github.com/haskell-perf/graphs.

I'm concerned with the complexity of the resulting code though :( How much improvement is your hybrid approach bringing percentage-wise? If it's not too much then I think I'd prefer to remove the special tricks and instead focus on making a faster induce, if that's possible.

jitwit commented 5 years ago

The number of passes over the graph with induce increases with the number of sccs. I tried the algorithm with induce on the twitter graph and it was something like 5x slower. I think the right solution may be to ditch the hybrid approach as well as induce. Something like

partition :: Eq k => (a -> k) -> Graph a -> [Graph a]

is sort of what is needed.
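To illustrate the idea (hypothetical names, over plain adjacency lists rather than alga's types, and needing Ord k rather than Eq k because of the Map): a single fold can group vertices by an arbitrary classifier and keep only intra-class edges, touching each vertex and edge once instead of once per class as with repeated induce:

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

-- Group vertices of an adjacency-list graph by their class, keeping
-- only intra-class edges. Every vertex and edge is visited exactly once.
partitionBy :: Ord k => (Int -> k) -> Map Int [Int] -> Map k (Map Int [Int])
partitionBy f = Map.foldrWithKey insertVertex Map.empty
  where
    insertVertex v ws acc =
      let k   = f v
          ws' = [w | w <- ws, f w == k]  -- intra-class successors only
      in Map.insertWith (Map.unionWith (++)) k (Map.singleton v ws') acc
```

This is just a sketch of the shape such a primitive could take; the real version would have to work on alga's AdjacencyMap and produce non-empty graphs.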

There are still a few cases where the current approach is slower than the old implementation, but for the most part it's faster.

jitwit commented 5 years ago

I was also wondering how important is the AM (NonEmptyAM a) representation? What queries are most commonly asked of the result of scc?

jitwit commented 5 years ago

Ach, scc $ vertices xs isn't handled properly; I'll fix it soon (will also add the test case so that it always hits).

snowleopard commented 5 years ago

@jitwit

The number of passes over the graph with induce increases with the number of sccs. I tried the algorithm with induce on the twitter graph and it was something like 5x slower.

Oh, indeed, I didn't realise the current implementation is quadratic due to multiple calls to induce. This is certainly a bug! Can we call induce only once? I'm fine to introduce another primitive if it helps to solve this issue, although I'm not yet sure how partition helps. I think we need a way to induce a two-level structure in a graph without paying the cost of calling induce multiple times.

I was also wondering how important is the AM (NonEmptyAM a) representation?

I think the ideal type for scc is

scc :: Ord a => AdjacencyMap a -> Acyclic.AdjacencyMap (NonEmpty.AdjacencyMap a)

I love the ability to express the precise structure of scc in types and I hope we'll switch to this type at some point when we have a solid story for acyclic graphs.

For now, let's try to stick to AM (NonEmptyAM a): I think it's worth spending some time thinking how to fit an efficient engine behind a nice shiny API :)

snowleopard commented 5 years ago

Perhaps, we need the following variant of partition?

partition :: Eq k => (a -> k) -> Graph a -> Graph (NonEmpty.Graph a)
jitwit commented 4 years ago

Hi, sorry for the long pause!

Yeah, I'm trying to come up with a reasonable approach. The dfs of g builds a mapping scc :: V -> Int which assigns vertices to components. I guess my main point with partition was that we're classifying over a larger type than Bool with induce.

My best shot so far is to do a fold (Map.foldrWithKey) to classify the edges and vertices, followed by gmapping by component id. Each edge (u,v) of g will either be included as an edge (scc u, scc v) between components or as an edge (u,v) within the component scc u. (I think a foldg would be useful, but it's in Algebra.Graph.ToGraph which imports the algorithms modules.)
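A sketch of that edge classification over plain adjacency lists (hypothetical names, not the PR's code): given a finished component assignment, one fold splits the edges into intra-component edges and edges of the condensation.

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

-- Split the edges of a graph into edges inside a component and edges
-- between components, given a vertex-to-component assignment.
classifyEdges :: Map Int Int -> Map Int [Int] -> ([(Int, Int)], [(Int, Int)])
classifyEdges comp = Map.foldrWithKey visit ([], [])
  where
    visit u ws acc = foldr (classify u) acc ws
    classify u w (inner, outer)
      | cu == cw  = ((u, w) : inner, outer)    -- edge within an SCC
      | otherwise = (inner, (cu, cw) : outer)  -- edge of the condensation
      where
        cu = comp Map.! u
        cw = comp Map.! w
```

The real implementation would build adjacency maps directly rather than intermediate lists, but the one-pass structure is the point.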

The current iteration does comparatively well on most graphs. The exception is graphs with only a few sccs.

https://jitwit.github.io/benchmarks/criterion/scc-bench.html

The induce approach did rather well on the graphs where the current approach does poorly, which was why I thought it may be best to still use it for some graphs. Being able to detect the case of 1 scc gives a rather large speedup to the current implementation (where we can just return AM.vertex g).

P.S. Updated results for social network graphs; similar results to before:

https://jitwit.github.io/benchmarks/criterion/rw-scc-bench.html

jitwit commented 4 years ago

I think the main awkwardness is that having NonEmpty.AM as vertices means that manipulating the outer Acyclic.AM map of the condensation is expensive. Returning a Graph (NonEmpty.AM a) would probably be faster? I haven't tried it yet, though.

snowleopard commented 4 years ago

@jitwit Good to have you back! Sorry it took me a while to take a careful look at the recent changes.

The current iteration does comparatively well on most graphs. The exception is graphs with only a few sccs.

Being able to detect the case of 1 scc gives a rather large speedup.

My intuition tells me that the problem is not a small number of SCCs, but the large size of some SCCs. Am I right? As an example, consider a graph with 1000 vertices, with one large component (of size 900) and 100 small components (of size 1). I'm pretty sure that in this case the performance will still be low even though there are many SCCs. Could you double-check?

I think the main awkwardness is that having NonEmpty.AM as as vertices means that manipulating the outer Acyclic.AM map of the condensation is expensive.

Indeed, there is an issue about this: #141.

Returning a Graph (NonEmpty.AM a) would probably be faster?

I wouldn't be surprised, however, the resulting Graph will need to be converted to an adjacency map at some point, which brings us back to the problem.

snowleopard commented 4 years ago

@jitwit Anyway, the benchmarking results are pretty good already! I wouldn't be strongly opposed to merging as is: I think we gain more than we sacrifice in terms of performance. However, if there is a way to speed up the remaining problematic cases that would be great.

jitwit commented 4 years ago

Your intuition seems to be right given the example vertices [0..1000] + circuit [1001..10000]. I'm not sure how to interpret the results for the same graph overlaid with [ edge x (x+1000) | x <- [0..1000]], though. The goal of the second was to see what happens when there are also many edges between components.

https://jitwit.github.io/benchmarks/criterion/scc_misc.html

I went back and looked at the worst benchmark's graph, which has 10000 vertices, 9999 of which are in the same strongly connected component.

I wouldn't be surprised, however, the resulting Graph will need to be converted to an adjacency map at some point, which brings us back to the problem.

Makes sense!

snowleopard commented 4 years ago

@jitwit For the problematic case, could you isolate the computationally expensive part? As far as I understand, the traversal of the original graph is not overly expensive; the main cost is the final step, where we construct the resulting graph with heavy vertices. Is that right?

Returning to the partition function:

partition :: Eq k => (a -> k) -> Graph a -> Graph (NonEmpty.Graph a)

Do you think a fast implementation of this function would allow us to speed up the problematic case?

jitwit commented 4 years ago

I looked at a few of the graphs under +RTS -p (twitter, worst-case nodes10000, and n5). For each, the most expensive part of the search is overlays inside the sccs, the second most expensive is traversing the graph via Map.foldrWithKey, and the third is the dfs search.

overlays (inside sccs, not between) tends to take around 20-30+% of the overall computation; for the worst case, nodes10000, it was above 30%. Less than half of the time is spent in dfs for each graph, and searching is light on allocations.

Building a list of vertices and edges and calling overlays afterwards was a fairly big win over building the graph by inserting each vertex and edge.

After looking into this, I realized difference lists might help. There's a modest improvement by switching to Endo [AM a] while condensing, but the effect wasn't as large as I hoped (around 0-5% improvement depending on graph).
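For context, the difference-list trick here is the standard Endo encoding, where append is function composition; a tiny self-contained sketch (generic lists rather than the actual [AM a] state):

```haskell
import Data.Monoid (Endo (..))

-- A difference list: a list represented as a function that prepends it.
type DList a = Endo [a]

singletonD :: a -> DList a
singletonD x = Endo (x :)

toListD :: DList a -> [a]
toListD xs = appEndo xs []

-- Left-nested (<>) on Endo is function composition, so each append is
-- O(1) and the final materialisation is O(n), unlike left-nested (++),
-- which is quadratic.
example :: [Int]
example = toListD (foldl (\acc x -> acc <> singletonD x) mempty [1 .. 5])
```

This is only the general pattern; the gain in the condensation code depends on how often the accumulator is appended to on the left.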

jitwit commented 4 years ago

Full profiling results:

https://raw.githubusercontent.com/jitwit/bench-alga/master/nodes10000.prof https://raw.githubusercontent.com/jitwit/bench-alga/master/n5.prof https://raw.githubusercontent.com/jitwit/bench-alga/master/twitter.prof

These are the result of calling (more or less) print . AM.vertexCount . scc =<< read_graph "file". Reading, done via unfoldr and ByteString.readInt, accounted for around 1/10 of the whole computation.
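The reading step can be sketched like this (a guess at the shape of read_graph's inner loop, not the actual benchmark code):

```haskell
import qualified Data.ByteString.Char8 as B
import           Data.List (unfoldr)

-- Pull whitespace-separated Ints out of a ByteString with readInt,
-- avoiding String-based parsing for large edge-list files.
parseInts :: B.ByteString -> [Int]
parseInts = unfoldr (B.readInt . B.dropWhile (`elem` " \t\n"))
```

An edge list would then be built by pairing up consecutive numbers from the result.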

jitwit commented 4 years ago

the main cost is the final step, where we construct the resulting graph with heavy vertices. Is that right?

Forgot to respond to questions. Yeah, it looks like this is right.

Since a good portion of the time is spent in overlays, I think it may be difficult to improve much on the current approach. One idea I could try is building the Endo [AM a]s with Arrays instead of IntMaps.*

A good partition could be a way to improve, but I'm not sure how to implement it or how specifically it would help without further thought.

* Actually, come to think of it, this wouldn't change much in the problematic case; it could be a general improvement, though.

jitwit commented 4 years ago

Though within range of noise, the last two CI benches show k = 0.92 and 0.93 for scc -- likely thanks to difference lists!

snowleopard commented 4 years ago

@jitwit Thank you for further profiling!

A good partition could be a way to improve, but I'm not sure how to implement it or how specifically it would help without further thought

I had some ideas, let me try to remember and write them down.

actually come to think of it, this wouldn't change much in the problematic case, it could be a general improvement though.

Are you sure? I think we agreed that the problematic case is where we have components of large size, which probably means that the current implementation is non-linear in the size of components, i.e. instead of being O(|s1|+|s2|+...+|sK|), it is something like O(|s1|^2 + |s2|^2 + ... + |sK|^2), perhaps with some log factors here and there. My hope is to come up with a faster partition function that will not suffer from such non-linearity, which would mostly improve the problematic cases.

I conjecture that in non-problematic cases, partition (or whatever we currently have in its place) does not take a large proportion of time.

Though within range of noise, the last two CI benches show k = 0.92 and 0.93 for scc -- likely thanks to difference lists!

Nice! Note that we have a basic implementation of difference lists:

http://hackage.haskell.org/package/algebraic-graphs-0.4/docs/Algebra-Graph-Internal.html#t:List

Perhaps that might help make the code a bit easier to read. Could you reuse it?

jitwit commented 4 years ago

I hadn't realized there was the internal implementation, done! It's definitely clearer than writing Endo $ (++) xs, though it does seem to require some type annotations. Is it possible to get rid of them?

Also, right! I only had the specific nodes10000 benchmark in mind when I made that comment.

jitwit commented 4 years ago

I figured out a way to classify the edges inside the dfs search. The benchmarks for most graphs are improved.

The first idea is that, for an edge (u,v), if v gets classified as part of an scc either before or during the visit to u, then (u,v) is an outer edge in the condensation; otherwise it's an inner edge.

One complication is knowing which components the inner edges belong to. The second idea is that the preorder number can be used to track which queued inner edges should be added to a scc (like it is for vertices).

It doesn't bridge the gap in the worst-case benchmark, but it does decrease it. Before, the result was usually something like 220ms for the current implementation vs 340ms for the new one; now I'm getting something like 220ms to 315ms.

jitwit commented 4 years ago

Twitter graph is from 10.4/6.9s to 8.8/4.4s for am/aim

jitwit commented 4 years ago

After the re-organizing, twitter down to 7.6 and 4.2 s!

Also, to clarify why dfs v starting (or not starting) a new component determines whether edge (u,v) is inner or outer: sccs are selected based on the preorder number. On back edges, the boundary stack is popped to put the vertex with the lowest possible preorder number on top. If that's not v, then u is part of the same strongly connected component; if it is v, u is in a different one.

snowleopard commented 4 years ago

Twitter graph is from 10.4/6.9s to 8.8/4.4s for am/aim

After the re-organizing, twitter down to 7.6 and 4.2 s!

Whoa, that's great! Our CI performance suite also shows a modest improvement.

It doesn't bridge the gap in the worst case benchmark, but it does decrease it. Before, the result was usually something like 220ms for current implementation vs 340 for new -- now I'm getting something like 220ms to 315ms.

Is there any hope you can improve it further?

Sorry, I didn't have time to think about partition. I did look at the new implementation though and got lost pretty quickly :( The StateSCC datatype is scary. It would be nice to document it, perhaps by adding a large note in a single comment, explaining what's going on. (Unless the current state is just work in progress.)

jitwit commented 4 years ago

I started to add some comments to document the StateSCC data type. I opted to put them sort of in line with the fields in the type, but if it's preferable I can rearrange them to make the comments about it contiguous?

Would it be helpful to document more of the inner functions of the algorithm as well?

Is there any hope you can improve it further?

Possibly, though I think the current version is potentially near a local max (unless I've missed something silly). I could still try the idea of using an array to tabulate the inner graphs, but that would complicate the type of the state monad. Otherwise, coming up with a way to identify which situations should use induce to build the condensation (like in the last implementation) could do the trick. I don't know what reasonable heuristics would be, though.

(Unless the current state is just work in progress.)

I'm not sure if any of the fields are possible to eliminate without giving up some of the performance wins. The boundary stack stores vertices with their preorder ids to reduce the map lookups and the last three fields are used for constructing the condensation. The remaining fields are core to the Gabow's algorithm -- ints for giving vertices ids, tables for looking up the ids, and the two stacks.

snowleopard commented 4 years ago

@jitwit Many thanks! I'll give the current version a thorough review soon.

snowleopard commented 4 years ago

@jitwit Happy New Year! I reviewed some code, but I can't say I fully understand the implementation yet. Sorry for taking so long.

Given the complexity of the new implementation, I think we need to add a test comparing the new scc with the standard KL.scc. This would give me a lot more confidence, at least until I fully internalise the code.

jitwit commented 4 years ago

Hey, happy new year! Yeah, I wish the code was less convoluted. I'll try and find ways to make things a bit cleaner. Also, adding tests to compare against KL.scc sounds good.

jitwit commented 4 years ago

I sketched together some notes about the implementation. Hopefully they can help clarify the approach or highlight things needing improvement:

https://github.com/jitwit/alga-notes/blob/master/gabow.org

snowleopard commented 4 years ago

@jitwit Excellent notes! Thanks for doing this. I only got lost a bit where three vertices u, v and w are involved -- in those cases a diagram would help a lot :-)

Please link to your doc from the comments.

jitwit commented 4 years ago

Forgot to add links, will do in the next pass! I also made a diagram showing a pass of the algorithm on 3*1*4*1*5: https://raw.githubusercontent.com/jitwit/alga-notes/master/images/gabow.png. Gray means unvisited, red visited, blue assigned to a component. Vertices are labelled with a vertex,preorder,scc triple. Edges are coloured green or black according to whether they were found to be an edge of the condensation.

snowleopard commented 4 years ago

@jitwit Thanks! The diagram does help.

I've made another pass with comments.

jitwit commented 4 years ago

I also added a CPP statement to get rid of the redundant Data.Monoid import warning for post-8.2 GHC.

snowleopard commented 4 years ago

I also added a CPP statement to get rid of the redundant Data.Monoid import warning for post-8.2 GHC.

I think you can just import Data.Semigroup ((<>)) instead, getting rid of CPP. I dropped support for GHC < 8 recently hoping to eliminate most occurrences of CPP.

As long as there are no warnings in GHC 8.6.5, I'm happy!

snowleopard commented 4 years ago

@jitwit I think I've noticed a few more places where we can save some time. Could you please have a look?

jitwit commented 4 years ago

Much better results!

https://jitwit.github.io/benchmarks/criterion/scc-bench.html

Just building the inner graphs (no difference lists) makes a big difference.

snowleopard commented 4 years ago

@jitwit Excellent! And the code got simpler too. Looks like we're mostly beating KL now, with only one scenario where we're slightly behind. I think we can tolerate this.

I think we can merge the PR now unless you'd like to do any further improvements.

jitwit commented 4 years ago

On larger graphs the performance gain is even bigger. An acyclic Facebook network graph gets 200ms to 600ms (4000 vertices, 90000 edges), and the Twitter one gets 6.4s to 60s (80000 vertices, 1,700,000 edges).

The part I'd maybe want to improve is the incomplete pattern in (curr,_:pth') = span (/=v) pathStack. Any preference for how: an extra helper, using tail, or something else?

snowleopard commented 4 years ago

200ms to 600ms [...] and 6.4s to 60s

@jitwit This is pretty cool! I guess when you're benchmarking KL you are including the overhead of converting an adjacency map to an array representation. What happens if you start with a graph stored in KL-style array? Is your implementation still significantly faster?

I don't really mind (curr,_:pth') = span (/=v) pth. I think going via an intermediate function would just obscure the invariant. One slight improvement is perhaps (curr,_v:pth') = span (/=v) pth, i.e. giving the unused binding a name.