Closed. jitwit closed this 4 years ago.
@jitwit The results are great! You are unstoppable :-)
`scc` is quite an important algorithm with many users, so perhaps we should have it in our regression suite to keep an eye on its performance. @nobrakal Do you think you could add it?
P.S.: I'll need some time to review the implementation due to upcoming travel.
I was actually going to ask about how to add things to the regression suite, since it would be useful for `bfs` too (using a different queue structure might be an opportunity for slightly better performance).
Here is where the performance testing script comes from:
https://github.com/snowleopard/alga/blob/35465faff04f0f2a608c5761f62cc65dee122b71/.travis.yml#L105
I guess we could move it to this repository to make changes more convenient.
@nobrakal What do you think?
@snowleopard I am actually ashamed of this script, and currently I think rewriting a very simple benchmarking suite and getting rid of https://github.com/haskell-perf/graphs would be the best option. I would love to give haskell-perf/graphs the rewriting it needs, but I don't have any time to do it now.
Anyway, adding `scc` seems very feasible and I can give it a try if you think it is the best idea :)
@nobrakal It's always possible to improve things but it's not always necessary :-) The current implementation may not be pretty, but the script does work and is very useful!
If you could add `scc` (and perhaps `bfs` too) that would be great. And, of course, I'd be happy to have a prettier script when you find time for this :)
@jitwit There are some merge conflicts, please resolve.
I re-ran the benchmarks and the results were less favorable in some cases for AdjacencyMaps. The graphs where the current version of the new implementation does poorly are those with a high number of small SCCs; the graphs where it does well have a small number of large SCCs. The AdjacencyIntMap version is much faster regardless.
The link to the criterion benchmarks from before has the current results.
@jitwit I'm looking at this link, which corresponds to `AdjacencyIntMap` I think:
https://jitwit.github.io/criterion/scc-bench.html
This is better across all benchmarks, but can you show the results for `AdjacencyMap`? I'd like to have more performance info about the problematic cases you mention.
By the way, the performance regression suite reports `scc: 1.10 (OK)`.
Yeah those names are unclear, I changed them to KL-alga, AM-alga, AIM-alga.
I've looked closer, and the `scc` algorithm does perform well; the issue seems to be converting to the `AM (NonEmptyAM a)` representation afterwards. When there are many SCCs, using `induce` is slow. When there are few SCCs, classifying edges (another approach I'm experimenting with) is slow.
The result of the `scc` search contains some additional information, so I was wondering if a hybrid approach might be feasible. For example, depending on the ratio of the number of SCCs to the number of vertices, different functions could be used to convert to the final representation.
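The dispatch described here could look something like the following sketch. The names and the threshold are hypothetical illustrations, not code from the PR:

```haskell
-- Hypothetical sketch of a hybrid dispatch: pick a conversion strategy
-- from the ratio of SCC count to vertex count. The threshold (1/2) is
-- made up for illustration; a real cutoff would need benchmarking.
data Strategy = UseInduce | ClassifyEdges deriving (Eq, Show)

chooseStrategy :: Int -> Int -> Strategy
chooseStrategy numSccs numVertices
  | 2 * numSccs >= numVertices = ClassifyEdges -- many small SCCs: induce is slow
  | otherwise                  = UseInduce     -- few large SCCs: classifying edges is slow
```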
The benches are updated with the "hybrid" approach.
P.S. Apologies for the messy code, I was aiming to see if the approach would even pay off! P.P.S. According to the regression suite, apparently not, but the benchmarks I ran on larger graphs look promising.
I tried the searches on a twitter graph from SNAP: https://snap.stanford.edu/data/ego-Twitter.html (~81k vertices, ~1.7m edges) and the results are good:
@jitwit Very impressive, especially the Twitter benchmark! I think it's worth adding it to the benchmark suite at https://github.com/haskell-perf/graphs.
I'm concerned with the complexity of the resulting code though :( How much improvement is your hybrid approach bringing percentage-wise? If it's not too much then I think I'd prefer to remove the special tricks and instead focus on making a faster `induce`, if that's possible.
The number of passes over the graph with `induce` increases with the number of SCCs. I tried the algorithm with `induce` on the twitter graph and it was something like 5x slower. I think the right solution may be to ditch the hybrid approach as well as `induce`. Something like

```haskell
partition :: Eq k => (a -> k) -> Graph a -> [Graph a]
```

is sort of what is needed.
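To illustrate why a single-pass grouping could beat repeated `induce` calls, here is a toy sketch over plain vertex lists (the name is illustrative, not the proposed API; note it needs `Ord k` for the map, slightly stronger than the `Eq k` in the signature above):

```haskell
import qualified Data.Map.Strict as Map

-- Illustrative sketch only: group vertices by a classifier key in one
-- pass over the input, instead of one full induce traversal per
-- component key.
partitionVertices :: Ord k => (a -> k) -> [a] -> Map.Map k [a]
partitionVertices f = Map.fromListWith (++) . map (\x -> (f x, [x]))
```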
There are still a few cases where the current approach is slower than the old implementation, but for the most part it's faster.
I was also wondering how important is the `AM (NonEmptyAM a)` representation? What queries are most commonly asked of the result of `scc`?
Ach, `scc $ vertices xs` isn't handled properly; I'll fix it soon (and will also add the test case so that it always hits).
@jitwit

> The number of passes over the graph with `induce` increases with the number of SCCs. I tried the algorithm with `induce` on the twitter graph and it was something like 5x slower.
Oh, indeed, I didn't realise the current implementation is quadratic due to multiple calls to `induce`. This is certainly a bug! Can we call `induce` only once? I'm fine to introduce another primitive if it helps to solve this issue, although I'm not yet sure how `partition` helps. I think we need a way to induce a two-level structure in a graph without paying the cost of calling `induce` multiple times.
> I was also wondering how important is the `AM (NonEmptyAM a)` representation?
I think the ideal type for `scc` is

```haskell
scc :: Ord a => AdjacencyMap a -> Acyclic.AdjacencyMap (NonEmpty.AdjacencyMap a)
```
I love the ability to express the precise structure of `scc` in types and I hope we'll switch to this type at some point when we have a solid story for acyclic graphs. For now, let's try to stick to `AM (NonEmptyAM a)`: I think it's worth spending some time thinking about how to fit an efficient engine behind a nice shiny API :)
Perhaps we need the following variant of `partition`?

```haskell
partition :: Eq k => (a -> k) -> Graph a -> Graph (NonEmpty.Graph a)
```
Hi, sorry for the long pause!
Yeah, I'm trying to come up with a reasonable approach. The `dfs g` builds a mapping `scc :: V -> Int` which assigns vertices to components. I guess my main point with `partition` was that we're classifying over a larger type than `Bool` with `induce`.
My best shot so far is to do a `Map.foldrWithKey` fold to classify the edges and vertices, followed by `gmap`ing by component id. Each edge `(u,v)` of `g` will either be included as an edge `(scc u, scc v)` between components, or as an edge `(u,v)` within the component `scc u`. (I think a `foldg` would be useful, but it's in `Algebra.Graph.ToGraph`, which imports the algorithms modules.)
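The edge classification just described can be sketched in isolation as follows. This is a toy illustration, not the code in the PR; `comp` stands in for the `scc :: V -> Int` mapping built by the search:

```haskell
import qualified Data.Map.Strict as Map

-- Toy sketch of the edge classification described above. comp maps each
-- vertex to its component id. An edge is "inner" when both endpoints
-- share a component, and "outer" (an edge of the condensation) otherwise.
classifyEdge :: Ord a => Map.Map a Int -> (a, a) -> Either (Int, (a, a)) (Int, Int)
classifyEdge comp (u, v)
  | cu == cv  = Left (cu, (u, v)) -- inner edge, kept inside component cu
  | otherwise = Right (cu, cv)    -- outer edge between components
  where
    cu = comp Map.! u
    cv = comp Map.! v
```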
The current iteration does comparatively well on most graphs. The exception is graphs with only a few SCCs.
https://jitwit.github.io/benchmarks/criterion/scc-bench.html
The `induce` approach did rather well on the graphs where the current approach does poorly, which was why I thought it may be best to still use it for some graphs. Being able to detect the case of a single SCC gives a rather large speedup to the current implementation (where we can just return `AM.vertex g`).
P.S. Updated results for social network graphs, similar to before:
https://jitwit.github.io/benchmarks/criterion/rw-scc-bench.html
I think the main awkwardness is that having `NonEmpty.AM a`s as vertices means that manipulating the outer `Acyclic.AM` map of the condensation is expensive. Returning a `Graph (NonEmpty.AM a)` would probably be faster? I haven't tried it yet, though.
@jitwit Good to have you back! Sorry it took me a while to take a careful look at the recent changes.
> The current iteration does comparatively well on most graphs. The exception is graphs with only a few SCCs.
>
> Being able to detect the case of 1 SCC gives a rather large speedup.
My intuition tells me that the problem is not a small number of SCCs, but the large size of some SCCs. Am I right? As an example, consider a graph with 1000 vertices, with one large component (of size 900) and 100 small components (of size 1). I'm pretty sure that in this case the performance will still be low even though there are many SCCs. Could you double-check?
> I think the main awkwardness is that having `NonEmpty.AM a`s as vertices means that manipulating the outer `Acyclic.AM` map of the condensation is expensive.
Indeed, there is an issue about this: #141.
> Returning a `Graph (NonEmpty.AM a)` would probably be faster?
I wouldn't be surprised; however, the resulting `Graph` will need to be converted to an adjacency map at some point, which brings us back to the problem.
@jitwit Anyway, the benchmarking results are pretty good already! I wouldn't be strongly opposed to merging as is: I think we gain more than we sacrifice in terms of performance. However, if there is a way to speed up the remaining problematic cases that would be great.
Your intuition seems to be right given the example `vertices [0..1000] + circuit [1001..10000]`. I'm not sure how to interpret the results for the same graph overlaid with `[ edge x (x+1000) | x <- [0..1000] ]`, though. The goal of the second was to see what happens when there are also many edges between components.
https://jitwit.github.io/benchmarks/criterion/scc_misc.html
I went back and looked at the worst benchmark's graph, which has 10000 vertices, 9999 of which are in the same strongly connected component.
> I wouldn't be surprised, however, the resulting Graph will need to be converted to an adjacency map at some point, which brings us back to the problem.
Makes sense!
@jitwit For the problematic case, could you isolate the computationally expensive part? As far as I understand, the traversal of the original graph is not overly expensive; the main cost is the final step, where we construct the resulting graph with heavy vertices. Is that right?
Returning to the `partition` function:

```haskell
partition :: Eq k => (a -> k) -> Graph a -> Graph (NonEmpty.Graph a)
```

Do you think a fast implementation of this function would allow us to speed up the problematic case?
I looked at a few of the graphs under `+RTS -p` (twitter, worst-case nodes10000, and n5). For each, the most expensive part of the search computation is the `overlays` inside the SCCs, the second most expensive is traversing the graph with `Map.foldrWithKey`, and the third is the dfs search.
`overlays` (inside SCCs, not between) tends to take around 20-30+% of the overall computation; the worst case, nodes10000, was above 30%. Less than half of the time is spent in dfs for each graph, and searching is light on allocations.
Building a list of vertices and edges and calling overlays afterwards was a fairly big win over building the graph by inserting each vertex and edge.
After looking into this, I realized difference lists might help. There's a modest improvement from switching to `Endo [AM a]` while condensing, but the effect wasn't as large as I hoped (around 0-5% improvement depending on the graph).
Full profiling results:
https://raw.githubusercontent.com/jitwit/bench-alga/master/nodes10000.prof https://raw.githubusercontent.com/jitwit/bench-alga/master/n5.prof https://raw.githubusercontent.com/jitwit/bench-alga/master/twitter.prof
These are the result of calling (more or less) `print . AM.vertexCount . scc =<< read_graph "file"`. Reading, done via `unfoldr` and `ByteString.readInt`, accounted for around 1/10 of the whole computation.
> the main cost is the final step, where we construct the resulting graph with heavy vertices. Is that right?
Forgot to respond to questions. Yeah, it looks like this is right.
Since a good portion of the time is spent in `overlays`, I think it may be difficult to improve much on the current approach. One idea I could try is building the `Endo [AM a]`s with Arrays instead of IntMaps.*
A good `partition` could be a way to improve, but I'm not sure how to implement it or how specifically it would help without further thought.
* Actually, come to think of it, this wouldn't change much in the problematic case; it could be a general improvement, though.
Though within the range of noise, the last two CI benches show k = 0.92 and 0.93 for `scc` -- likely thanks to difference lists!
@jitwit Thank you for further profiling!
> A good `partition` could be a way to improve, but I'm not sure how to implement it or how specifically it would help without further thought.
I had some ideas, let me try to remember and write them down.
> actually come to think of it, this wouldn't change much in the problematic case, it could be a general improvement though.
Are you sure? I think we agreed that the problematic case is where we have components of large size, which probably means that the current implementation is non-linear in the size of components, i.e. instead of being O(|s1|+|s2|+...+|sK|), it is something like O(|s1|^2 + |s2|^2 + ... + |sK|^2), perhaps with some log factors here and there. My hope is to come up with a faster `partition` function that will not suffer from such non-linearity, which would mostly improve the problematic cases.
I conjecture that in non-problematic cases, `partition` (or whatever we currently have in its place) does not take a large proportion of the time.
> Though within range of noise, the last two CI benches show k = 0.92 and 0.93 for scc -- likely thanks to difference lists!
Nice! Note that we have a basic implementation of difference lists:
http://hackage.haskell.org/package/algebraic-graphs-0.4/docs/Algebra-Graph-Internal.html#t:List
Perhaps that might help make the code a bit easier to read. Could you reuse it?
I hadn't realized there was the internal implementation -- done! It's definitely clearer than writing `Endo $ (++) xs`, though it does seem to require some type annotations. Is it possible to get rid of them?
Also, right! I only had the specific `nodes10000` benchmark in mind when I made that comment.
I figured out a way to classify the edges inside the dfs search. The benchmarks for most graphs are improved.
The first idea is that for an edge `(u,v)`, if `v` gets classified as part of an SCC either before or during the visit to `u`, then `(u,v)` is an outer edge in the condensation. Otherwise it's an inner edge.
One complication is knowing which components the inner edges belong to. The second idea is that the preorder number can be used to track which queued inner edges should be added to a scc (like it is for vertices).
It doesn't bridge the gap in the worst-case benchmark, but it does decrease it. Before, the result was usually something like 220ms for the current implementation vs 340ms for the new one -- now I'm getting something like 220ms vs 315ms.
Twitter graph is from 10.4/6.9s to 8.8/4.4s for am/aim
After the re-organizing, twitter down to 7.6 and 4.2 s!
Also, to clarify why whether or not `dfs v` starts a new component determines if edge `(u,v)` is inner or outer: it's because SCCs are selected based on the preorder number. On back edges, the boundary stack is popped to put the vertex with the lowest possible preorder number on top. If that's not `v`, then `u` is part of the same strongly connected component. If it is `v`, `u` is in a different one.
> Twitter graph is from 10.4/6.9s to 8.8/4.4s for am/aim
>
> After the re-organizing, twitter down to 7.6 and 4.2 s!
Whoa, that's great! Our CI performance suite also shows a modest improvement.
> It doesn't bridge the gap in the worst case benchmark, but it does decrease it. Before, the result was usually something like 220ms for current implementation vs 340 for new -- now I'm getting something like 220ms to 315ms.
Is there any hope you can improve it further?
Sorry, I didn't have time to think about `partition`. I did look at the new implementation though and got lost pretty quickly :( The `StateSCC` datatype is scary. It would be nice to document it, perhaps by adding a large note in a single comment explaining what's going on. (Unless the current state is just work in progress.)
I started to add some comments to document the `StateSCC` data type. I opted to put them inline with the fields in the type, but if it's preferable I can rearrange them to make the comments contiguous?
Would it be helpful to document more of the inner functions of the algorithm as well?
> Is there any hope you can improve it further?
Possibly, though I think the current version is potentially near a local maximum (unless I've missed something silly). I could still try the idea of using an array to tabulate the inner graphs, but that would complicate the type of the state monad. Otherwise, coming up with a way to identify which situations should use `induce` to build the condensation (like in the last implementation) could do the trick. I don't know what reasonable heuristics would be, though.
> (Unless the current state is just work in progress.)
I'm not sure if any of the fields can be eliminated without giving up some of the performance wins. The boundary stack stores vertices with their preorder ids to reduce map lookups, and the last three fields are used for constructing the condensation. The remaining fields are core to Gabow's algorithm -- ints for giving vertices ids, tables for looking up the ids, and the two stacks.
@jitwit Many thanks! I'll give the current version a thorough review soon.
@jitwit Happy New Year! I reviewed some code, but I can't say I fully understand the implementation yet. Sorry for taking so long.
Given the complexity of the new implementation, I think we need to add a test comparing the new `scc` with the standard `KL.scc`. This would give me a lot more confidence, at least until I fully internalise the code.
Hey, happy new year! Yeah, I wish the code was less convoluted. I'll try to find ways to make things a bit cleaner. Also, adding tests to compare against `KL.scc` sounds good.
I sketched together some notes about the implementation. Hopefully they can help clarify the approach or highlight things needing improvement:
@jitwit Excellent notes! Thanks for doing this. I only got lost a bit where three vertices `u`, `v` and `w` are involved -- in those cases a diagram would help a lot :-)
Please link to your doc from the comments.
Forgot to add links, will do in the next pass! I also made a diagram showing a pass of the algorithm on `3*1*4*1*5`:
https://raw.githubusercontent.com/jitwit/alga-notes/master/images/gabow.png
Gray means unvisited, red visited, and blue assigned to a component. Vertices are labelled with a (vertex, preorder, scc) triple. Edges are coloured green or black according to whether they were found to be an edge of the condensation.
@jitwit Thanks! The diagram does help.
I've made another pass with comments.
I also added a CPP statement to get rid of the redundant `Data.Monoid` import warning for post-8.2 GHC.
> I also added a CPP statement to get rid of the redundant `Data.Monoid` import warning for post-8.2 GHC.
I think you can just import `Data.Semigroup ((<>))` instead, getting rid of CPP. I dropped support for GHC < 8 recently, hoping to eliminate most occurrences of CPP.
As long as there are no warnings in GHC 8.6.5, I'm happy!
@jitwit I think I've noticed a few more places where we can save some time. Could you please have a look?
Much better results!
https://jitwit.github.io/benchmarks/criterion/scc-bench.html
Just building the inner graphs (no difference lists) makes a big difference.
@jitwit Excellent! And the code got simpler too. Looks like we're mostly beating KL now, with only one scenario where we're slightly behind. I think we can tolerate this.
I think we can merge the PR now unless you'd like to do any further improvements.
On larger graphs the performance gain is even bigger. An acyclic facebook network graph gets 200ms vs 600ms (4000 vertices, 90000 edges) and the twitter one gets 6.4s vs 60s (80000 vertices, 1,700,000 edges).
The part I'd maybe want to improve is the incomplete pattern in `(curr,_:pth') = span (/=v) pathStack`. Any preference for how -- an extra helper, using `tail`, or something else?
> 200ms to 600ms [...] and 6.4s to 60s
@jitwit This is pretty cool! I guess when you're benchmarking KL you are including the overhead of converting an adjacency map to an array representation. What happens if you start with a graph stored in KL-style array? Is your implementation still significantly faster?
I don't really mind `(curr,_:pth') = span (/=v) pth`. I think going via an intermediate function would just obscure the invariant. One slight improvement is perhaps `(curr,_v:pth') = span (/=v) pth`, i.e. giving the unused binding a name.
Implementation of the scc algorithm for adjacency maps and adjacency int maps, based on https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm.
It avoids using Data.Graph, and benchmarks indicate that this pays off: https://jitwit.github.io/criterion/scc-bench.html. In the report, old-alga corresponds to alga's KL implementation, new-alga to the intmap implementation, and ord-alga to the new implementation for AMs. The graphs used are the real-world ones from haskell-perf/graphs. The AM implementation seems to be around 10% faster and the AIM version around 65% faster.
There is no NonEmpty.AdjacencyIntMap module, so for now AdjacencyIntMap.Algorithm returns type `AM.AdjacencyMap AdjacencyIntMap`.