haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers
318 stars 178 forks source link

Data.Graph: detect cycles utility functions #978

Open gwern opened 11 months ago

gwern commented 11 months ago

It would be helpful if Data.Graph provided utility functions for detecting cycles in graphs, which may be problematic and represent infinite loops. (As there is no standard 'utility' module I see for Data.Graph, it would be a helper function for the SCC part.)


I use some sets of rewrite rules like 'A->B', where, due to changes elsewhere and updates over many years, they can inadvertently wind up defining a non-obvious cycle like 'A->B->C->A' which would loop infinitely. These are nasty surprises when they surface, so I look into detecting cycles in the graph defined by sets of rewrite rules. This has gotten me some utility functions of the form:

isCycleLess :: (Eq a, Ord a, Show a) => [(a,a)] -> [(a,a)]
isCycleLess xs = if not (cycleExists xs) then xs else error "Association list of rewrite-rules has cycles! Errors related to:" (show $ findCycles xs)

cycleExists :: Ord a => [(a, a)] -> Bool
cycleExists tuples = any (uncurry (==)) tuples ||
    -- There's a cycle if one of the strongly connected components has more than one node
    any ((> 1) . length . flattenSCC)
       -- Generate strongly connected components from edges
       (stronglyConnComp $
        -- Create edges by converting a tuple (a, b) to (a, b, [b]) to reflect a -> b
        map (\(a, b) -> (a, a, [b])) tuples)

-- *Which* rewrite rules are responsible for an infinite loop? Here's one way to find bad nodes easily (albeit inefficiently):
-- start with the list of rewrites and two empty temporary lists;
-- from the rewrite list, take & incrementally add rules to the first list if they do not create a cycle in the first list;
-- if they do, add them to the second list instead (otherwise ignoring the second list);
-- when all rules are used up, return the second list. Those are the bad rules.
findCycles :: Ord a => [(a, a)] -> [(a, a)]
findCycles xs = snd $ foldl f ([], []) xs
  where
    f (good, bad) rule
      | cycleExists (rule : good) = (good, rule : bad)
      | otherwise = (rule : good, bad)

-- `cycleExists` testsuite:
testCycleExists :: [([(Int,Int)], Bool)] -> [[(Int,Int)]]
testCycleExists testCases = [ rules | (rules, expected) <- testCases, cycleExists rules /= expected]
testCycleDetection :: [[(Int,Int)]]
testCycleDetection = testCycleExists testCases
 where testCases :: [([(Int, Int)], Bool)]
       testCases = [ ([], False) -- no rules, no cycles
            , ([(1, 2)], False) -- one rule, no cycles
            , ([(1, 1)], True), ([(1, 2), (2, 3), (3, 4), (5, 5)], True), ([(1, 2), (2, 3), (4, 4), (5, 6)], True) -- self loop
            , ([(1, 2), (2, 3), (3, 4)], False) -- rules with no cycles
            , ([(1, 2), (2, 1)], True) -- simple cycle
            , ([(1, 2), (2, 3), (3, 1)], True) -- cycle with more than 2 nodes: where there is a cycle of nodes that all point to one another, but no node points to itself
            , ([(1, 2), (2, 3), (3, 4), (4, 1)], True) -- larger cycle
            , ([(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)], True) -- Multiple disjoint cycles within a larger rule set
            , ([(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)], False)
            , ([(1, 2), (2, 3), (4, 5), (5, 6)], False) -- separate set of rules, no cycles
            , ([(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4)], True) -- separate set of rules with cycles
            , ([(1, 2), (2, 3), (3, 2), (4, 5), (5, 4)], True) -- there is a cycle within subset of rules
            , ([(1, 2), (3, 4), (5, 6)], False) -- separate set of rules, no cycles
            , ([(1, 2), (1, 2), (2, 3), (2, 3)], False) -- repetition
            , ([(1, 2), (1, 3), (2, 4), (3, 4)], False) -- Multiple paths to the same node, but no cycles
            , ([(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)], True) -- where there are multiple paths leading to a node that is part of a cycle.
            , ([(1, 1), (2, 2), (3, 3)], True) --where every node in the list points to itself (simple loop for every node)
            ]

Which turned up plenty of latent infinite loops, and thus far, has not missed any new infinite loops in the months I've had it.

It's not exactly obvious how you'd turn your convenient little rewrite rule list into a proper graph cycle-detection when you've never used Data.Graph before, even if the resulting code using stronglyConnComp+flattenSCC is pretty short and doubtless seems trivial to a Data.Graph savant, so I think it'd be nice to wrap that up as a utility function of some sort like cycles :: [vertex] -> [[vertex]], which would return a list of cycles (where the sublist is length > 1).

This would then be very easy to check graphs for issues and support other use-cases where a list of cycles might be useful.

(The helper functions isCycleLess and findCycles might not be useful enough to include, but the test-suite should probably be kept in some form to avoid regressions and define expected behavior.)

meooow25 commented 11 months ago

It's not clear to me what exactly should be offered. I don't follow what the input of your suggested cycles :: [vertex] -> [[vertex]] would be. If by "detecting cycles in graphs" you mean check whether a graph has a cycle, then that would be Graph -> Bool and can be done more efficiently than going through SCC. I wouldn't mind seeing that (as another user, I'm not a maintainer). Listing cycles (Graph -> [[Vertex]]) would likely be of limited use since there can be an exponential number of them. Finding some cycle (Graph -> Maybe [Vertex]) might be another useful function.

However, your use case seems to be a specific case of graphs where every vertex has at most one outgoing edge (directed pseudoforest), since you wouldn't have two rules A->B, A->C.
For such a graph it is straightforward to list the cycles (though I'm unsure if this is considered an implementation detail because the documentation of stronglyConnComp doesn't specify the order of vertices in an SCC).

cycles :: Ord a => [(a,a)] -> [[a]]
cycles edges = [xs | CyclicSCC xs <- stronglyConnComp (map (\(a, b) -> (a, a, [b])) edges)]
-- cycles [(1,2),(2,3),(3,1),(4,5),(5,6),(6,5)] = [[5,6],[1,2,3]]
-- [5,6] means the cycle is 5->6->5
-- [1,2,3] means the cycle is 1->2->3->1

Also, your findCycles picks only the edge which completes a cycle as "bad", though removing any edge in the cycle would get rid of it. Maybe this is not a problem, but I thought I would mention it.

gwern commented 11 months ago

Listing cycles (Graph -> [[Vertex]]) would likely be of limited use since there can be an exponential number of them.

My thinking there was that it's easy to turn the [[Vertex]] into a Bool, but not vice-versa: if you simply want the Bool, you can define it trivially by the list being non-empty (and then laziness should make it about as efficient, since you will stop calculating as soon as the first sublist starts evaluating, and also the possible exponential count is not a big deal if it's generated lazily?). And base libraries should prefer general utility functions which can easily be specialized to highly-specialized ones. So that's why the [[Vertex]] suggestion seemed more Haskell-y to me. If you don't agree and think any utility function should be the Bool, that's fine. I'm just trying to make Data.Graph more useful for other Haskellers.

However, your use case seems to be a specific case of graphs where every vertex has at most one outgoing edge (directed pseudoforest), since you wouldn't have two rules A->B, A->C.

Yes, in most of these cases I think I would regard two such rules as ambiguous and bad: either B or C is the right target, it can't be both. I suppose I shouldn't be surprised that has a name.

But I wouldn't want to assume that all my graphs are in fact true directed pseudoforests because I'm quite sure that they are not (for the same sorts of practical reasons that I need to check for cycles in the first place!) and I would need some way to check & fix them before I could replace my cycle detector with a pseudoforest-assuming checker.

Also, your findCycles picks only the edge which completes a cycle as "bad", though removing any edge in the cycle would get rid of it. Maybe this is not a problem, but I thought I would mention it.

I don't think it winds up being an issue for me since I use findCycles to list the problematic cycles and then fix by hand.

It may help to give one example of how I use this: I need Wikipedia links on my website to be consistent for various reasons, and avoid WP redirects, so I have a list of URL1->URL2 rewrites; so I might write in an essay a link to the Wikipedia article for J. R. R. Tolkien, which redirects to the actual WP article John Ronald Reuel Tolkien; linkchecking tools eventually report this and define that URL rewrite; so far so good; but 10 years later*, the WP editors finish debating the naming and conclude that per guidelines, it'd better be at the most famous name, and they move John Ronald Reuel Tolkien back to J. R. R. Tolkien; my script detects this a year later and that gets added to the list... but now it's circular. The graph cycle detector reports a loop John Ronald Reuel Tolkien <->J. R. R. Tolkien, I look, see it's not at Reuel any more, and fix the rules. To me it doesn't matter which edge is flagged as 'bad', since I'm going to go check the URLs anyway to see what's going on. Sometimes articles get split apart, sections get renamed or deleted, the quality badly degraded, or deleted outright. So I don't care too much about the details. Trying to automatically repair the flagged cycles would take far more time & complexity than it's worth.

* I have been using these Wikipedia links in my writings since ~2009, and I set up the redirect-rewrites like 2 years ago, and I intend to use them indefinitely.