loot / libloot

A C++ library for accessing LOOT's metadata and sorting functionality.
GNU General Public License v3.0

Rework how group edges are added #90

Open Ortham opened 1 year ago

Ortham commented 1 year ago

I've been making a lot of changes to how sorting works in order to speed it up, and so far sorting using the latest abi-changes branch build is about 290x faster than sorting using the v0.18.2 release (0.18.3 is about 5x faster). Most of the changes can cause libloot to produce different load orders in ways that shouldn't cause any problems, but I like to avoid doing that often, just in case.

Group handling has changed to track whether or not a group edge involves user metadata, and its vertex lookup has been optimised, but the general approach it takes hasn't changed, and that's something I want to revisit for two reasons:

  1. Complexity of the current approach
  2. Potential performance improvements

Now's a good time to do this, because changes will probably affect the resulting load order, and so it would be good to lump them in with all the others that change the load orders that libloot produces.

Complexity

The current implementation for adding group edges grew organically after originally being implemented as an extension of "load after" metadata handling, and I think the code is pretty hard to understand and overcomplicated.

Also, unlike the new tie-breaking algorithm, which tries to enforce the old load order and otherwise deviate from it minimally, the group edge handling has no real goal that can help explain why it adds edges in the order it does, or why it skips the edges it skips, so it's hard to understand why it does what it does.

Performance

Following on from loot/loot#1370, adding edges between overlapping plugins is now usually the slowest part of sorting, but checking for overlaps is relatively fast. I tested separating the checking from the adding of edges: with my test load order, the overlap section takes 11 seconds, less than 1 second of which is checking for overlaps. The slow part is checking for paths going in the other direction to avoid cycles.

Checking for paths can be sped up by reducing the number of edges in the graph. That can be done by not adding edges where there's already a path in the same direction, but that involves checking for a path and my testing suggests it doesn't do enough to counter its own performance hit. The other way to reduce edges is to try adding fewer edges to begin with. Edges overwhelmingly come from two sources: overlaps, and groups.

I can't see how to try adding fewer overlap edges, but the way group edges are added at the moment seems obviously non-optimal: each plugin gets an edge added to every plugin in every group that transitively loads before the group the plugin is in. In many of those cases, that's unnecessary, because the plugins it transitively loads after will also have edges to the plugins they load before. If no edges were skipped and all groups had plugins, you'd only need to add edges between a plugin and the plugins in the groups that its group directly loads after.

Ortham commented 1 year ago

I've come up with an approach that I think is conceptually simpler and results in fewer edges being added. The algorithm is:

  1. Create a graph of the groups, with edges going from groups that load earlier to groups that load later.
  2. Create a set for storing groups that have already been processed/finished.
  3. For each root vertex (i.e. each group that doesn't load after any other group), perform a DFS to visit all the groups that load after it:
    1. Create a stack of buffers for holding plugins at different levels of the DFS.
    2. Create a stack of groups for holding the groups encountered at different levels of the DFS.
    3. On examining a new out-edge (a tree edge, i.e. before the vertex at the other end has been discovered):
      1. Add the target group to the group stack.
      2. Find all the plugins in the target group.
      3. Create a new buffer for plugins, but don't add it to the stack yet.
      4. For each plugin in the last buffer on the stack:
        1. For each target plugin:
          1. Add an edge going from the buffered plugin to the target plugin, unless adding the edge would cause a cycle.
        2. If there are no target plugins, or if one or more edges were skipped (because they would cause cycles), add the buffered plugin to the new buffer (to carry it forwards to the next target group to be processed, if this target group has any successors).
      5. Find all the plugins in the source group.
      6. For each source plugin:
        1. Do the same as for the plugins in the last buffer on the stack above.
      7. Add the new buffer to the buffer stack.
    4. Once a vertex has been finished (i.e. all its out-edges have been processed):
      1. Pop the last group off the group stack.
      2. Pop the last buffer off the buffer stack.

This approach adds edges only between plugins in groups that are directly connected, unless that's not possible, and then the edge is added between the preceding group's plugin and the closest successor that an edge can be added to without causing a cycle. If a target group contains no plugins, this effectively propagates all the source's plugins as sources for the target group's targets.

The set for storing groups is simply the DFS colour map, but it's shared between DFSes: if one DFS has already added edges going from a group's plugins, there's no point doing the same when that group is encountered from a different starting point in another DFS, as that won't add anything new.

The group stack is used when adding an edge between two plugins to see if the path between the plugins' groups involves any user metadata, so that the edge type can be set appropriately. The type also takes into account if either plugin is part of its group due to user metadata.

I initially had the logic buffering a plugin when all of its edges to the target group plugins were skipped, but changed that to when any of them were skipped because the former causes successor edges to get skipped incorrectly. For example, if we've got three groups A -> B -> C with plugins A1, A2, B1, B2, C1, C2 and existing edges B1 -> A1 and C1 -> B2, buffering on all skipped causes the path between A1 and C1 to be undefined, while buffering on any skipped causes the edge A1 -> C1 to be added. It does mean there are more redundant edges added, but too many is better than too few.

Ortham commented 1 year ago

I've pushed an implementation of the approach described above; it's in the reimpl-group-edges branch.

Unfortunately, since writing it I've realised that the lack of special handling for the default group causes problems when a plugin in the default group needs to load after a plugin in a later group.

My solution to this is to pretend the default group is empty whenever it's the current source group during the DFSes that run from the earliest groups to the latest (i.e. you can add edges going to default group plugins, but not coming from them), and then to run one last DFS from the default group towards the latest groups.

I haven't implemented or tested this solution yet though.


I'm also having doubts about sharing the colour map between the existing DFSes. If two root groups have edges to the same group, the second DFS may have plugins with skipped edges that should be carried forwards past the shared group, but they won't be, because the DFS will stop at that group. However, there's also no point adding edges going from the plugins of the shared group or of later groups, as those will already have been added. Maybe that could be optimised by using a separate shared "processed" map that doesn't stop the DFS from continuing into those groups, but causes their plugins to be ignored when they're the source plugins for the current edge, so that only plugins from before the shared group have edges added going from them.

Ortham commented 1 year ago

I've pushed a few more commits that stop sharing the colour map and implement the two-stage default group handling.

It turns out that sharing the colour map was a mostly pointless optimisation: I'd forgotten that adding an edge first checks whether an edge already exists between the same plugins, so a shared set of processed vertices would mostly just save looping over plugins to find that those edges already exist. I've left a TODO comment to add the shared set, but I might just not bother, since it would involve fewer checks without reducing the number of edges added, and adding group edges is already very fast.

I'm much happier with the results when sorting my 1619-plugin load order: when I had a single pass of DFSes, the resulting load order had a lot of differences compared to the old logic, but with the two passes there are now only 12 plugins moved, in 6 blocks, and they all appear to be due to harmless tie-break differences.

With the old logic, sorting added 75988 edges, 48726 of which were group edges. With the new logic, sorting added 52682 edges, 26029 of which were group edges. Despite that ~31% drop in total edges (and ~47% drop in group edges), sorting was only ~1.5 s faster (from a 30 s sort). That's disappointing, but these are good results so far.

Ortham commented 1 year ago

I came up with a few more cases that the new code couldn't handle well, so I've stripped out the check to see if a plugin could add all its edges to the plugins in the next group, so now all plugins are buffered. This means that similar to the old approach, libloot will try to add edges from a plugin to all the plugins in its transitive successor groups.

The result is that sorting now adds slightly more group edges than the old approach, though the overall number of edges is slightly lower, and sorting speed is back to ~30 s.

On the positive side, the logic is now even easier to explain, the sorting result is identical to what the old approach gives for my test load order, and it does a better job of avoiding cycles: previously it was possible to cause a cycle using groups, but such cycles are now always resolved, and for all the cases I've thought of it picks the most sensible load order (though sometimes there's no obvious best order). I've also now got a bunch of tests written to cover all those problematic cases. Hopefully I can use them to help reintroduce some working optimisations, but even if I don't get there, the current state seems like a strict improvement.

I've also jettisoned the idea of sharing the colour map since it's such a minor potential improvement and I now realise I've got bigger fish to fry.

Ortham commented 1 year ago

Unfortunately the new logic is much slower; here are some benchmarks:

Group plugin sorting performance

All tests with 1619 plugins and a 77 kB userlist that adds 9 groups and plugins into those groups.

LOOT 0.18.6 + libloot 0.18.3

[08:45:48.864497] [trace]: Adding edges based on plugin group memberships... +32s, +895485 edges
[08:46:20.587801] [trace]: Checking plugin graph for cycles...

Failed with a cycle.

LOOT 0.19.0

[11:55:59.990147] [trace]: Adding edges based on plugin group memberships...
[11:56:00.001850] [trace]: Checking plugin graph for cycles...
[11:56:04.190361] [trace]: Adding edges based on plugin group memberships... +16s, +16239 masterlist edges, +879404 user edges
[11:56:20.189313] [trace]: Checking plugin graph for cycles...

Failed with the same cycle as libloot 0.18.3.

LOOT 09f3f0b + new groups impl

[17:23:33.235063] [debug]: Current load order:
[17:47:33.086743] [debug]: Calculated order:
= 1440s

LOOT 09f3f0b + new groups impl with finished vertex set

[18:04:31.778234] [debug]: Current load order:
[18:04:36.315655] [trace]: Adding edges based on plugin group memberships... +171s
[18:07:27.373580] [trace]: Adding edges for overlapping plugins...
[18:08:00.966720] [trace]: Adding edges to break ties between plugins...
[18:08:09.698349] [debug]: Calculated order:
= 209s