tobymao / 18xx

A platform for playing 18xx games online!
https://18xx.games
Other
274 stars 179 forks source link

Graph performance is very bad in worst case scenarios #10331

Open michaeljb opened 5 months ago

michaeljb commented 5 months ago

Discussed in the site chat this morning:

Can anyone please tell me what is wrong with the following. Trying to place tile 24 on B25 in game https://18xx.games/game/151565

it does let me when I clone that game though the game is running super slowly on my browser, takes forever to load

Thanks it is running slow here too and keeps refreshing and losing the tile

I repro'd the slowness but haven't had time to profile or do a cursory investigation of whether the slowness is the game processing or the rendering. (also need to see whether the issue is specific to 18GB)

Copied the game data to this gist

Maybe similar? #10123

michaeljb commented 4 months ago

Firefox profiling shows the slowness coming from computing the graph during rendering, starting from the connected_hexes call in tracker_available_hex:

https://github.com/tobymao/18xx/blob/3794ce0dd/lib/engine/step/tracker.rb#L494

From there it's a pretty deep stack of walking the graph

Screenshot 2024-02-17 092523

michaeljb commented 4 months ago

https://github.com/tobymao/18xx/blob/3794ce0ddcfbb174d74b1d3ef609e0a0ead633aa/lib/engine/step/tracker.rb#L494

https://github.com/tobymao/18xx/blob/3794ce0ddcfbb174d74b1d3ef609e0a0ead633aa/lib/engine/step/tracker.rb#L468

https://github.com/tobymao/18xx/blob/3794ce0ddcfbb174d74b1d3ef609e0a0ead633aa/lib/engine/graph.rb#L91

https://github.com/tobymao/18xx/blob/3794ce0ddcfbb174d74b1d3ef609e0a0ead633aa/lib/engine/graph.rb#L234-L252

https://github.com/tobymao/18xx/blob/3794ce0ddcfbb174d74b1d3ef609e0a0ead633aa/lib/engine/part/node.rb#L39

michaeljb commented 4 months ago

Adding a bunch of temporary puts statements to debug this, including some I've added dozens of times before, inspired me to finally make them permanent and set up a global logger for the game engine and views - #10350

If that PR was already merged and this problem was encountered, identifying it as a graph problem would be as simple as adding ?l=0 to the game URL and watching the console.

michaeljb commented 4 months ago

The problem is something to do with the deletion on the converging check from #7788, removing these lines lets my browser compute the graph (probably not correctly) in about 0.02 seconds instead of 55 seconds:

https://github.com/tobymao/18xx/blob/5df8d6f3e747faa236ae829063270a2359ea2e46/lib/engine/part/node.rb#L86

https://github.com/tobymao/18xx/blob/d1b2afa1b04cb5f5905d8009096afac307410f00/lib/engine/part/path.rb#L178

michaeljb commented 4 months ago

Still have some failing tests, but I think I've got some progress on my graph-converging branch - https://github.com/michaeljb/18xx.games/commits/graph-converging?since=2024-02-19

michaeljb commented 3 weeks ago

I have some real progress on my bfs-graph branch. https://github.com/michaeljb/18xx.games/commits/bfs-graph/

There's still a lot of work to make it to support the custom args on Engine::Graph and make it a drop-in replacement.

The key part that will fix the algorithm for these bad cases isn't switching from DFS to BFS, but instead of the converging_path logic, it tracks from which direction paths and nodes are added to the graph, i.e., once a path has been reached from both sides it can be skipped.

My graph also advances one function call at a time, instead of recursively traversing the whole graph. This enables visualizing the graph as it grows, and would be required for theoretical future features like persistent lazy graphs that don't get cleared, or bidirectional search for more efficient home-destination connection checking.

michaeljb commented 3 weeks ago

In game 151565 the tile lays at action 536 and 537 on F19 and G20 raise the number of total node.walk and path.walk calls needed to compute LNWR's graph from 445 to 2711.

at 535:

Screenshot 2024-06-22 185127

at 537:

Screenshot 2024-06-22 185134

While not adding any actual new connectivity for LNWR, these tile lays add a lot more converging exits, and they're first really big jump in the number of walk calls made.

As the game progresses the number of calls gets huge.

Action 650 upgrades Derby, adding another token slot to the city and unblocking LNWR. The walk calls here rise from 196,238 to 1,659,446. By the end of the game, it's 5,862,969, plus 1,117,663 calls that return quickly via guard statements.


I was hoping that finding the first big unnecessary jump in walk calls would make it easier to describe what exactly the problem in the current algorithm is, but the track network at that point in the game is already fairly complex. I can't be more specific than saying the converging_paths logic is buggy and leads to lengthy loops.

michaeljb commented 3 weeks ago

When a check is made to see if a corporation can run routes, route_info calls compute(routes_only: true). This still does a full DFS search, but only starting from one token. It could probably be optimized to return as soon as multiple nodes are found.

michaeljb commented 3 weeks ago

Action 650 upgrades Derby, adding another token slot to the city and unblocking LNWR. The walk calls here rise from 196,238 to 1,659,446. By the end of the game, it's 5,862,969, plus 1,117,663 calls that return quickly via guard statements.

It looks like these numbers are wrong, but the right numbers are still in the 100Ks if not millions. Looks like I was messing up the default argument when plumbing walk_calls through.

michaeljb commented 3 weeks ago

Action 650 upgrades Derby, adding another token slot to the city and unblocking LNWR. The walk calls here rise from 196,238 to 1,659,446. By the end of the game, it's 5,862,969, plus 1,117,663 calls that return quickly via guard statements.

It looks like these numbers are wrong, but the right numbers are still in the 100Ks if not millions. Looks like I was messing up the default argument when plumbing walk_calls through.

With #10951 I am actually getting the same endgame numbers at action 759 (LNWR's last turn of the game, in OR 8.1). Not sure why some of the earlier counts are different than what I was seeing before 🤷

michaeljb commented 3 weeks ago

Some good progress this weekend, I was able to configure 18GB to use my new adapter class instead of Engine::Graph here's screenshot of the side-by-side logs for loading up the problematic game to action 759:

Screenshot 2024-06-23 134642

I can't step through the game yet to see how much that performance has improved, getting some errors I haven't had time to fix yet, but it's certainly coming along nicely!

michaeljb commented 2 weeks ago

The real source of trouble appears to be in eliminating overlapping paths from connected_paths. If a path is overlapping (e.g., the pink path in the screenshot here), to see if it's legal, it needs to search for every possible route back to a token that avoids the path(s) it overlaps with, so if there is no such route it could be a very long process to confirm that.

Screenshot 2024-06-29 213332