[Idea] Use locking on tile objects to perform multithreading pathfinding and getSpectators

gunzino commented 5 years ago

Hello, I think about boosting TFS performance for years and as we all know everything (except networking) happens in one dispatcher thread. The most consuming functions are getSpectators and A path finding algorithm. Recently I am playing with locks in Boost library and there is a possible implementation of reader/writer locks which would perfectly suit this purpose. Since A and getSpectators ONLY reads information from tiles they may probably execute in separate threads asynchronously.

The implementation would be based on boost::shared_mutex. Since every tile is allocated on map loading it would include shared_mutex in the objects. Then all modifying functions (addThing etc.) would be edited to acquire the lock. Basically it may be based on example from there: https://stackoverflow.com/questions/989795/example-for-boost-shared-mutex-multiple-reads-one-write

Basically the tricky thing to solve would be to not perform write/read at one time. (So maybe notifying objects that something changed in some tile in some next dispatcher task).

Since there are more experienced developers than me I would like to consult this idea with you guys. If I get green light from you I may open PR and start to develop this.

gunzino commented 5 years ago

As I've checked code I would start with pathfinding since getSpectators needs some work on editing Tile::moveCreature.

djarek commented 5 years ago

Unfortunately, I believe that this kind of architecture would just lose all the parallelization gains in synchronization overhead. Not to mention potential issues with deadlocks because of how scripts interact with the engine code.

Ultimately, the best way to introduce parallelization in a MMO game server is to shard the game world. Unfortunately, it's also impossible to implement in TFS without a major rewrite.

I recommend watching https://youtu.be/2yXtZ8x7TXw - the talk presents the issue of synchronization quadrants and misuse of mutexes.

gunzino commented 5 years ago

@djarek

Thanks for the video tip. What about performing getSpectators in one part on map in first thread and pathfinding on other side of map in second thread? I think in that situation the synchronization problem is not the thing. Since in TFS most of the time is used to read.

Also for example: getSpectators in Game::addMagicEffect is used only to send protocol message to players. This is called about 30 times when casting exevo gran mas vis on empty map - everything in dispatcher thread. Do you see a problem by sending this in separate thread and invoking shared_lock on each QTreeNode? If some creature moves during the sending time (which has low probability) - it only have to synchronize on one or two QTreeNodes and rest is done without blocking.

I also figured out that C++ include std::shared_mutex since C++17 (part of GCC 6+) which has pretty good performance compared to boost::shared_mutex (implemented other way). It is best way to implement read/write. You can juse use std::shared_lock on multiple paralell readers and std::unique_lock on writer (writing can always happen in dispatcher thread).

I don't say that this implementation can be applied of every call of getSpectators but Game::addMagicEffect is one of examples. I also dont see problems with path finding since it doesnt mutate data.

About deadlocks: Adding std::unique_lock in start of functions that mutate data (addThing, removeThing and some more) should handle the problem shouldnt it? Lock is always freed if out of scope.

Also the lua scripting would be always done in dispatcher thread, I don't see any significant performance problems with that.

djarek commented 5 years ago

Thanks for the video tip. What about performing getSpectators in one part on map in first thread and pathfinding on other side of map in second thread? I think in that situation the synchronization problem is not the thing. Since in TFS most of the time is used to read.

That's very close to the idea of sharding I mentioned earlier, but sharding is your idea taken to a "next level". You run a small fragment of the game world on a separate thread, nothing is shared. Whenever a script/player from another shard needs to interact with other shards they use some message-passing interface. Unfortunately this kind of architecture is very different to what TFS has currently.

In my opinion it's much easier to get a significant performance improvement by doing smarter caching of map/spectator state. Slapping mutexes into every tile will create more problems than it will solve. Let's just consider the size increase - map objects constitute the majority of memory usage of TFS, adding ~56 bytes to each tile will effectively double the memory usage.

I also figured out that C++ include std::shared_mutex since C++17 (part of GCC 6+) which has pretty good performance compared to boost::shared_mutex (implemented other way). It is best way to implement read/write. You can juse use std::shared_lock on multiple paralell readers and std::unique_lock on writer (writing can always happen in dispatcher thread).

Performance of a lock/unlock is not everything, you have to take into consideration the entire effect it has on an algorithm. Whenever you submit work to another thread and expect a result from it, you will incur additional synchronization overhead. This is not just the locks and unlocks, but also potential synchronous waits or memory allocations necessary to facilitate getting a result of a computation submitted to another thread.

About deadlocks: Adding std::unique_lock in start of functions that mutate data (addThing, removeThing and some more) should handle the problem shouldnt it? Lock is always freed if out of scope.

Also the lua scripting would be always done in dispatcher thread, I don't see any significant performance problems with that.

Scripts often end up causing a fair bit of "recursion", e.g. a lua foo() function calls Tile::bar() which calls lua baz() which calls Tile::bar2(). In this case it's very easy to get accidental lock order inversion (which sooner or later results in a deadlock). Given how the TFS is currently designed around mutable state, I don't see how this can be avoided, without introducing race conditions.

gunzino commented 5 years ago

Performance of a lock/unlock is not everything, you have to take into consideration the entire effect it has on an algorithm. Whenever you submit work to another thread and expect a result from it, you will incur additional synchronization overhead. This is not just the locks and unlocks, but also potential synchronous waits or memory allocations necessary to facilitate getting a result of a computation submitted to another thread.

Personally, I would rather live with higher RAM usage than with bottleneck in a program. RAM is pretty cheap thing nowdays. About the Game::addMagicEffect - what memory allocations you mean? Because I think that magic effect message could be sent directly from second thread to player's protocolgame (it may need additional lock in protocolgame probably) and won't affect dispatcher thread. The sync would be needed in pathfinding probably since the dirList should be sent somehow.

Scripts often end up causing a fair bit of "recursion", e.g. a lua foo() function calls Tile::bar() which calls lua baz() which calls Tile::bar2(). In this case it's very easy to get accidental lock order inversion (which sooner or later results in a deadlock). Given how the TFS is currently designed around mutable state, I don't see how this can be avoided, without introducing race conditions.

I dont understand where the deadlock should happen since after calling every lua function the lock is freed and also important thing is that other threads would be read-only and they won't mutate any data. So if some lua script is mutating data (from dispatcher) it would invoke unique_lock so those reading threads would have to wait but this is low probability to happen so no blocking at all. Also if I understand it correctly when you call unique_lock on shared_mutex from same thread twice or more it won't block.

gunzino commented 5 years ago

@djarek

If we work with fact that other threads are READ-ONLY we could save a lot of CPU time in dispatcher thread by not touching mutexes at all while reading. Only invoking unique_lock on dispatcher thread may cause some additional overhead when it comes to locks. But there is very low probability of dispatcher to use same lock that other thread is reading from - map is very big and there are a lot of players doing stuff on several different spots on map.

There are a lot of places in code where getSpectators is only used to send information to protocolgame instance of every player on screen. I think this can be done in separate thread using locks since it does not need to send any data back to dispatcher - it can send protocol message directly.

Performing this functions + path finding in separate threads may save significant amount of CPU time in dispatcher which can be then used on other features.

I've started to look into this issue after I wanted to expand view range of monsters from 9x9 to 14x14 or more and discovered it would probably kill dispatcher with getSpectators and path finding calls. 9x9 area is restrictive and make monsters dumb since player can run 2-3 sqm out of screen and monster become idle.

ranisalt commented 5 years ago

This is a very interesting discussion.

I once had a shot at getSpectators where instead of searching for spectators inside the map, an entity would store a container with spectators that gets updated with every move. Something like:

Entity::onWalk() {
  for (entity: entitites in the new column/row visible) {
    entity.spectators.add(this)
  }

  for (entity: entities in the column/row no longer visible) {
    entity.spectators.remove(this);
  }
}

Entity::onLogout/onDie() {
  for (entity spectators) {
    entity.spectators.remove(this)
  }
}

That would spread the load of finding spectators to every move instead of once in a while doing a large search. Idling would also reduce to just checking if spectators is empty. What about?

djarek commented 5 years ago

Yeah, that's more or less what I had in mind when I mentioned smarter caching.

yamaken93 commented 5 years ago

Something i want to add to this discussion is how much cpu time is spent in Map::getTile. Maybe if the memory layout was built to make Tiles that are close to each other in the same Memory Page(make them closer in memory to help with cpu cache) it would speed up both path finding and get spectators(both being the top cpu usage in perf). @ranisalt idea is great too

gunzino commented 5 years ago

@ranisalt That's very good propose. What data structure would you use to store this cache lists? Also we stiil need to solve things like magic effect sending to clients. If you cast "exevo gran mas vis" it has to calculate spectators on every 70 tiles. Maybe combination of ranisalt propose and my idea would bring significant boost.

ranisalt commented 5 years ago

What data structure would you use to store this cache lists?

Probably an unordered set for faster access. It does not need ordering and sets are fast to modify and search.

Also we stiil need to solve things like magic effect sending to clients

Maybe we could extrapolate viewport and store spectators for a few more tiles for that purpose. storing spectators for 10 tiles up/down and 14 tiles left/right would be enough.

gesior commented 4 years ago

There is multi thread monster AI (A*) by Kondrah (he sells it for 500$). I got it on Kasteria... it does not work. It was obvious for me that his code is working in few threads, but they all are locking whole dispatcher for time of AI execution. So there is no optimizations of anything.

Tile locking idea is also stupid. TFS can easily handle 4000 online if each one of them is in other location. Problems are where there are 100 players on screen. In this case tile/area locking would be even worse because it would waste 200% time for locking and synchronisation.

Also all common Lua scripts would result in dead locks after few seconds.

gunzino commented 4 years ago

@gesior

So the only thing that may help is to cut the map into several pieces and run them in separate threads. The pieces should be like islands so the player's screen cannot be in two different pieces at once.

In the case, you described it would at least help by that it won't slow down other threads. The implementation should not be that hard, would just need some memory management changes. (eg. releasing creature/item objects in different way) and also some changes needed in scripting engine. (maybe all game threads will have their own script engine object)

gesior commented 4 years ago

@gunzino It would be hard to implement and you would need to define 'parts of map' (by positions min/max x,y,z). Still it would only allow 'path finding' in XX extra threads ['read access', not modifying anything on map], not executing 'dispatcher' (walking, talking etc.) in more than 1 thread. Walking, talking, using items execute LUA and LUA may want to modify something in other 'part of map', which would ... bla bla bla .. YOU JUST CANNOT DO IT

Only implementation that partially works is Kondrah implementation of multithread path finding. In moment, when 'dispatcher' is calculating paths for X players, it sends these paths calculation to X threads. Let's say we got 1500 players and monsters 'active' and we send these calculations to 8 threads... but until all threads finish calculations 'dispatcher' (main TFS thread) is locked.

It may get a little boost from multithreading. Biggest problem is that in case of OTS we search for CPU with top single core performance and these CPUs got 4 cores (4790K, 7700k).

Soon there should be 8700K/9700K with 6/8 cores in datacenters, but right now difference in path finding in multiple thread isn't big.

gunzino commented 4 years ago

@gesior

My idea is that you would not need to solve those problem at all.

My idea is that every map section should be like separate server: mainland - server1 roshamuul - server2 oramond - server3

But they will share same port (7172) and players does not have to reconnect to travel between them.

All of then will have separate Game instances with no no any collisions. The only thing to solve would be how to travel players between those ‘servers’.

If path finding will be called on server1, only server1 will be blocked.

Maybe it would also need some global dispatcher for sending parcels etc. but for me it seems solvable.

gesior commented 4 years ago

@gesior

My idea is that you would not need to solve those problem at all.

My idea is that every map section should be like separate server: mainland - server1 roshamuul - server2 oramond - server3

But they will share same port (7172) and players does not have to reconnect to travel between them.

All of then will have separate Game instances with no no any collisions. The only thing to solve would be how to travel players between those ‘servers’.

If path finding will be called on server1, only server1 will be blocked.

Maybe it would also need some global dispatcher for sending parcels etc. but for me it seems solvable.

What about actions that do operate on players on more than 1 map part? Sending messages? Globalevents? I don't see anyone interested in making instance-based server. That are hundreds of hours of programming and very easy to forgot one little change that would crash or freez server.

If you want to do 'separate servers' between which you can 'travel', it would be easier to write some reverse-proxy in front of 3 OTSes and make it add extra packets to reconnect players without really reconnecting them.

As we said in comments above, it's not possible to do multi thread path finding on current engine. It's totally incompatible and adding compatibility (locks) will waste more CPU then you get from multi thread path finding.

Main question is: What are benefits of all that work? What problems do we got, that we can't solve easier way? I don't see any. Problems with path finding are visible only on 1000+ online servers. How many servers with 1000+ we got now? In most cases they are full of afk botters that are not path finding (runes, skills).

otland / forgottenserver

[Idea] Use locking on tile objects to perform multithreading pathfinding and getSpectators #2539