pmmp / PocketMine-MP

A server software for Minecraft: Bedrock Edition in PHP
https://pmmp.io
GNU Lesser General Public License v3.0
3.27k stars 1.55k forks source link

Plugins/blocking operations hogging the AsyncWorkers causes significant server slowdown #1161

Open dktapps opened 7 years ago

dktapps commented 7 years ago

Issue description

A frequent problem I see these days is plugins are now often using AsyncTasks to perform lengthy blocking operations. When I say "lengthy", I mean > 1 tick.

This can become a serious problem if plugins are frequently scheduling lots of AsyncTasks, especially if they are performing blocking operations such as disk I/O or network I/O.

Since ALPHA6, the PocketMine-MP core now always uses AsyncTasks for chunk compression and serialization to improve performance and take advantage of the whole of the CPU.

PocketMine-MP opens a thread per core of the CPU (in hyperthreaded CPUs this is one per logical core) to allow CPU-expensive operations such as compression and world generation to take advantage of the entire CPU. If a plugin schedules an AsyncTask to a worker to perform a blocking operation, that worker is subsequently unusable until the operation completes. This means that while the operation is executing, one core of your CPU is not doing anything.

This problem gets worse when plugins spam lots of blocking AsyncTasks. The tasks will be distributed evenly across the AsyncWorkers, which means you can arrive at the point where all of the workers are performing blocking jobs... which means that important core stuff that executes on the workers, such as chunk serialization and compression, is significantly delayed, while your CPU does nothing whatsoever and could be executing other tasks.

Increasing the worker pool size won't help the problem either, because as stated above, tasks are evenly distributed across all workers. Increasing the worker pool size will only slightly reduce the size of the problem.

The TL;DR of this is that it is a bad idea to use the AsyncWorkers heavily for blocking operations, as it will impact on the entire server performance.

Steps to reproduce the issue

  1. Make a plugin that does lots of blocking operations with lots of AsyncTasks
  2. Watch server performance appear to deteriorate while actually waiting for blocking operations to finish on workers.

Possible solutions

Ideally blocking operations would be done on their own dedicated threads to avoid blocking up the worker pool and slowing down the server. However I think the correct and safe use of PHP threads is going to be out of the reach of most average plugin developers.

A possible alternative solution is to provide a second worker pool which can be used for blocking operations, leaving the server main worker pool free to use the CPU to do important CPU-expensive stuff. That way the second pool can be overloaded with blocking jobs as much as you like while not blocking the core pool used for expensive stuff.

MisteFr commented 7 years ago

Is it related to the infinite generation issue ?

dktapps commented 7 years ago

That depends. Have you tested the Building Terrain problem on a server with no plugins installed?

MisteFr commented 7 years ago

Nop I am going to try but to make it happens I need time and lots of players to join. But it can be related cause I am using a lot of AsyncTask ( ~ 1 scheduled every 2 secs).

SOF3 commented 7 years ago

Note that the second AsyncPool must have enough workers too. Otherwise, the blocking operations will still accumulate.

In simple words, the proposed solution does not really solve the problem of having insufficient workers to handle blocking operations, but only isolates the problems.

dktapps commented 7 years ago

@SOF3 the problem is that blocking operations block the main worker thread pool, which causes this problem. I don't care if a second pool blocks as long as the main worker pool is free to do whatever it needs to do.

markkrueg commented 7 years ago

I really like the idea of having a dedicated pool for the core and a separate pool for plugins. It could be an optional setting for those with few cores. Even better would be a setting to say how many cores are dedicated to each.

I wouldn't be surprised if this is what I am seeing happening. When I see heavy lag I notice there are often hardly any messages in the console (even with the debug messages we inserted). Then after a while (30 seconds or a minute or two sometimes) it will suddenly "free up".

Nothing is consistent enough to provide proper debug information...which is frustrating. But it feels like something similar to what dktapps is describing in this issue.

How is it possible to monitor "blocked" pools or threads? I assume they would not show as 100% on something like HTOP. Is there any way to watch for this scenario?

dktapps commented 7 years ago

@markkrueg blocking operations don't cost any CPU execution, meaning that other threads could be making use of the CPU while blocking operations are taking place (async I/O). All you would see is the server seemingly not doing much for no apparent reason.

As I said, ideally this kind of operation would be done using dedicated threads or dedicated workers, but I think that's outside the comfortable capabilities of most plugin developers.

A timings equivalent for worker threads would also be good.

Thunder33345 commented 7 years ago

isnt the normal async already quite dangerous when used wrongly, how bad would the dedicated threads be, isnt it the same as async? in using it like no passing $server etc

Muqsit commented 7 years ago

Just want to put this out.. It's not related to the infinite building terrain bug. I've tried calling an async task during PlayerPreLoginEvent, PlayerLoginEvent and PlayerJoinEvent. All of them completed and yet the building terrain screen continued on.

No, I don't know how to reproduce that bug. It was tested on a production server for trial and error. I know that was a bad move, but it was a harmless test.

SOF3 commented 7 years ago

It's more accurate to call this a "throttle" of the async workers. Too many AsyncTasks get scheduled in a short time, forming a bottleneck.

dktapps commented 7 years ago

@Muqsit "an" async task won't do it. You'd need to schedule lots of them, or very long ones.

CortexPE commented 7 years ago

image

CortexPE commented 7 years ago

It's an attack known as "Slow-Loris" kinda similar to this issue.

tnytown commented 7 years ago

@CortexPE The number of connections to the server is not the problem if I'm interpreting this correctly. The problem is plugins scheduling multiple long running AsyncTasks, thus rendering multiple workers in the queue inaccessible. The time limit you suggested however may be a viable solution to the issue. Ultimately, this boils down to developer education. Developers should not be putting long running operations into AsyncTasks in the first place.

JackNoordhuis commented 7 years ago

Perhaps adding the ability to schedule async tasks with a priority? This way core tasks are queued at a higher priority than plugin tasks by default. Plugin developers could still abuse the priority and this would not completely eliminate the throttling problem but it would stop core tasks being pushed to the end of a long queue.

SOF3 commented 7 years ago

Prioritized async tasks do not really solve the problem as CPU time is still wasted when executing low-priority tasks (from @dktapps).

If you look into the code in AsyncPool, you would find that the async tasks are queued using the Worker::stack() method directly. To implement prioritized async tasks, an infrastructural change in how AsyncPool works needs to be carried out. As it does not really solve the problem, such a change would be too costly.

daniktheboss commented 6 years ago

I confirm this with a running socket on async, the chunks dont load when the async task is running

dktapps commented 6 years ago

@daniktheboss you should read up into using dedicated threads for that, particularly if you're using a long-living blocking socket. Async workers are not suited to that kind of abuse.

dktapps commented 6 years ago

After doing some work on network I've come to the realization that we need a dedicated thread pool for network processing, and not be abusing the async workers. The usefulness of the workers is heavily limited by the fact that are only polled every 50ms.