robocode-dev / tank-royale

Git repository for Robocode Tank Royale
Apache License 2.0
144 stars 29 forks source link

Server is kidnapping CPU cycles #45

Closed SirStone closed 2 years ago

SirStone commented 2 years ago

Describe the bug Version of the server 0.17.1, the server after a while reaches high level of CPU % taken, to my old macBookPro from 2012 is reaching 100% or more, I monitor the process using the tool htop and filtering with the keywork tankroyale The problem doesn't look to appear until a game is started but when happens doesn't matter if the game is aborted or the bots are disconnected/unbooted/killed, the server is getting hold of my CPU

To Reproduce Assuming the problem is not just related to me, boot up a server, join some bots, run a game, stop the game, un-run the bots. Monitor the CPU usage from the server on each step.

Expected behavior server to use any CPU is required to him during the battle and releasing it when the battle is finished

Screenshots htop screenshots, ignore the blue higlighted row and fcus on the server row and the CPU% column

Server just started: Screenshot 2022-09-18 at 19 01 16

Walls and Crazy joined Screenshot 2022-09-18 at 19 01 53

Game started Screenshot 2022-09-18 at 19 02 14

Game aborted Screenshot 2022-09-18 at 19 02 35

Bots killed (after the games i aborted in order to avoid any illegal disruption of the game) Screenshot 2022-09-18 at 19 02 52

Desktop (please complete the following information): MacOS Catalina on MacBook Pro 13-inch Mid 2-12

Java info: java 18.0.2.1 2022-08-18 Java(TM) SE Runtime Environment (build 18.0.2.1+1-1) Java HotSpot(TM) 64-Bit Server VM (build 18.0.2.1+1-1, mixed mode, sharing)

extra running server with this command line: java -jar robocode/servers/robocode-tankroyale-server.jar --port=1234 --botSecrets=secret1 --controllerSecrets=secret2 The server jar is the one prelevated from the robocode-tankroyale-gui-0.17.1.jar upon extracting it

flemming-n-larsen commented 2 years ago

@SirStone Have you experienced this issue only for version 0.17.1 or also on older versions of the server? Version 0.17.1 had very few changes compared to version 0.17.0. Only the stuff with session-id etc. was changed, and should not be enough to hurt performance.

I have run the server with Java 18 on two machines:

And I ran from 8-30 bots at a time, with 30 TPS and ~900 TPS (with turn timeout = 1 ms). I have booted, unbooted, started battle from scratch, and restarted the server, but not rebooting it. I also tried out port 1234 instead of the default 7654.

I have been using IntelliJ Ultimate with profiling tools to monitor the server. The results between the two machines were very close, and under heavy pressure with 30 bots at ~900 TPS, this is the result:

unnamed

So I am wondering if my machines are simply very powerful, and hence I don't run into the issue. Or perhaps something is troubling the server on your side?

In the very early days I had problems with bots that did not disconnect correctly from the server. But this should not be an issue anymore, and the connection to bots is kept as long as they are not stopped/disconnecting the server.

SirStone commented 2 years ago

Hello I already answered via e-mail but I can't see my message here, so I assume that my answer never reached here... this was my message wrote the 19/09/2022:

I experienced this from 0.17.0 too. I can do more tests:
-Older version of the server
-running the GUI avoiding all my code influence

More people could participate with their experience in order to understand if my machine is an isolated case.

It just came in my mind that I experienced another thing, running with my interface so for now I can't completely say that it's not my fault.
Lately I run continuously a battle made of 2 round, after battle ends it is started again right away programmatically (I do this to get the results faster); after some time that I'm observing the battle, the graphic just hangs, aborting or starting a new battle generates the messages from the server for the controller but no more ticks are received.

I told you this only because could be related to the CPU kidnapping and some sub-process could be killed from the system.

Would you be interested in my game setup? Maybe I'm pushing in a wrong and unexpected way the timings of the game.

Today I'm here providing more screenshots, I've tested the very same steps with versions 0.17.0, 0.16.0, 0.15.0 and 0.14.0 and all shows the same behaviour For each, in order:

  1. server just started <-- CPU is low
  2. bots joined <-- CPU is low
  3. running game <-- CPU goes high
  4. stopping game <-- CPU remains high
  5. bots left <-- CPU still remains high

0.17.0

  1. Screenshot 2022-09-24 at 11 28 01
  2. Screenshot 2022-09-24 at 11 33 32
  3. Screenshot 2022-09-24 at 11 33 55
  4. Screenshot 2022-09-24 at 11 34 14
  5. Screenshot 2022-09-24 at 11 34 30

0.16.0

  1. Screenshot 2022-09-24 at 11 46 26
  2. Screenshot 2022-09-24 at 11 47 27
  3. Screenshot 2022-09-24 at 11 47 44
  4. Screenshot 2022-09-24 at 11 48 02
  5. Screenshot 2022-09-24 at 11 48 20

0.15.0

  1. Screenshot 2022-09-24 at 11 50 36
  2. Screenshot 2022-09-24 at 11 51 13
  3. Screenshot 2022-09-24 at 11 51 28
  4. Screenshot 2022-09-24 at 11 51 46
  5. Screenshot 2022-09-24 at 11 52 04

0.14.0

  1. Screenshot 2022-09-24 at 11 52 54
  2. Screenshot 2022-09-24 at 11 53 36
  3. Screenshot 2022-09-24 at 11 53 50
  4. Screenshot 2022-09-24 at 11 54 07
  5. Screenshot 2022-09-24 at 11 54 32

Test with the GUI I've noticed that running the GUI is not creating this effect, no need for screenshots, the CPU goes up when the battle starts but as soon STOP is used the CPU usage drops to the normality (for a not running game).

My two cents:

  1. The server, when the game is aborted leaves some routine running
  2. The GUI, when stopping a game, is calling something on the server (maybe killing/rebooting it?) that solves/hides the problem
flemming-n-larsen commented 2 years ago

I guess the issue you experience has existed with the server for a long time, as it has not changed much since February where I put the stuff on GitHub. I basically just adjusted game logics, fixed bugs, and added a few things like secrets and session id.

When benchmarking the server, I always run it outside the GUI. But it should not make much difference, as the server, booter, bots are all started up in their own processes, and the processes are killed when stopping these.

Regarding... 1) Yes, it could be that some left overs are still taking up CPU when aborting a game. I will look into that. I think I already benchmarked this in IntelliJ, where it looked okay. I am wondering, if it could be the Java version that makes a difference? E.g. the garbage collection could work differently.

2) When the GUI is killing the process, it simple sends a single line containing the letter q for "quit". Then the server exits its own process. The GUI controls the server via the ServerProcess.kt (line 78) And the server reacts on input within Server.kt (line 128)

The rest of the communication between the controller (GUI) and server is via websockets.

flemming-n-larsen commented 2 years ago

I did a new benchmarking of the Server using the IntelliJ Ultimate profiler on my Windows 11 PC running Oracle JDK 18.0.1.1 on Windows 11, AMD Ryzen 5 5600G 3.90 GHz (not running in full speed):

Skærmbillede 2022-09-24 210637

I am running 8 java sample bots with 30 TPS and turn timeout on 5000 us.

Even though my machine might be faster than your Mac Book from 2012, it should not matter, as we should see a rise on CPU and heap memory over time. But on my machine, it is going back to normal when I abort or restart a battle, and the curves are stable.

Max heap memory used: 147 MiB Max CPU usage (single thread): 5%

The server is running a single thread (ServerProcess.kt line 124) which is run to accept input from stdin (waiting for the q command) A thread is being started and stopped with the NanoTimer.kt responsible for triggering a "new turn" event, when the timer is started and stopped for a battle. Even though this thread could be leaking (?), it should not take up lots of CPU resources for before it reaches thousands of threads.


Regarding your battles... You run only two bots in 30 TPS and perhaps the default turn timeout of 30000 us? If you get near 1000 us (1 ms), you should expect problems (due to timer resolutions etc.). How many battles are your starting and aborting to observe the issue, and over how long time? If it is the first few battles, then something is definitely off.

SirStone commented 2 years ago

I will play with the game setup timing variables to see if something changes but right now... How many battles are your starting and aborting to observe the issue, and over how long time? always the first one, 100% of times and over how long time? starting the game and aborting it after 10 seconds, 1 second or 1 millisecond to me is producing the same high CPU never released effect.

For me this is not a stopping issue, is just one click for me to reboot the server and stop the problem and during the battle is taking high CPU either way. Because you don't experience this I suggest to wait until I try on another machine and/or operative system and maybe someone else can input their experience as well.

flemming-n-larsen commented 2 years ago

Yes, it sure would be great to hear if anybody else is experiencing this problem too. And it would like to hear what you experience on another machine as well.

stobias123 commented 2 years ago

I see this same thing! My setup is a little different. Looks like this.

Server version 0.17.2 - I'm on a fairly powerful Mac - i9

  1. Server is in docker container (one container per "environment")
  2. I run multiple server instances at a time. (Usually between 2-4)
  3. Two bots per server. 30TPS.

If I leave the server idle after bots leave I end up with very high CPU utilization.

flemming-n-larsen commented 2 years ago

@stobias123 Thank you for the feedback. 🙂

I realised that I focused too much on what was going on within the battles. But it seems like the issue is triggered after the battle has ended or is aborted. I am still not able to see 100% CPU usage on my systems/machines, but I do see it is using CPU when the game should have stopped running, which was definitely an issue.

I have had a deep look into the server, and I have fixed these issues on the server on the master branch, and these fixes comes with the next release:

Freeing the heap memory was not really a bug, but by design (keeping state in order to save a record of the entire battle history later). Anyways, I am freeing up the memory when the game is over to memory is free when the server waits for a new battle.

I am not sure if (all) issues are fixed now. But at least some of it has been fixed by now, and we'll see with the coming release if this also works out for you guys.

flemming-n-larsen commented 2 years ago

@SirStone, @stobias123 I have now released version 0.17.3 with several fixes for the server. I hope this one fixes the issue(s)?

SirStone commented 2 years ago

:clap: this 0.17.3 solves the problem for me!

  1. server just started <-- CPU is low Screenshot 2022-10-08 at 18 44 14
  2. bots joined <-- CPU is low Screenshot 2022-10-08 at 18 44 55
  3. running game <-- CPU goes high Screenshot 2022-10-08 at 18 45 12
  4. stopping game <-- CPU goes low!!!! Screenshot 2022-10-08 at 18 45 36

I dno't even need to un-boot the bots to go back to normal CPU usage now. No modifications have been made by me that could tamper with the results, better wait for the others to confirm my results.

My guess is that the problem was with the "not stopped timer" you mentioned.

flemming-n-larsen commented 2 years ago

@SirStone Thanks for the feedback. I am really happy that this issue is now nailed. I am sure that is was the timer that caused the issue. I will close this issue now.

SirStone commented 2 years ago

I experienced this from 0.17.0 too. I can do more tests: -Older version of the server -running the GUI avoiding all my code influence

More people could participate with their experience in order to understand if my machine is an isolated case.

It just came in my mind that I experienced another thing, running with my interface so for now I can't completely say that it's not my fault. Lately I run continuously a battle made of 2 round, after battle ends it is started again right away programmatically (I do this to get the results faster); after some time that I'm observing the battle, the graphic just hangs, aborting or starting a new battle generates the messages from the server for the controller but no more ticks are received.

I told you this only because could be related to the CPU kidnapping and some sub-process could be killed from the system.

Would you be interested in my game setup? Maybe I'm pushing in a wrong and unexpected way the timings of the game.

Il Lun 19 Set 2022, 22:16 Flemming N. Larsen @.***> ha scritto:

@SirStone https://github.com/SirStone Have you experienced this issue only for version 0.17.1 or also on older versions of the server. Version 0.17.1 had very few changes compared to version 0.17.0. Only the stuff with session-id etc. was changed, and should not be enough to hurt performance.

I have run the server with Java 18 on two machines:

  • MacBook Pro 2019, 2,4Mhz, 8-core, Core I9
  • PC with Windows 11 AMD Ryzen 5 5600G

And I ran from 8-30 bots at a time, with 30 TPS and ~900 TPS (with turn timeout = 1 ms). I have booted, unbooted, started battle from scratch, and restarted the server, but not rebooting it. I also tried out port 1234 instead of the default 7654.

I have been using IntelliJ Ultimate with profiling tools to monitor the server. The results between the two machines were very close, and under heavy pressure with 30 bots at ~900 TPS, this is the result:

[image: unnamed] https://user-images.githubusercontent.com/2179727/191107813-1cd9360d-d2d9-42a5-bfe4-fa66e60ab1fc.png

So I am wondering if my machines are simply very powerful, and hence I don't run into the issue. Or perhaps something is troubling the server on your side?

In the very early days I had problems with bots that did not disconnect correctly from the server. But this should not be an issue anymore, and the connection to bots is kept as long as they are not stopped/disconnecting the server.

— Reply to this email directly, view it on GitHub https://github.com/robocode-dev/tank-royale/issues/45#issuecomment-1251504838, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7Z3CFNZ2BVM4XPMN2WT3V7DCZZANCNFSM6AAAAAAQPQZQL4 . You are receiving this because you were mentioned.Message ID: @.***>