triplea-game / triplea

TripleA is a turn based strategy game and board game engine, similar to Axis & Allies or Risk.
https://triplea-game.org/
GNU General Public License v3.0
1.32k stars 393 forks source link

Headless bot servers are locking and failing to reload as well as not accepting a new map load. #6909

Closed prastle closed 4 years ago

prastle commented 4 years ago

How can the problem be recreated?

@SilverBullet @RoiEXLab @DanVanAtta @ron-murhammer The bots appear to lock up after an indeterminate number of rounds. With virtually most maps it appears not just revised. I resized three servers to double their ram and hard drive size in case it was a memory problem. This has not fixed anything. The only solution until the devs can fix them is to restart the servers directly from Linode. This appears to fix the problem temporarily. Hopefully they can get the headless bots fixed shortly since their does not appear to be any problems with individuals that host privately.

Game Version: Latest stable lobby bot servers.

Is there anything else we should know?

It appears to work again for awhile if the linode is restarted.

DanVanAtta commented 4 years ago

@prastle I see the bots with updated system specs are still updated, would you mind converting them back to how they were. It's best if we can keep the bots homogeneous. It's not really surprising that it did not change anything as we have sized max memory usage with account for how many bots are run and there is plenty of free disk space.

Some questions:

The only solution until the devs can fix them is to restart the servers directly from Linode.

Please avoid restarting bot servers until we've had a chance to look at them.

prastle commented 4 years ago

Locked up means no one can join them locked up means no one can load a new game... THUS USELESS Only fix atm is restart linode And btw I changed nothing other than a resize they are still YOUR BOTS that you loaded none of these problems in 1.9 @DanVanAtta If you join the lobby and snoop around I am certain you will find half or most locked up" since my last reload was 6 hrs? ago go have a look

prastle commented 4 years ago

And to be said if I had avoided restarting bots all players would have not played for 2.5 days @DanVanAtta I did mention this as did silver @many places and to all devs

DanVanAtta commented 4 years ago

@prastle no need for the hyperbole or the caps, keep calm.

If nobody can join a locked up bot, how can you change map?

For the bots that you can join, was anything done to them?

Have you tried running a moderator 'shutdown' on the "locked up bots".

Please do revert the system settings back to what they were since they did not help.

Understood that some bots needed help, I don't think it would have been 2.5 days though; for such serious situations do not wait for 2 days to post a bug report.

DanVanAtta commented 4 years ago

Is it the case that some bots you can join, and others you can? Though, when you can join a bot, the map selection might be broken?

DanVanAtta commented 4 years ago

And btw I changed nothing other than a resize they are still YOUR BOTS that you loaded none of these problems in 1.9

We're in this together, it is the project and we are a team. I try to avoid the words "mine", "yours", when talking about the project. It is not my project, a persons project, it is not my code, nor 'your code', it belongs to TripleA. Part and parcel is that people will all rotate in and out of the project, and the maintainers are the group ultimately that will need to fix (or not fix) any problems.

I'm not terribly surprised we're seeing some 2.0 pains with bots. The headless bot code is full of hacks and was done in an utter rush to enable players to join games without port forwarding.

prastle commented 4 years ago

Yes some you can join and some you cannot it atm seems like a time thing. Thus why I added more ram and hard drive to no effect. Simple answer All bots locked after a few hrs unless they still have a game going that has not crashed.

prastle commented 4 years ago

Moderator shutdown makes the bot reappear still broken a few seconds later so that is a no go as well... In addition to this since the ability to boot a ghost is gone for mods we might be dealing with 2 or three problems here

prastle commented 4 years ago

first report was created on Friday it is Sunday end of story. Granted I was camping and did not make the first report.

DanVanAtta commented 4 years ago

@prastle this issue was created 10 hours ago. I think this highlights how problematic it is that our communication is still split between multiple channels.

Moderator shutdown makes the bot reappear still broken a few seconds later so that is a no go as well...

Do you have any bot numbers where this was the case?

I think we probably need a way to log moderator actions. I'm trying to correlate events in the logs but not knowing what happened to which bots, and which were in what state is making it very difficult.

In addition to this since the ability to boot a ghost is gone for mods we might be dealing with 2 or three problems here

Maybe, but 'ghost' players are probably a side-effect. Running a shutdown should get rid of them.

prastle commented 4 years ago

The last moderator shut down I ran as a test was after you asked earlier above ^^ on the same bot I asked you to look at when you were live in the lobby with me...

bobstein has joined LaFayette has joined Willfistwu has left nicmontreal44 has left bobstein has left Sativa has left Vlada has left (4:53:53 p.m.) Maraxus: global game anyone?' DwayneDibly has joined DwayneDibly slapped Maraxus (4:57:13 p.m.) DwayneDibly: Anyone looking for a game? prastle slapped LaFayette bobstein has joined (4:57:42 p.m.) prastle: join a few :) you will see MississippiKid slapped RickJame5 andrewthree has joined (4:58:44 p.m.) prastle: example 307 Cali andrewthree has left frank-the-tank has left eboy82 has left (5:01:37 p.m.) prastle: well you could stay long enough to test (5:01:49 p.m.) prastle: example launch the save or switch to a new map (5:01:59 p.m.) prastle: no matter Wassmuss has left

This was bot 307 California which rejoined after I moderator stopped it a few seconds later...

prastle commented 4 years ago

After this post I can add some interesting news... Apparently my ghost is still in that bot from when I was trying to show you yours left... 742 pm ish bot 307 Cali

Cernelius commented 4 years ago

I believe this problem has been reported 2 days ago: https://github.com/triplea-game/triplea/issues/6875

To me, shutdown appears working fine (I assume the host will eventually break down again, though).

However, I really think "shutdown" should be renamed to something better, to explain you are bringing the host back at its initial state. The name really makes me think you are nuking the bot definitively, and I think moderators might not use it because of not knowing what they are actually doing to the host.

prastle commented 4 years ago

A rename would be good cer Since it rarely accomplishes the goal of a restart

Cernelius commented 4 years ago

I actually operated several "shutdowns" on bots broken as per what described at this issue, and it seemed always to work as I understand it should. However, I've no idea how long the host remained in working condition, thereafter.

prastle commented 4 years ago

join 307 Cer I have not fixed it and note it will not switch a map for me or launch

DanVanAtta commented 4 years ago

Good bug reports help get problems fixed faster. We need to isolate what is happening, a clear explanation of where we are seeing the problem is really useful.

Steps to reproduce should be a list of what needs to be done to see the problem. Joining bots (and failing), changing maps, and starting a game seem to be the three observable ways to trigger this problem.

I'm curious if there are any error messages that we have observed during all three of those. For changing maps, I saw the error was silent, that simply nothing happened.

Getting bot numbers on any of these issues would be helpful as well so we know where to look in the logs (along with time stamps of when problems happened).

Cernelius commented 4 years ago

Substantially the problem with bots is that, here, the ones that should have done the bug reporting are obviously the users experiencing the problems (thus, for example, not me, as I virtually never use bots). However, almost only relatively advanced users ever report anything, or make good bug reports (as you say), and such users I believe are usually not bot users.

I think the only proper bug report, made in the proper place by a user experiencing the problem is the one that I already linked, but it seems got overlooked by everyone (at least for the past 2 days).

prastle commented 4 years ago

Either way 207 is locked atm and I am sure a few others... Sniff around gents I am going to bed..

DanVanAtta commented 4 years ago

It's a tough balance between asking detailed questions and users not doing anything. Really if the prompts are read and followed, the bug report should be quite quality. A problem is people tend to want to say "this is not working" and leave it at that. Which always leads to obvious questions "what is not working", "what does that even mean, error message, silent failure, something else?,"

Regardless, the finger pointing is not interesting. It's valuable though us to be concise and specific.

DanVanAtta commented 4 years ago

My working theory on this problem is from server side 429s. We typically would not expect them as the bot should be 'well-behaved.'

I think the confluence of uploading chat messages and sending stack traces as individual chat messages is really the trigger. The gameOverException that we have had for some time, generally benign and arguably an error handling bug as the 'exception' seems to happen in exceptionally normal situations where users just use the application normally. This triggers a potentially bad sequence of events:

Any subsequent updates within the next minute can then hit a rate limit (429) error from server. Testing out a few code paths, an exception on updateStatus can lead to a dead game that you cannot join.

Short term, we can turn rate limiting to 'reporting mode' so it is no longer enforcing. Perhaps we should also change bots to only report an exception message and not the whole stack trace.

Long term, we'll likely need to harden any bot interaction with the server. In theory it was a likely vulnerability already, to some extent the rate limiting has been a 'chaos monkey' experiment with how bots handle failed requests to the server (apparently not very well after observing some errors injected).

I have not been able to reproduce the same behavior we are seeing where map switches fail silently. I wonder if the game update 429 errors are often killing the java socket acceptor thread which then kills communication to the server. I suspect there must be more to it as the failure to change maps is a silent error, no error message on the server (so I would have expected an error sending the message to have been reported client side, perhaps there is no error or there is one and we are suppressing it).

prastle commented 4 years ago

I created the issue out of exasperation about 12 hrs after the first Linode reset because the devs had not responded to any of the issues that the players posted or mods posted... Lets not forget I was camping. So Yes I 100% agree that the way devs receive important info is failing... :)

DanVanAtta commented 4 years ago

We should probably extract this into a new issue if we want to dive into the post-mortem.

  1. When servers are restarted, especially at a linode level, the dev team should be notified of which server and when (the first time the restart is done). Ideally dev team is notified before the restart is even done.
  2. The player topics were generally vague, it looks like forum postings is the first place this was highlighted.

I was happily camping as well, so we should also bake in that we're not all going to be available all the time and not take each others time for granted as its not always available. I was wondering where you were the last few days @prastle , was missing your cheery banter : )

DanVanAtta commented 4 years ago

I'm going to hide all posts not related to the immediate problem, again, if we want to discuss how to better communicate, lets do that in a dedicated thread and not in a thread dedicated to another important problem that is very technical..

Cernelius commented 4 years ago

Pasting from the current lobby chatlog, in case it might be of any use (feel free to mark this post as off-topic if not):

(20:02:32) Ostronald: are the bots stuck again? (20:02:46) panzerfury: looks wierd (20:02:58) sailingcondor: they arent resetting willy1 has left (20:03:13) Ostronald: so which are still workingß champ74 has joined (20:04:44) Ostronald: they all seem to be stuck :( jan_ has left (20:05:27) Navalland: hi (20:05:28) Ostronald: 203 (20:05:30) Ostronald: is working Ostronald slapped panzerfury (20:05:40) Cernel: hey (20:05:48) Cernel: 270bc wars? (20:05:56) Navalland: yes ll (20:06:05) Cernel: read my mind seancb slapped akaroadman (20:07:00) akaroadman: ?? (20:07:03) seancb: dc (20:07:04) seancb: save interstellar has left (20:07:36) Cernel: when a bot breaks down while you use it you are supposed to file a bug report, j ust letting everyone know seancb slapped akaroadman (20:08:01) Cernel: noone else can give as much context and help determining what caused the breakdo wn (20:08:29) akaroadman: 10 (20:08:34) akaroadman: 103 (20:09:04) GeneralEmCgee: every single bot I use on this new server has an atrocious lag in correlation wi th the game's reactive feedback to real time (20:09:20) GeneralEmCgee: it's bad, like real bad thekhan has joined thekhan slapped NGMC (20:10:18) GeneralEmCgee: can't play a single damn game on any of these servers without a gaping issue lik e this Patton77 has joined (20:10:26) NGMC: hi (20:10:28) NGMC: sorry (20:10:29) NGMC: in game (20:10:30) NGMC: :) (20:10:41) sailingcondor: revised anyone? (20:10:54) Cernel: ok guess I'll post this in the relevant thread, but you should really make a bug report yourself (20:10:58) Cernel: if you haven't yet Patton77 slapped sailingcondor (20:11:07) thekhan: you are not the only one that got problems with the bots (20:11:11) Cernel: I cannot really report what others are experiencing (20:11:12) Patton77: bot? (20:11:18) Cernel: I never use bots (virtually) (20:11:19) sailingcondor: where you be? (20:11:21) seancb: You were slapped by seancb (20:11:26) akaroadman: 304 (20:11:28) Patton77: 406 (20:11:28) Cernel: y? (20:11:30) sailingcondor: ok (20:11:41) seancb: is there any way to salvage a game we cannot save it (20:12:01) Cernel: ah that's a difficult question, I don't really know, probably you need prastle, what bot? (20:12:20) seancb: maybe we got it to save (20:12:23) seancb: i'll find out

EDIT: More:

(20:12:48) AdamF: some bots work, some do not KingKelly has left samstreeter has joined samstreeter has left (20:13:50) Patton77: MOD? (20:13:58) Patton77: is there a bot that works? ramonet has left (20:14:48) sailingcondor: so glad they did a reversion of the old version, which btw worked most of the ti me (20:14:50) thekhan: lol (20:15:29) thekhan: even happyer when the bugs are fixed and the bots work Sativa has left (20:15:44) sailingcondor: like why do we need to keep updating our cell phones every year. Im still trying to pay off the one I got two years ago (20:15:49) Patton77: tried 5 bots now (20:15:59) Cernel: when a bot stops working while you use it, please file a bug report giving as mu ch information as you can (20:16:08) AdamF: I did 2x cern (20:16:20) AdamF: Return email said "problem has been resolved" DSC has joined (20:16:33) Patton77: 6 DogfacePuffer has joined (20:16:36) Cernel: meanwhile, if you cannot find a working bot to use, any mod may relaunch a broke n one (unless it is deemed opportune to leave it broken) (20:16:51) AdamF: How do we relaunch? (20:17:04) sailingcondor: where you be Patton? (20:17:08) AdamF: Cannot change the game. Bots are 'stuck' (20:17:09) Cernel: hey AdamF thanks for the bug report, but I actually believe yours was closed bec ause of being duplicate (20:17:17) Cernel: so be sure to report in the other one (20:17:21) GeneralEmCgee: it's not that they don' work, but the gameplay is chalky as all hell, not even c lose to the smoothness of the old server (20:17:24) GeneralEmCgee: You were slapped by GeneralEmCgee (20:17:33) Patton77: can u make bot 102 work for example (20:17:38) Cernel: well this is the first time I hear about this (20:17:53) AdamF: OH (20:17:55) Cernel: and guess the developers won't know either if noone tells them Skippi has joined (20:18:04) AdamF: But its clearly not resolved (20:18:05) sailingcondor: the bots arent resetting (20:18:14) GeneralEmCgee: well I'd be more than happy to talk every moderators ear off about my frustratio ns with the lag :p (20:18:58) sailingcondor: here is a thought.,.. Go back to the original version... (20:19:24) GeneralEmCgee: here is a thought, that's a half measure solution (20:19:33) Cernel: actually moderators are not supposed to be developers or a link to developers, t he link is at the start of the chatlog (20:19:45) Cernel: Report any problems here: https://github.com/triplea-game/triplea/issues (20:20:33) Cernel: maybe the old version will be kept up longer than intended if the new one keeps having problems I don't know (20:20:49) Patton77: 401 (20:20:52) Patton77: finally Patton77 slapped sailingcondor (20:21:17) GeneralEmCgee: For what it's worth, removing map details SLIGHTLY helps (20:21:28) GeneralEmCgee: but it's still frustrating footcavalry has joined (20:21:38) GeneralEmCgee: I don't know if it's the new features implemented or what (20:22:05) Cernel: I relaunched bot 102 that should now work for some time at least Tjeerd has joined thekhan has left DogfacePuffer has left eboy82 has left (20:27:15) Patton77: bot 401 is broken (20:27:22) Patton77: bidibidibidi (20:27:59) footcavalry: so is 407-- i cannot reset game to new map (20:28:09) Patton77: 102 (20:28:13) Patton77: condor (20:28:18) sailingcondor: yes (20:28:39) Cernel: I'll disconnect 401 and relaunch 407 (20:28:47) Patton77: ty Tjeerd slapped NGMC (20:29:15) Tjeerd: hi ng, time for a game? (20:29:17) NGMC: hi (20:29:18) NGMC: in game (20:29:19) NGMC: ty (20:29:20) NGMC: :) (20:29:23) Tjeerd: where? (20:29:26) Patton77: 102 works! (20:29:29) NGMC: 306 footcavalry has left footcavalry has joined Silenroc has left Sativa has joined (20:31:50) Cernel: actually, nevermind on 401, to mee it seems that one works already (20:32:01) Cernel: or at least I can change the game (20:32:26) footcavalry: hi cernal--i am having difficulty in selecting/resetting maps (20:32:42) Cernel: I just changed the map on 401, works for me (20:32:58) Cernel: I relaunched 407 so you can try that one (20:34:27) Cernel: but keep saving since pretty sure it will break down again eventually

DanVanAtta commented 4 years ago

I installed a JDK on 3 of the 4 bot servers, I'll update the server scripting soon. With that I was able to take thread dumps. I believe I have found a way to predict which servers are dead. I checked on all of the 1xx, 3xx and 4xx servers that were in an available state and the ones that are dead have the following thread counts:

admin@li1232-106:~$ grep -i "thread.state"  thread-dump-102-broken  | sort | uniq -c
     12    java.lang.Thread.State: RUNNABLE
      4    java.lang.Thread.State: TIMED_WAITING (on object monitor)
      2    java.lang.Thread.State: TIMED_WAITING (parking)
      1    java.lang.Thread.State: TIMED_WAITING (sleeping)
      1    java.lang.Thread.State: WAITING (on object monitor)
     17    java.lang.Thread.State: WAITING (parking)

The ones that are alive have (notice +1 sleeping):

admin@li1383-13:~$ grep -i "thread.state"  thread-dump-403-working   | sort | uniq -c
     12    java.lang.Thread.State: RUNNABLE
      4    java.lang.Thread.State: TIMED_WAITING (on object monitor)
      2    java.lang.Thread.State: TIMED_WAITING (parking)
      2    java.lang.Thread.State: TIMED_WAITING (sleeping)
      1    java.lang.Thread.State: WAITING (on object monitor)
     17    java.lang.Thread.State: WAITING (parking)

The missing thread is:

   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@11.0.7/Native Method)
        at org.triplea.java.Interruptibles.lambda$sleep$1(Interruptibles.java:108)
        at org.triplea.java.Interruptibles$$Lambda$422/0x00000008402c2040.run(Unknown Source)
        at org.triplea.java.Interruptibles.lambda$await$0(Interruptibles.java:41)
        at org.triplea.java.Interruptibles$$Lambda$423/0x00000008402c2840.get(Unknown Source)
        at org.triplea.java.Interruptibles.awaitResult(Interruptibles.java:78)
        at org.triplea.java.Interruptibles.await(Interruptibles.java:39)
        at org.triplea.java.Interruptibles.sleep(Interruptibles.java:108)
        at org.triplea.game.server.HeadlessGameServer.lambda$waitForUsersHeadless$3(HeadlessGameServer.java:196)
        at org.triplea.game.server.HeadlessGameServer$$Lambda$404/0x00000008402bd440.run(Unknown Source)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
DanVanAtta commented 4 years ago

I was able to repro this problem with this diff:

Screenshot from 2020-06-29 22-01-09

DanVanAtta commented 4 years ago

I believe this is the problem:

DanVanAtta commented 4 years ago

ServerLauncher#launchInternal is the code block that is at play, the operative part is the catch block, the logging and then everything that follows:

  @Override
  void launchInternal(@Nullable final Void none) {
    try {
      isLaunching = false;
      abortLaunch = testShouldWeAbort();
      if (!abortLaunch) {
        if (!remotePlayers.isEmpty()) {
          warmUpCryptoRandomSource();
        }
        log.info("Starting Game Delegates.");
        serverGame.startGame();
      } else {
        stopGame();
      }
    } catch (final RuntimeException e) {
      // no-op, this is a simple player disconnect, no need to scare the user with some giant stack
      // trace
      final var cause = e.getCause();
      if (!(cause instanceof ConnectionLostException)) {
        if (cause instanceof MessengerException) {
          // we lost a connection
          // wait for the connection handler to notice, and shut us down
          Interruptibles.await(
              () -> {
                if (!abortLaunch
                    && !errorLatch.await(
                        ClientSetting.serverObserverJoinWaitTime.getValueOrThrow(),
                        TimeUnit.SECONDS)) {
                  log.warning("Waiting on error latch timed out!");
                }
              });
          stopGame();
        } else {
          final String errorMessage =
              "Unrecognized error occurred. If this is a repeatable error, "
                  + "please make a copy of this savegame and report to:\n"
                  + UrlConstants.GITHUB_ISSUES;
          log.log(Level.SEVERE, errorMessage, e);
          stopGame();
        }
      }
    }
    // having an oddball issue with the zip stream being closed while parsing to load default game.
    // might be
    // caused by closing of stream while unloading map resources.
    Interruptibles.sleep(200);
    // either game ended, or aborted, or a player left or disconnected
    launchAction.handleGameInterruption(gameSelectorModel, serverModel);
    serverModel.setServerLauncher(null);
    serverModel.newGame();
    if (inGameLobbyWatcher != null) {
      throw new HttpInteractionException(429, "simulated rate limit");
      //      inGameLobbyWatcher.setGameStatus(GameDescription.GameStatus.WAITING_FOR_PLAYERS,
      // null);
    }
    launchAction.onGameInterrupt();
    log.info("Game Status: Waiting For Players");
  }
DanVanAtta commented 4 years ago

This has been a vulnerability since 2013: 16954429, even before then, it sounds like headless bots had other issues. The 'chaos-monkey' theory with 429's seems to be correct. Essentially 'random' errors being thrown when a headless bot communicates with lobby is revealing poison-pill error handling. Because an error communicating with lobby can be transient and happen at any time, this would have always been a problem, just very rare (but still a potential problem).

This implies we may want to do testing with a 'chaos-monkey' setting of our own where we generate random errors communicating to server (side-note, 'chaos-monkey' is the name for a netflix tool that generates random network errors when services are communicating with another, essentially what we've seen here).

DanVanAtta commented 4 years ago

I've prepared three patches that should resolve this issue:

6944 is a root cause fix, after that patch bots will be put into a state where after a failure they can still launch a new game. The patch essentially ensures that the lobby communication component that could fail is done 'after' bot state reset thereby leaving the bot in a good state rather than borking the bot state reset.

6945 will give us better visibility if this problem should happen again, we'll see that the map is failing to change and why. It should speed up time-to-resolution if we see any similar issues going forward.

6948 turns off lobby rate limiting to bots which made this problem magnitudes more likely to happen. The lobby rate limiting could be causing other problems on other code paths and should help resolve any related but still as-of-yet unknown problems. That update buys us time to harden the bots and analyze all code paths that interact with the server and to simulate errors on those paths and ensure that bots deal with those appropriately and do not fall into this situation where the bot becomes unusable.

DanVanAtta commented 4 years ago

Alll bots and lobby have been patched with the 3 updates mentioned (all are on version 2.0.20333). This problem should no longer happen. Please open a new issue if we see anything else.