Closed prastle closed 4 years ago
@prastle I see the bots with updated system specs are still resized; would you mind converting them back to how they were? It's best if we keep the bots homogeneous. It's not really surprising that the resize did not change anything, as we have sized max memory usage to account for how many bots are run, and there is plenty of free disk space.
Some questions:
The only solution until the devs can fix them is to restart the servers directly from Linode.
Please avoid restarting bot servers until we've had a chance to look at them.
Locked up means no one can join them; locked up means no one can load a new game... THUS USELESS. Only fix atm is restart Linode. And btw I changed nothing other than a resize; they are still YOUR BOTS that you loaded; none of these problems in 1.9 @DanVanAtta. If you join the lobby and snoop around I am certain you will find half or most "locked up"; since my last reload was 6 hrs? ago, go have a look
And to be said, if I had avoided restarting bots, all players would not have played for 2.5 days @DanVanAtta. I did mention this, as did silver, in many places and to all devs
@prastle no need for the hyperbole or the caps, keep calm.
If nobody can join a locked up bot, how can you change map?
For the bots that you can join, was anything done to them?
Have you tried running a moderator 'shutdown' on the "locked up bots"?
Please do revert the system settings back to what they were since they did not help.
Understood that some bots needed help, I don't think it would have been 2.5 days though; for such serious situations do not wait for 2 days to post a bug report.
Is it the case that some bots you can join, and others you cannot? Though, when you can join a bot, the map selection might be broken?
And btw I changed nothing other than a resize they are still YOUR BOTS that you loaded none of these problems in 1.9
We're in this together; it is the project and we are a team. I try to avoid the words "mine" and "yours" when talking about the project. It is not my project, nor any one person's project; it is not my code, nor 'your code'; it belongs to TripleA. Part and parcel is that people will all rotate in and out of the project, and the maintainers are ultimately the group that will need to fix (or not fix) any problems.
I'm not terribly surprised we're seeing some 2.0 pains with bots. The headless bot code is full of hacks and was done in an utter rush to enable players to join games without port forwarding.
Yes, some you can join and some you cannot; atm it seems like a time thing, thus why I added more RAM and hard drive, to no effect. Simple answer: all bots lock after a few hrs unless they still have a game going that has not crashed.
Moderator shutdown makes the bot reappear, still broken, a few seconds later, so that is a no-go as well... In addition, since the ability to boot a ghost is gone for mods, we might be dealing with two or three problems here
The first report was created on Friday; it is Sunday. End of story. Granted, I was camping and did not make the first report.
@prastle this issue was created 10 hours ago. I think this highlights how problematic it is that our communication is still split between multiple channels.
Moderator shutdown makes the bot reappear still broken a few seconds later so that is a no go as well...
Do you have any bot numbers where this was the case?
I think we probably need a way to log moderator actions. I'm trying to correlate events in the logs but not knowing what happened to which bots, and which were in what state is making it very difficult.
In addition to this since the ability to boot a ghost is gone for mods we might be dealing with 2 or three problems here
Maybe, but 'ghost' players are probably a side-effect. Running a shutdown should get rid of them.
The last moderator shutdown I ran as a test was after you asked earlier above ^^, on the same bot I asked you to look at when you were live in the lobby with me...
bobstein has joined
LaFayette has joined
Willfistwu has left
nicmontreal44 has left
bobstein has left
Sativa has left
Vlada has left
(4:53:53 p.m.) Maraxus: global game anyone?'
DwayneDibly has joined
DwayneDibly slapped Maraxus
(4:57:13 p.m.) DwayneDibly: Anyone looking for a game?
prastle slapped LaFayette
bobstein has joined
(4:57:42 p.m.) prastle: join a few :) you will see
MississippiKid slapped RickJame5
andrewthree has joined
(4:58:44 p.m.) prastle: example 307 Cali
andrewthree has left
frank-the-tank has left
eboy82 has left
(5:01:37 p.m.) prastle: well you could stay long enough to test
(5:01:49 p.m.) prastle: example launch the save or switch to a new map
(5:01:59 p.m.) prastle: no matter
Wassmuss has left
This was bot 307 California which rejoined after I moderator stopped it a few seconds later...
After this post I can add some interesting news... Apparently my ghost is still in that bot from when I was trying to show you; yours left... 7:42 p.m. ish, bot 307 Cali
I believe this problem has been reported 2 days ago: https://github.com/triplea-game/triplea/issues/6875
To me, shutdown appears working fine (I assume the host will eventually break down again, though).
However, I really think "shutdown" should be renamed to something better, to explain that you are bringing the host back to its initial state. The name really makes me think you are nuking the bot definitively, and I think moderators might not use it because of not knowing what they are actually doing to the host.
A rename would be good, Cer, since it rarely accomplishes the goal of a restart
I actually operated several "shutdowns" on bots broken as per what is described in this issue, and it always seemed to work as I understand it should. However, I've no idea how long the hosts remained in working condition thereafter.
Join 307, Cer; I have not fixed it, and note it will not switch a map for me or launch
Good bug reports help get problems fixed faster. We need to isolate what is happening, a clear explanation of where we are seeing the problem is really useful.
Steps to reproduce should be a list of what needs to be done to see the problem. Joining bots (and failing), changing maps, and starting a game seem to be the three observable ways to trigger this problem.
I'm curious if there are any error messages that we have observed during all three of those. For changing maps, I saw the error was silent, that simply nothing happened.
Getting bot numbers on any of these issues would be helpful as well so we know where to look in the logs (along with time stamps of when problems happened).
Substantially, the problem with bots is that, here, the ones that should have done the bug reporting are obviously the users experiencing the problems (thus, for example, not me, as I virtually never use bots). However, almost only relatively advanced users ever report anything, or make good bug reports (as you say), and such users, I believe, are usually not bot users.
I think the only proper bug report, made in the proper place by a user experiencing the problem, is the one that I already linked, but it seems it got overlooked by everyone (at least for the past 2 days).
Either way, 207 is locked atm, and I am sure a few others are too... Sniff around, gents, I am going to bed...
It's a tough balance between asking detailed questions and users not doing anything. Really, if the prompts are read and followed, the bug report should be quite good quality. A problem is people tend to want to say "this is not working" and leave it at that, which always leads to the obvious questions: "what is not working?", "what does that even mean: error message, silent failure, something else?"
Regardless, the finger pointing is not interesting. It's valuable though for us to be concise and specific.
My working theory on this problem is server-side 429s. We typically would not expect them, as the bot should be 'well-behaved.'
I think the confluence of uploading chat messages and sending stack traces as individual chat messages is really the trigger. The `gameOverException` that we have had for some time is generally benign and arguably an error-handling bug, as the 'exception' seems to happen in exceptionally normal situations where users just use the application normally. This triggers a potentially bad sequence of events: any subsequent updates within the next minute can then hit a rate-limit (429) error from the server. Testing out a few code paths, an exception on `updateStatus` can lead to a dead game that you cannot join.
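To make that sequence concrete, here is a minimal sketch of a fixed-window rate limiter and how a stack trace posted line-by-line exhausts it. The limit of 10 requests per minute and all names here are assumptions for illustration, not the real lobby server's values:

```java
// Sketch only: a fixed-window rate limiter with an assumed limit, showing
// why uploading a stack trace as one chat message per frame causes the
// bot's next status update to be rejected with a 429.
public class RateLimitSketch {
  static final int MAX_REQUESTS_PER_MINUTE = 10; // assumption, for illustration

  private int used = 0;

  /** Returns an HTTP-style status code for one request in the current window. */
  int request() {
    return ++used <= MAX_REQUESTS_PER_MINUTE ? 200 : 429;
  }

  public static void main(String[] args) {
    RateLimitSketch limiter = new RateLimitSketch();
    // A gameOverException stack trace easily has 30+ frames; each one is
    // uploaded as its own chat message, burning through the window.
    int lastFrameStatus = 0;
    for (int frame = 0; frame < 30; frame++) {
      lastFrameStatus = limiter.request();
    }
    // The bot's next updateStatus call in the same window is then rejected.
    int updateStatus = limiter.request();
    System.out.println("last frame upload: " + lastFrameStatus); // prints 429
    System.out.println("updateStatus: " + updateStatus); // prints 429
  }
}
```

It is the rejected `updateStatus` call, not the chat spam itself, that leaves the game in a dead state.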
Short term, we can turn rate limiting to 'reporting mode' so it is no longer enforcing. Perhaps we should also change bots to only report an exception message and not the whole stack trace.
Long term, we'll likely need to harden any bot interaction with the server. In theory it was a likely vulnerability already; to some extent the rate limiting has been a 'chaos monkey' experiment in how bots handle failed requests to the server (apparently not very well, after observing some errors injected).
I have not been able to reproduce the behavior we are seeing where map switches fail silently. I wonder if the game-update 429 errors are often killing the java socket acceptor thread, which then kills communication to the server. I suspect there must be more to it, as the failure to change maps is a silent error with no error message on the server (so I would have expected an error sending the message to be reported client side; perhaps there is no error, or there is one and we are suppressing it).
I created the issue out of exasperation about 12 hrs after the first Linode reset, because the devs had not responded to any of the issues that the players or mods posted... Let's not forget I was camping. So yes, I 100% agree that the way devs receive important info is failing... :)
We should probably extract this into a new issue if we want to dive into the post-mortem.
I was happily camping as well, so we should also bake in that we're not all going to be available all the time, and not take each other's time for granted, as it's not always available. I was wondering where you were the last few days @prastle, was missing your cheery banter : )
I'm going to hide all posts not related to the immediate problem. Again, if we want to discuss how to better communicate, let's do that in a dedicated thread and not in a thread dedicated to another important problem that is very technical.
Pasting from the current lobby chatlog, in case it might be of any use (feel free to mark this post as off-topic if not):
(20:02:32) Ostronald: are the bots stuck again?
(20:02:46) panzerfury: looks wierd
(20:02:58) sailingcondor: they arent resetting
willy1 has left
(20:03:13) Ostronald: so which are still workingß
champ74 has joined
(20:04:44) Ostronald: they all seem to be stuck :(
jan_ has left
(20:05:27) Navalland: hi
(20:05:28) Ostronald: 203
(20:05:30) Ostronald: is working
Ostronald slapped panzerfury
(20:05:40) Cernel: hey
(20:05:48) Cernel: 270bc wars?
(20:05:56) Navalland: yes ll
(20:06:05) Cernel: read my mind
seancb slapped akaroadman
(20:07:00) akaroadman: ??
(20:07:03) seancb: dc
(20:07:04) seancb: save
interstellar has left
(20:07:36) Cernel: when a bot breaks down while you use it you are supposed to file a bug report, just letting everyone know
seancb slapped akaroadman
(20:08:01) Cernel: noone else can give as much context and help determining what caused the breakdown
(20:08:29) akaroadman: 10
(20:08:34) akaroadman: 103
(20:09:04) GeneralEmCgee: every single bot I use on this new server has an atrocious lag in correlation with the game's reactive feedback to real time
(20:09:20) GeneralEmCgee: it's bad, like real bad
thekhan has joined
thekhan slapped NGMC
(20:10:18) GeneralEmCgee: can't play a single damn game on any of these servers without a gaping issue like this
Patton77 has joined
(20:10:26) NGMC: hi
(20:10:28) NGMC: sorry
(20:10:29) NGMC: in game
(20:10:30) NGMC: :)
(20:10:41) sailingcondor: revised anyone?
(20:10:54) Cernel: ok guess I'll post this in the relevant thread, but you should really make a bug report yourself
(20:10:58) Cernel: if you haven't yet
Patton77 slapped sailingcondor
(20:11:07) thekhan: you are not the only one that got problems with the bots
(20:11:11) Cernel: I cannot really report what others are experiencing
(20:11:12) Patton77: bot?
(20:11:18) Cernel: I never use bots (virtually)
(20:11:19) sailingcondor: where you be?
(20:11:21) seancb: You were slapped by seancb
(20:11:26) akaroadman: 304
(20:11:28) Patton77: 406
(20:11:28) Cernel: y?
(20:11:30) sailingcondor: ok
(20:11:41) seancb: is there any way to salvage a game we cannot save it
(20:12:01) Cernel: ah that's a difficult question, I don't really know, probably you need prastle, what bot?
(20:12:20) seancb: maybe we got it to save
(20:12:23) seancb: i'll find out
EDIT: More:
(20:12:48) AdamF: some bots work, some do not
KingKelly has left
samstreeter has joined
samstreeter has left
(20:13:50) Patton77: MOD?
(20:13:58) Patton77: is there a bot that works?
ramonet has left
(20:14:48) sailingcondor: so glad they did a reversion of the old version, which btw worked most of the time
(20:14:50) thekhan: lol
(20:15:29) thekhan: even happyer when the bugs are fixed and the bots work
Sativa has left
(20:15:44) sailingcondor: like why do we need to keep updating our cell phones every year. Im still trying to pay off the one I got two years ago
(20:15:49) Patton77: tried 5 bots now
(20:15:59) Cernel: when a bot stops working while you use it, please file a bug report giving as much information as you can
(20:16:08) AdamF: I did 2x cern
(20:16:20) AdamF: Return email said "problem has been resolved"
DSC has joined
(20:16:33) Patton77: 6
DogfacePuffer has joined
(20:16:36) Cernel: meanwhile, if you cannot find a working bot to use, any mod may relaunch a broken one (unless it is deemed opportune to leave it broken)
(20:16:51) AdamF: How do we relaunch?
(20:17:04) sailingcondor: where you be Patton?
(20:17:08) AdamF: Cannot change the game. Bots are 'stuck'
(20:17:09) Cernel: hey AdamF thanks for the bug report, but I actually believe yours was closed because of being duplicate
(20:17:17) Cernel: so be sure to report in the other one
(20:17:21) GeneralEmCgee: it's not that they don' work, but the gameplay is chalky as all hell, not even close to the smoothness of the old server
(20:17:24) GeneralEmCgee: You were slapped by GeneralEmCgee
(20:17:33) Patton77: can u make bot 102 work for example
(20:17:38) Cernel: well this is the first time I hear about this
(20:17:53) AdamF: OH
(20:17:55) Cernel: and guess the developers won't know either if noone tells them
Skippi has joined
(20:18:04) AdamF: But its clearly not resolved
(20:18:05) sailingcondor: the bots arent resetting
(20:18:14) GeneralEmCgee: well I'd be more than happy to talk every moderators ear off about my frustrations with the lag :p
(20:18:58) sailingcondor: here is a thought.,.. Go back to the original version...
(20:19:24) GeneralEmCgee: here is a thought, that's a half measure solution
(20:19:33) Cernel: actually moderators are not supposed to be developers or a link to developers, the link is at the start of the chatlog
(20:19:45) Cernel: Report any problems here: https://github.com/triplea-game/triplea/issues
(20:20:33) Cernel: maybe the old version will be kept up longer than intended if the new one keeps having problems I don't know
(20:20:49) Patton77: 401
(20:20:52) Patton77: finally
Patton77 slapped sailingcondor
(20:21:17) GeneralEmCgee: For what it's worth, removing map details SLIGHTLY helps
(20:21:28) GeneralEmCgee: but it's still frustrating
footcavalry has joined
(20:21:38) GeneralEmCgee: I don't know if it's the new features implemented or what
(20:22:05) Cernel: I relaunched bot 102 that should now work for some time at least
Tjeerd has joined
thekhan has left
DogfacePuffer has left
eboy82 has left
(20:27:15) Patton77: bot 401 is broken
(20:27:22) Patton77: bidibidibidi
(20:27:59) footcavalry: so is 407-- i cannot reset game to new map
(20:28:09) Patton77: 102
(20:28:13) Patton77: condor
(20:28:18) sailingcondor: yes
(20:28:39) Cernel: I'll disconnect 401 and relaunch 407
(20:28:47) Patton77: ty
Tjeerd slapped NGMC
(20:29:15) Tjeerd: hi ng, time for a game?
(20:29:17) NGMC: hi
(20:29:18) NGMC: in game
(20:29:19) NGMC: ty
(20:29:20) NGMC: :)
(20:29:23) Tjeerd: where?
(20:29:26) Patton77: 102 works!
(20:29:29) NGMC: 306
footcavalry has left
footcavalry has joined
Silenroc has left
Sativa has joined
(20:31:50) Cernel: actually, nevermind on 401, to mee it seems that one works already
(20:32:01) Cernel: or at least I can change the game
(20:32:26) footcavalry: hi cernal--i am having difficulty in selecting/resetting maps
(20:32:42) Cernel: I just changed the map on 401, works for me
(20:32:58) Cernel: I relaunched 407 so you can try that one
(20:34:27) Cernel: but keep saving since pretty sure it will break down again eventually
I installed a JDK on 3 of the 4 bot servers, I'll update the server scripting soon. With that I was able to take thread dumps. I believe I have found a way to predict which servers are dead. I checked on all of the 1xx, 3xx and 4xx servers that were in an available state and the ones that are dead have the following thread counts:
admin@li1232-106:~$ grep -i "thread.state" thread-dump-102-broken | sort | uniq -c
12 java.lang.Thread.State: RUNNABLE
4 java.lang.Thread.State: TIMED_WAITING (on object monitor)
2 java.lang.Thread.State: TIMED_WAITING (parking)
1 java.lang.Thread.State: TIMED_WAITING (sleeping)
1 java.lang.Thread.State: WAITING (on object monitor)
17 java.lang.Thread.State: WAITING (parking)
The ones that are alive have (notice +1 sleeping):
admin@li1383-13:~$ grep -i "thread.state" thread-dump-403-working | sort | uniq -c
12 java.lang.Thread.State: RUNNABLE
4 java.lang.Thread.State: TIMED_WAITING (on object monitor)
2 java.lang.Thread.State: TIMED_WAITING (parking)
2 java.lang.Thread.State: TIMED_WAITING (sleeping)
1 java.lang.Thread.State: WAITING (on object monitor)
17 java.lang.Thread.State: WAITING (parking)
The missing thread is:
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@11.0.7/Native Method)
at org.triplea.java.Interruptibles.lambda$sleep$1(Interruptibles.java:108)
at org.triplea.java.Interruptibles$$Lambda$422/0x00000008402c2040.run(Unknown Source)
at org.triplea.java.Interruptibles.lambda$await$0(Interruptibles.java:41)
at org.triplea.java.Interruptibles$$Lambda$423/0x00000008402c2840.get(Unknown Source)
at org.triplea.java.Interruptibles.awaitResult(Interruptibles.java:78)
at org.triplea.java.Interruptibles.await(Interruptibles.java:39)
at org.triplea.java.Interruptibles.sleep(Interruptibles.java:108)
at org.triplea.game.server.HeadlessGameServer.lambda$waitForUsersHeadless$3(HeadlessGameServer.java:196)
at org.triplea.game.server.HeadlessGameServer$$Lambda$404/0x00000008402bd440.run(Unknown Source)
at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
I was able to repro this problem with this diff:
I believe this is the problem:
- A 429 error during the lobby status update causes `launchAction.onGameInterrupt` to not be executed.
- If `onGameInterrupt` is not executed, the server game instance is left non-null, which causes the map to not change.
- `onGameInterrupt` invokes `waitForUsersHeadless()`, which spawns the thread that will start games. If `onGameInterrupt` is not executed, then the game-starting thread is not spawned and new games will not start.

`ServerLauncher#launchInternal` is the code block that is at play; the operative part is the `catch` block, the logging, and then everything that follows:
@Override
void launchInternal(@Nullable final Void none) {
try {
isLaunching = false;
abortLaunch = testShouldWeAbort();
if (!abortLaunch) {
if (!remotePlayers.isEmpty()) {
warmUpCryptoRandomSource();
}
log.info("Starting Game Delegates.");
serverGame.startGame();
} else {
stopGame();
}
} catch (final RuntimeException e) {
// no-op, this is a simple player disconnect, no need to scare the user with some giant stack
// trace
final var cause = e.getCause();
if (!(cause instanceof ConnectionLostException)) {
if (cause instanceof MessengerException) {
// we lost a connection
// wait for the connection handler to notice, and shut us down
Interruptibles.await(
() -> {
if (!abortLaunch
&& !errorLatch.await(
ClientSetting.serverObserverJoinWaitTime.getValueOrThrow(),
TimeUnit.SECONDS)) {
log.warning("Waiting on error latch timed out!");
}
});
stopGame();
} else {
final String errorMessage =
"Unrecognized error occurred. If this is a repeatable error, "
+ "please make a copy of this savegame and report to:\n"
+ UrlConstants.GITHUB_ISSUES;
log.log(Level.SEVERE, errorMessage, e);
stopGame();
}
}
}
// having an oddball issue with the zip stream being closed while parsing to load default game.
// might be
// caused by closing of stream while unloading map resources.
Interruptibles.sleep(200);
// either game ended, or aborted, or a player left or disconnected
launchAction.handleGameInterruption(gameSelectorModel, serverModel);
serverModel.setServerLauncher(null);
serverModel.newGame();
if (inGameLobbyWatcher != null) {
throw new HttpInteractionException(429, "simulated rate limit");
// inGameLobbyWatcher.setGameStatus(GameDescription.GameStatus.WAITING_FOR_PLAYERS,
// null);
}
launchAction.onGameInterrupt();
log.info("Game Status: Waiting For Players");
}
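One way to harden this path, sketched below with simplified stand-ins (these interfaces and method names are illustrative, not the real TripleA types), is to catch the transient lobby error and move the must-run cleanup into a `finally` block so nothing can skip `onGameInterrupt()`:

```java
// Sketch only: simplified stand-ins for the real TripleA types, showing the
// hardening idea that onGameInterrupt must run even if a lobby call throws.
public class LaunchCleanupSketch {
  interface LobbyWatcher {
    void setGameStatusWaiting(); // may throw, e.g. on a 429 from the server
  }

  interface LaunchAction {
    void onGameInterrupt(); // respawns the game-starting thread
  }

  static void afterGameEnds(LobbyWatcher watcher, LaunchAction action) {
    try {
      watcher.setGameStatusWaiting();
    } catch (RuntimeException e) {
      // Log and swallow a transient lobby error instead of letting it
      // poison the restart path.
      System.err.println("Failed to update lobby status: " + e.getMessage());
    } finally {
      // This must always run, or no new games can ever be started.
      action.onGameInterrupt();
    }
  }

  public static void main(String[] args) {
    final boolean[] interruptRan = {false};
    afterGameEnds(
        () -> { throw new RuntimeException("simulated 429 rate limit"); },
        () -> interruptRan[0] = true);
    System.out.println("onGameInterrupt ran: " + interruptRan[0]); // prints true
  }
}
```

With the simulated rate-limit exception from the diff above, the interrupt hook still fires, so the bot would return to a joinable state instead of going dead.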
This has been a vulnerability since 2013: 16954429, and even before then it sounds like headless bots had other issues. The 'chaos-monkey' theory with 429s seems to be correct. Essentially, 'random' errors being thrown when a headless bot communicates with the lobby are revealing poison-pill error handling. Because an error communicating with the lobby can be transient and can happen at any time, this would have always been a problem, just a very rare one (but still a potential problem).
This implies we may want to do testing with a 'chaos-monkey' setting of our own, where we generate random errors communicating with the server (side note: 'chaos monkey' is the name of a Netflix tool that generates random network errors when services are communicating with one another, essentially what we've seen here).
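A home-grown version could be as simple as a decorator around the lobby client that fails a fraction of calls. The `LobbyClient` interface, `updateStatus` method, and the 25% failure rate below are all hypothetical, standing in for whatever the bot actually uses to talk to the server:

```java
// Sketch of a 'chaos monkey' wrapper: randomly fail a fraction of lobby
// calls to exercise the bot's error handling. LobbyClient is a hypothetical
// stand-in, not the real TripleA API.
import java.util.Random;

public class ChaosLobbyClient {
  interface LobbyClient {
    void updateStatus(String status);
  }

  private final LobbyClient delegate;
  private final Random random;
  private final double failureRate;

  ChaosLobbyClient(LobbyClient delegate, double failureRate, long seed) {
    this.delegate = delegate;
    this.failureRate = failureRate;
    this.random = new Random(seed); // seeded so test runs are repeatable
  }

  void updateStatus(String status) {
    if (random.nextDouble() < failureRate) {
      // Simulate a transient server error such as a 429 response.
      throw new RuntimeException("injected failure: 429");
    }
    delegate.updateStatus(status);
  }

  public static void main(String[] args) {
    ChaosLobbyClient client = new ChaosLobbyClient(status -> {}, 0.25, 42L);
    int failures = 0;
    for (int i = 0; i < 100; i++) {
      try {
        client.updateStatus("WAITING_FOR_PLAYERS");
      } catch (RuntimeException e) {
        failures++;
      }
    }
    System.out.println("injected failures out of 100: " + failures);
  }
}
```

Running the bot against a wrapper like this in a test lobby would surface any remaining code paths where a single failed request leaves the game unjoinable.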
I've prepared three patches that should resolve this issue:
All bots and the lobby have been patched with the 3 updates mentioned (all are on version 2.0.20333). This problem should no longer happen. Please open a new issue if we see anything else.
How can the problem be recreated?
@SilverBullet @RoiEXLab @DanVanAtta @ron-murhammer The bots appear to lock up after an indeterminate number of rounds. It appears to happen with virtually all maps, not just revised. I resized three servers to double their RAM and hard drive size in case it was a memory problem; this has not fixed anything. The only solution until the devs can fix them is to restart the servers directly from Linode, which appears to fix the problem temporarily. Hopefully they can get the headless bots fixed shortly, since there do not appear to be any problems for individuals that host privately.
Game Version: Latest stable lobby bot servers.
Is there anything else we should know?
It appears to work again for a while if the Linode is restarted.