rollerderby / scoreboard

CRG Derby Scoreboard
Other
136 stars 55 forks source link

[Possible Issue] Reports of CRG hanging/freezing during games #538

Closed bullseye555 closed 2 years ago

bullseye555 commented 2 years ago

Hi @frank-weinberg - we've had some isolated reports come through of CRG v5.0.7 hanging/freezing during games, or during attempted export of statsbooks. There hasn't been a widespread report of issues, so they may be isolated - but would appreciate if you could have a look

Please see following screenshots from the FB Group Note that since this was raise, neither Jen or myself have had a chance to attempt to replicate this - and we used v5.0.6 at the event we had yesterday (we also thought there were some issues at a tournament being held in Queensland this weekend, but the issues they were having were apparently network related - they still have another day to run, and I've reached out to the team to get any further feedback on CRG issues not related to their network issue)

I had an issue when running a scrim with it last weekend, but as I'd been having issues with the device it was running on - I just chalked it up to the laptop having a hissy-fit, as a reboot resolved the issue (I did see the same memory and CPU spike in Java as shown below, though, which is why I mentioned it. It occurred during the change over from a 30 min single-half game, to a new game - Ruleset page wouldn't load to allow me to add a new ruleset, and I couldn't start a new game as the operator panel looked like that in the screenshots below from Isabelle)

It appears to happen later in the game and not towards the start. I'm going to aim to run a virtual game in the next 48 - 72 hours to see if I can replicate it and @JeneralPain was planning on running a debug Java session, so hopefully she can do that at some point too

1: image

2: (Isabelle is another Aussie NSO that Jen & I both know) image -- This photo taken on the night of the issue image

image -- This screenshot taken from the re-creation of the issue the following day image -- Screengrab recording of the video uploaded. I've let it loop a couple of times - the flickering occurs around about the 2 second mark of the video https://user-images.githubusercontent.com/28804763/193450284-03f5043f-506f-487f-a8b9-50982dce46de.mp4

frank-weinberg commented 2 years ago

Let's start at the bottom.

bullseye555 commented 2 years ago

Thanks @frank-weinberg

I've passed these comments back to Isabelle and Keleigh. Will come back to you with anything further - aiming to mimic a game tonight to see if I can replicate any of the issues

bullseye555 commented 2 years ago

Ran a game and experienced the hang and crash at the end of the game. Had Java doing Log & Debug, but it appears due to the crash the file has filed to write

Steps:

  1. Start 5.0.7 (Noting that this is base CRG with no custom port. Also no Statsbook loaded in folder or added to config)
  2. Import IGRF
  3. Run full game (2x 30 halves. Cut intermission from 15 mins to about 1)
  4. At completion of game, mark as Official Final
  5. Navigated away from Operator Panel At this point, the web page failed to successfully navigate to the Index page. Screenshot_1 Screenshot_2 I let it run without intervention, taking into consideration your note that it may take some time for checking After about 5 mins, Java closed and restarted Java window had the following after the restart Screenshot_3 Index page finally loaded about about 10 mins (5mins pre-crash + 5 mins post-crash) Then navigated to the Games/Team/Rulesets page, but nothing loaded Screenshot_4 Java screen had another error (repeat of the same one) Screenshot_5 I then manually ended Java and re-opened with the following Screenshot_6

I also noticed that the statsbook would briefly flashup as a TMP file in the base folder intermittently Screenshot_7

Was also able to navigate to the Operator Panel - showing lack of data Screenshot_8

Reviewing the CONFIG\AUTOSAVE folder, I can see the 0sec and 10sec JSONs are missing - I have NOT deleted these myself (they are also missing from the \BACKUP folder) Screenshot_9

I've zipped the full folder so you can have a look and see if there's anything unusual crg-scoreboard_v5.0.7.zip

Isabelle has also confirmed that there was only the laptop in the image (no other devices connected) and there was no one actively using the device at the time the data disappeared

bullseye555 commented 2 years ago

JSON files from Keleigh's 2 games she had issues with exporting from 5.0.7 (sorry, these are links. On my phone now about to go to bed and can't upload as txt from phone)

Game1

Game 2

frank-weinberg commented 2 years ago

I found the source for the errors on restart: For each connection the backend stores a copy of the data so it can hand it out as requested without having to wait for other connections or the part that does the processing. The error happens if the first request for data comes in before this copy is initialized. And on restarts the backend is very busy with loading the autosave etc. while the browser has the page already loaded and is ready to ask for data immediately. I'm not sure how (the webserver is a third party library), but somehow this appears to cause things to go bananas beyond that one connection causing the high system loads which in turn can explain subsequent page loads failing and the autosaves not being written properly.

This can also explain your initial failure (things started to go wrong on a page load and the index page is one that should have a very short time until the first data request) as well as Kel's (if she loaded up game 1 in order to export it).

I'm less sure about Isabelle's scenario - where would a new connection be initiated when the computer is just sitting there idle. It may be that the page that caused the issue was loaded earlier and it just took several minutes for the error to result in this connection showing symptoms. Or there is another gremlin lurking somewhere.

The good news is that fixing this issue is easy. So I'll release 5.0.8 after dinner and then let's hope that all of the symptoms were caused by the same root cause.

PS: The code in question is unchanged since 4.0 - the bug just didn't manifest until now. (It's becoming more likely as the data contains more games - 4.x only ever held one and with 5.x it looks like people got lucky so far - maybe the additional time to check if each json and xlsx exists and is readable is what tipped it over.)

frank-weinberg commented 2 years ago

5.0.8 is ready.

bullseye555 commented 2 years ago

Thanks for jumping on this so quickly @frank-weinberg !

frank-weinberg commented 2 years ago

Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.

@bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)

bullseye555 commented 2 years ago

Will do post tonight after training

Also happy to spin up a user on my server that you can do some testing yourself on a Windows pc

On Sun, 9 Oct 2022, 16:57 frank-weinberg, @.***> wrote:

Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.

@bullseye555 https://github.com/bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)

— Reply to this email directly, view it on GitHub https://github.com/rollerderby/scoreboard/issues/538#issuecomment-1272463341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG3YNG6ZXOVQY6JPCYSPESDWCJND5ANCNFSM6AAAAAAQ23MTOA . You are receiving this because you were mentioned.Message ID: @.***>

JeneralPain commented 2 years ago

Is announced

On Sun, Oct 9, 2022 at 4:57 PM frank-weinberg @.***> wrote:

Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.

@bullseye555 https://github.com/bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)

— Reply to this email directly, view it on GitHub https://github.com/rollerderby/scoreboard/issues/538#issuecomment-1272463341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI4RNRDFZ6NLGSDDJEOYG3DWCJND5ANCNFSM6AAAAAAQ23MTOA . You are receiving this because you were mentioned.Message ID: @.***>

bullseye555 commented 2 years ago

Ran a 2x 10min half game (with 1 min intermission) Went to Official Score - no issue while remaining on SBO Panel, but as soon as I hit Back to index page everything hung Auto-Save 0sec file self deleted (not sure when, as I wasn't monitoring)

"Deleted" (moved to a different location) the game-data JSON and restarted CRG - java still using a LARGE amount of CPU ~5 mins after restarting Java

image

Restarted device Opened CRG and navigated to Operator panel Game restored to previous instance without adding in the 0sec JSON file, or the game-data JSON - looks to have used 10sec ago file, as per Java window

image

Let me know if you want me to try a longer game (15 or 20 min halves)

frank-weinberg commented 2 years ago

When you say no issue after official score - was the intermission clock still running normal? Because we observed it to become increasingly sluggish until it at some point came to a standstill. (I intended to mention this as something to look out for but then forgot about it when I eventually got around to make the comment.)

The disappearing auto saves are easy to explain: When it's time to create a new auto save the oldest one is deleted all the remaining ones are renamed to +10s and then a new one is created. Creation of the new one is then affected by the sluggishness.

frank-weinberg commented 2 years ago

And after digging through the code to see what changes when the score is set to final, I have a theory:

So I'm assuming that on windows the brief time between steps 2 and 3 is long enough for the readable check to fail causing another write and then one more when the file reappears. And each of those then creates another two writes and so on. Whereas on Linux the renaming is fast enough that the check finds the file readable again and thus no value changes and we do not do more writes.

frank-weinberg commented 2 years ago

The easiest way to test this:

bullseye555 commented 2 years ago

When you say no issue after official score - was the intermission clock still running normal?

I only let it count down for about 30 seconds. In hindsight, it did start looking a little sluggish towards the end before I navigated away from the page

The easiest way to test this:

  • Start a game
  • at any point between starting the game and making the score final go to the IGRF tab and click "Update"
  • make the score final at the end of the game and check for symptoms

I ran a quick 5 min half game, did an Update during half time, and didn't experience any issues at the end of the game

I won't have time to run a longer 'game' tonight, but will re-run a 10min half game tomorrow

frank-weinberg commented 2 years ago

When you run a longer game try this version - if my theory is correct it should fix the issue. crg-scoreboard_v5.0.8-dirty-20221012085016.zip

bullseye555 commented 2 years ago

@frank-weinberg - Apologies this tool a little longer than expected

Ran using the linked version and had no issues when running a 2x20min game

As before - ran the game end-to-end Set as Official Monitored Intermission clock for ~2mins (didn't seem to get laggy) Monitored data files (0sec ago etc) and all appeared to be creating correctly, with none missing like in the past Navigated away from Operator Panel and performed other operations without visible delay Able to Update and download JSON and Statsbook (noting that it did appear to take longer to perform the updated before I could download, but if that's the tradeoff so we don't have the bug - then that, to me, is an acceptable solution)

frank-weinberg commented 2 years ago

Thank you very much @bullseye555 - without your testing this would have been a lot harder to pin down and fix. I'll release 5.0.9 right away. (And the fix should not affect export times at all. Given your wording I assume that's either skewed perception or some other external factor.)

bullseye555 commented 2 years ago

Thanks @frank-weinberg for the quick resolve

5.0.9 announced in FB group