Closed bullseye555 closed 2 years ago
Let's start at the bottom.
Thanks @frank-weinberg
I've passed these comments back to Isabelle and Keleigh. Will come back to you with anything further - aiming to mimic a game tonight to see if I can replicate any of the issues
Ran a game and experienced the hang and crash at the end of the game. Had Java doing Log & Debug, but it appears due to the crash the file has filed to write
Steps:
I also noticed that the statsbook would briefly flashup as a TMP file in the base folder intermittently
Was also able to navigate to the Operator Panel - showing lack of data
Reviewing the CONFIG\AUTOSAVE folder, I can see the 0sec and 10sec JSONs are missing - I have NOT deleted these myself (they are also missing from the \BACKUP folder)
I've zipped the full folder so you can have a look and see if there's anything unusual crg-scoreboard_v5.0.7.zip
Isabelle has also confirmed that there was only the laptop in the image (no other devices connected) and there was no one actively using the device at the time the data disappeared
I found the source for the errors on restart: For each connection the backend stores a copy of the data so it can hand it out as requested without having to wait for other connections or the part that does the processing. The error happens if the first request for data comes in before this copy is initialized. And on restarts the backend is very busy with loading the autosave etc. while the browser has the page already loaded and is ready to ask for data immediately. I'm not sure how (the webserver is a third party library), but somehow this appears to cause things to go bananas beyond that one connection causing the high system loads which in turn can explain subsequent page loads failing and the autosaves not being written properly.
This can also explain your initial failure (things started to go wrong on a page load and the index page is one that should have a very short time until the first data request) as well as Kel's (if she loaded up game 1 in order to export it).
I'm less sure about Isabelle's scenario - where would a new connection be initiated when the computer is just sitting there idle. It may be that the page that caused the issue was loaded earlier and it just took several minutes for the error to result in this connection showing symptoms. Or there is another gremlin lurking somewhere.
The good news is that fixing this issue is easy. So I'll release 5.0.8 after dinner and then let's hope that all of the symptoms were caused by the same root cause.
PS: The code in question is unchanged since 4.0 - the bug just didn't manifest until now. (It's becoming more likely as the data contains more games - 4.x only ever held one and with 5.x it looks like people got lucky so far - maybe the additional time to check if each json and xlsx exists and is readable is what tipped it over.)
5.0.8 is ready.
Thanks for jumping on this so quickly @frank-weinberg !
Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.
@bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)
Will do post tonight after training
Also happy to spin up a user on my server that you can do some testing yourself on a Windows pc
On Sun, 9 Oct 2022, 16:57 frank-weinberg, @.***> wrote:
Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.
@bullseye555 https://github.com/bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)
— Reply to this email directly, view it on GitHub https://github.com/rollerderby/scoreboard/issues/538#issuecomment-1272463341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG3YNG6ZXOVQY6JPCYSPESDWCJND5ANCNFSM6AAAAAAQ23MTOA . You are receiving this because you were mentioned.Message ID: @.***>
Is announced
On Sun, Oct 9, 2022 at 4:57 PM frank-weinberg @.***> wrote:
Reopening - This is not fully fixed. Was running 5.0.8 on Windows at a tournament yesterday and when hitting "Final Score" after Game 1 the symptoms appeared. Restarting solved it in that case. After Game 2 at hitting "Final Score" the symptoms appeared again and this time a restart did not help. Went to 5.0.6 after that and the rest of the day went fine.
@bullseye555 https://github.com/bullseye555 could you please 1 - announce on facebook that for the time being people should use 5.0.6 on windows. For Linux 5.0.8 seems to be fine. 2 - run some tests on windows (try if this appears with a very short game to save time) where you get into a state of restarts failing and see if you can get out of that state by deleting the game-data files or selectively removing parts of the autosave. (I would do this myself but don't have access to a windows machine.)
— Reply to this email directly, view it on GitHub https://github.com/rollerderby/scoreboard/issues/538#issuecomment-1272463341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI4RNRDFZ6NLGSDDJEOYG3DWCJND5ANCNFSM6AAAAAAQ23MTOA . You are receiving this because you were mentioned.Message ID: @.***>
Ran a 2x 10min half game (with 1 min intermission) Went to Official Score - no issue while remaining on SBO Panel, but as soon as I hit Back to index page everything hung Auto-Save 0sec file self deleted (not sure when, as I wasn't monitoring)
"Deleted" (moved to a different location) the game-data JSON and restarted CRG - java still using a LARGE amount of CPU ~5 mins after restarting Java
Restarted device Opened CRG and navigated to Operator panel Game restored to previous instance without adding in the 0sec JSON file, or the game-data JSON - looks to have used 10sec ago file, as per Java window
Let me know if you want me to try a longer game (15 or 20 min halves)
When you say no issue after official score - was the intermission clock still running normal? Because we observed it to become increasingly sluggish until it at some point came to a standstill. (I intended to mention this as something to look out for but then forgot about it when I eventually got around to make the comment.)
The disappearing auto saves are easy to explain: When it's time to create a new auto save the oldest one is deleted all the remaining ones are renamed to +10s and then a new one is created. Creation of the new one is then affected by the sluggishness.
And after digging through the code to see what changes when the score is set to final, I have a theory:
So I'm assuming that on windows the brief time between steps 2 and 3 is long enough for the readable check to fail causing another write and then one more when the file reappears. And each of those then creates another two writes and so on. Whereas on Linux the renaming is fast enough that the check finds the file readable again and thus no value changes and we do not do more writes.
The easiest way to test this:
When you say no issue after official score - was the intermission clock still running normal?
I only let it count down for about 30 seconds. In hindsight, it did start looking a little sluggish towards the end before I navigated away from the page
The easiest way to test this:
- Start a game
- at any point between starting the game and making the score final go to the IGRF tab and click "Update"
- make the score final at the end of the game and check for symptoms
I ran a quick 5 min half game, did an Update during half time, and didn't experience any issues at the end of the game
I won't have time to run a longer 'game' tonight, but will re-run a 10min half game tomorrow
When you run a longer game try this version - if my theory is correct it should fix the issue. crg-scoreboard_v5.0.8-dirty-20221012085016.zip
@frank-weinberg - Apologies this tool a little longer than expected
Ran using the linked version and had no issues when running a 2x20min game
As before - ran the game end-to-end Set as Official Monitored Intermission clock for ~2mins (didn't seem to get laggy) Monitored data files (0sec ago etc) and all appeared to be creating correctly, with none missing like in the past Navigated away from Operator Panel and performed other operations without visible delay Able to Update and download JSON and Statsbook (noting that it did appear to take longer to perform the updated before I could download, but if that's the tradeoff so we don't have the bug - then that, to me, is an acceptable solution)
Thank you very much @bullseye555 - without your testing this would have been a lot harder to pin down and fix. I'll release 5.0.9 right away. (And the fix should not affect export times at all. Given your wording I assume that's either skewed perception or some other external factor.)
Thanks @frank-weinberg for the quick resolve
5.0.9 announced in FB group
Hi @frank-weinberg - we've had some isolated reports come through of CRG v5.0.7 hanging/freezing during games, or during attempted export of statsbooks. There hasn't been a widespread report of issues, so they may be isolated - but would appreciate if you could have a look
Please see following screenshots from the FB Group Note that since this was raise, neither Jen or myself have had a chance to attempt to replicate this - and we used v5.0.6 at the event we had yesterday (we also thought there were some issues at a tournament being held in Queensland this weekend, but the issues they were having were apparently network related - they still have another day to run, and I've reached out to the team to get any further feedback on CRG issues not related to their network issue)
I had an issue when running a scrim with it last weekend, but as I'd been having issues with the device it was running on - I just chalked it up to the laptop having a hissy-fit, as a reboot resolved the issue (I did see the same memory and CPU spike in Java as shown below, though, which is why I mentioned it. It occurred during the change over from a 30 min single-half game, to a new game - Ruleset page wouldn't load to allow me to add a new ruleset, and I couldn't start a new game as the operator panel looked like that in the screenshots below from Isabelle)
It appears to happen later in the game and not towards the start. I'm going to aim to run a virtual game in the next 48 - 72 hours to see if I can replicate it and @JeneralPain was planning on running a debug Java session, so hopefully she can do that at some point too
1:
2: (Isabelle is another Aussie NSO that Jen & I both know) -- This photo taken on the night of the issue
-- This screenshot taken from the re-creation of the issue the following day -- Screengrab recording of the video uploaded. I've let it loop a couple of times - the flickering occurs around about the 2 second mark of the video https://user-images.githubusercontent.com/28804763/193450284-03f5043f-506f-487f-a8b9-50982dce46de.mp4