Open mattpen opened 2 years ago
This happened again today, for ESP in rw
. It looks like it built and deployed the html file successfully, but did not update energy-skate-park.xml
. This is acceptable, since according to babel the screen names and title did not change in this translation.
This happened today for a production build of geometric-optics-basics 1.2.0. This is a new sim, and the first time that grunt production
has been run.
This happened twice on 11/29:
@pixelzoom reported on Slack that he redeployed GOB 1.2 and it was successful the second time. I reran the failed translation (ESP in lg) and the build succeeded.
This happened again today for my-solar-system, for a prototype deploy. A redeploy solved the problem.
From @zepumph's comment in #290:
This is occurring on the build-server in production ever since we upgraded to 19 over in https://github.com/phetsims/perennial/issues/268.
@jonathanolson recently that my-solar-system failed yesterday. The build-server logs looked like:
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: error: Could not parse screen names from sim: TimeoutError: Waiting failed: 30000ms exceeded
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: error: Stop status: waitForFunction phet
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: error: TimeoutError: Waiting failed: 30000ms exceeded
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at Timeout.<anonymous> (/data/share/phet/build-server/perennial/node_modules/puppeteer-core/lib/cjs/puppeteer/common/WaitTask.js:64:32)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at listOnTimeout (node:internal/timers:559:17)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at processTimers (node:internal/timers:502:7)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: error: BUILD ABORTED! TimeoutError: Waiting failed: 30000ms exceeded
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: error: TimeoutError: Waiting failed: 30000ms exceeded
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at Timeout.<anonymous> (/data/share/phet/build-server/perennial/node_modules/puppeteer-core/lib/cjs/puppeteer/common/WaitTask.js:64:32)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at listOnTimeout (node:internal/timers:559:17)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: at processTimers (node:internal/timers:502:7)
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: info: build aborted: checking out master for every repo in case build shas are still checked out
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: debug: Running grunt checkout-master-all from .
Dec 05 15:00:32 phet-server2.int.colorado.edu build-server[3140307]: debug: stdout: Running "checkout-master-all" task
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: stdout: error in git checkout master for repo number-compare
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: stdout:
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: stdout: Done.
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: Command grunt finished. Output is below.
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: stderr is empty.
Dec 05 15:00:33 phet-server2.int.colorado.edu build-server[3140307]: debug: stdout: Running "checkout-master-all" task
tagging @mattpen.
The first thing I'm going to try is to load parseScreenNames locally. It seems to be the only puppeteer usage that isn't using puppeteerLoad, so perhaps refactoring to that could solve things. I'm just guessing though.
@zepumph - this has been occurring since well before #268, the first occurence was December 2021, so probably not related to the upgrade unfortunately. It also only happens occasionally, although it has been increasing in frequency the past few weeks so probably worth bumping the priority. I think it is probably more related to some bottleneck on phet-server2 (although it was occuring on phet-server as well).
I'll take a first pass this afternoon, the history makes me more confident that using pupeeteerLoad will help things.
Also of note: once my maintenance releases get production deployed, it should help out with the translation builds.
Using this patch I can test parsing screen names locally. After running 5 times I didn't reproduce anything:
I would use this to test against my commits below:
I was able to get to a commit point converting this to use puppeteer. Here are a couple of changes that made it not just a exclusive refactor:
So this change isn't necessarily a fix, but puppeteerLoad does come with a lot more error handling. We may be able to see how things are happening. My guess is that this happens when there is a failure to deliver the webpage from phet.colorado.edu, but I'm not sure.
@mattpen I think the next step is a review and to put it into practice.
@zepumph - As far as review I'm a bit suspicious of the way Promises are handled in puppeteerLoad.js. Why are we calling resolve( new Error ...
instead of reject( new Error ...
? I also had trouble following the logic flow with the promise closure. If we used reject instead we could use try/catch blocks instead of type checking the result to see if it is an error.
I think this is essentially equivalent to what you wrote, but IMHO it uses async and promise patterns more semantically. Let me know what you think. I don't know what other uses would be affected by the change I suggested in puppeteerLoad, some additional refactoring may be needed and if it is working as-is then we can probably defer this.
As far as putting it in practice, I think we can just coordinate a time to pull master for perennial and restart the build-server on phet-server2, then do a deploy of chains to make sure it isn't broken. Then we'll just have to wait and see if the errors disappear or reduce in frequency.
~Wait - I think I see an error I introduced. If reject is called then we won't close the page or browser. I'll update the patch momentarily.~
Patch updated above.
I hear that your patch is more idiomatic. The file was written by @jonathanolson with errors resolving, and there have been a couple other cases across the project where we have resolved errors so it didn't really both me. I think I would most prefer to keep it the way it is if that is alright.
Are you available this afternoon for an integration pair?
@mattpen convinced me that we should refactor puppeteerLoad to reject errors, see https://github.com/phetsims/perennial/issues/291.
We updated the build server with the new version which uses puppeteerLoad. We both agree that this is not a fix, but that it should show the error much more specifically if it happens again.
We either think that it is a race condition from when we copy the all HTML to the NFS and then immediately try to access it through phet.colorado.edu, or a problem with resource limitations.
We'll come back the next time the error occurs.
There was a build failure for a
gu
translation of ESP this week. I verified that the sim file, the screen titles, and the database entry have all been added, so either the translator resubmitted the build successfully or this was somehow eventually correct.This error likely occured while waiting for puppeteer to parse screen names from the translated sim. If this starts recurring we should consider (a) increasing the timeout for puppeteer or (b) adding in some error handling and retries. This likely just occurred while the server was busy.
For now, I'm just going to wait and see what happens as it seems like this isn't causing significant problems.