samreid closed this issue 1 year ago.
Is this something where we were trying to concurrently run multiple copies of git?
I'm not sure. @mattpen, is that possible?
I think this process was running on your machine, Sam. Did you have a git dialog open in WebStorm while you were running the deploy, perhaps? Or a separate terminal with a git operation in progress? This was not a problem with the server-side process.
I'm not aware of anything that was running concurrently, but maybe a process had failed silently somewhere? Want to close this issue and come back to it if it recurs?
This seems difficult to reproduce, and the consequences are low if it recurs. Closing sounds appropriate.
This happened again in Mean Share and Balance. https://github.com/phetsims/mean-share-and-balance/issues/127
This looks like a race condition between git operations in the grunt production task AFTER it makes the HTTP call to the build-server. I took a quick glance at production.js, and my guess is that the first line in that snippet is returning a successful Promise before the git commands that run in generateREADME.js actually complete. It seems like it would be good to just import generateREADME() and call it directly rather than calling await execute( gruntCommand, [ 'published-README' ],..., but that would introduce a dependency on chipper in perennial, which is no good. I'm not sure whether generateREADME.js could be moved to perennial, or whether there is a better solution. Since this problem is very transient, it will be difficult even to confirm that the problem I described is truly the root cause.
This is really outside my wheelhouse; I own the build-server code in perennial, not the grunt tasks. production.js was written by @jonathanolson and generateREADME.js was written by @pixelzoom. Can either of you help?
Sorry, this is also outside my wheelhouse. I wrote generateREADME.js 7+ years ago, as a standalone grunt task, and I haven't touched it since. And I have no familiarity with the current incarnation of the build-server, or how it uses generateREADME.js.
Just to clarify, the build-server does not use generateREADME.js; it is used by the grunt production task on the developer's machine.
We're potentially calling very old versions of chipper's published-README task, so we can't import generateREADME() from perennial.
This is causing a lot of issues when we try to push out production deployments. Investigating.
It looks like we're not missing an await: the execute calls are running serially for the entire duration.
It looks like https://github.com/phetsims/chipper/commit/6325d0dfe256cf16dec2f5a8424dce436ad5b6e0 turned generateReadme() into an async function, but did NOT add awaits at two call sites in chipper's Gruntfile (grunt published-README was affected). This created a race condition that depended on whether the grunt command finished executing the git add before it exited. The grunt process itself would END (so our await execute resolved), but the git command it had started would still be running (specifically git status --porcelain).
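Below is a minimal sketch of that race, simulated inside a single Node.js process with timers instead of real git processes. Every name in it (fakeGitCommand, generateReadmeHypothetical, taskWithoutAwait, taskWithAwait, isLockHeldAfter) is a hypothetical stand-in, not chipper's actual Gruntfile or generateREADME.js. The point is only that calling an async function without await lets the caller move on while the git step is still running.

```javascript
// Sketch of the missing-await race. All names are hypothetical stand-ins,
// not chipper's real Gruntfile or generateREADME.js.

// Simulates a git command (e.g. git status --porcelain) that holds a lock
// for `ms` milliseconds and then releases it.
function fakeGitCommand( lock, ms ) {
  lock.held = true;
  return new Promise( resolve => {
    setTimeout( () => {
      lock.held = false;
      resolve();
    }, ms );
  } );
}

// Stand-in for an async generateReadme() whose last step is a git command.
async function generateReadmeHypothetical( lock ) {
  await fakeGitCommand( lock, 20 );
}

// BUG: the returned promise is dropped, so the task returns while "git" is still running.
function taskWithoutAwait( lock ) {
  generateReadmeHypothetical( lock );
}

// FIX: awaiting the async function means the task completes only after "git" does.
async function taskWithAwait( lock ) {
  await generateReadmeHypothetical( lock );
}

// Runs a task the way a caller would (awaiting it) and reports whether the
// lock was still held at the moment the caller believed the task was done.
async function isLockHeldAfter( task ) {
  const lock = { held: false };
  await task( lock );
  return lock.held;
}

isLockHeldAfter( taskWithoutAwait ).then( held => console.log( 'without await:', held ) ); // true
isLockHeldAfter( taskWithAwait ).then( held => console.log( 'with await:', held ) ); // false
```

In the real failure the outer grunt process exited rather than a function returning, but the ordering problem is the same: the awaited execute() resolved while git was still working in the repository.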
Most of the failures we saw were on master (for README generation on master), but this will also affect release-branch deployments for branches created after that date. We'll need to patch the fix into those branches.
NOTES for tracking down file accesses and processes on macOS in the future:
strace isn't available on macOS, but a combination of the following helped:
- csrutil disable: disables System Integrity Protection (run from Recovery mode), which otherwise restricts DTrace-based tools such as newproc.d and opensnoop
- sudo newproc.d: helped track down created processes (showed the parent process and PID)
- sudo opensnoop -a -c -g -s -t: helped track down file accesses where the PID was present (confirmed a PID that we didn't launch directly)
- fs_usage -w -f filesys: helped track down thread IDs and processes that were hitting the files (but gave only thread IDs); gives a more fine-grained view of the system calls
- console.log( process.pid ): gave us process IDs to refer to from the other tools
- logging around execute() calls, to make sure there were no overlaps
Patches applied above.
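The macOS debugging notes mention logging process.pid and checking execute() for overlaps. Here is a sketch of what such instrumentation could look like, assuming a promise-returning wrapper around Node's child_process.spawn. The names executeLogged, active, and maxActive are hypothetical; this is not perennial's actual execute().

```javascript
const { spawn } = require( 'child_process' );

// Hypothetical instrumented stand-in for an execute() helper; not perennial's real code.
let active = 0; // child processes currently running
let maxActive = 0; // highest number of simultaneously running children seen

function executeLogged( cmd, args, cwd ) {
  return new Promise( ( resolve, reject ) => {
    const child = spawn( cmd, args, { cwd } );
    active++;
    maxActive = Math.max( maxActive, active );

    // Log our own PID and the child's PID so they can be matched against opensnoop/fs_usage output.
    console.log( `[${process.pid}] start child=${child.pid} active=${active}: ${cmd} ${args.join( ' ' )}` );
    if ( active > 1 ) {
      console.warn( `[${process.pid}] OVERLAP: ${active} children running at once` );
    }

    let stdout = '';
    let settled = false;

    // 'error' and 'close' can both fire (e.g. when the command is not found), so settle only once.
    const settle = ( error, code ) => {
      if ( settled ) {
        return;
      }
      settled = true;
      active--;
      console.log( `[${process.pid}] end child=${child.pid} code=${code} active=${active}` );
      if ( error ) {
        reject( error );
      }
      else {
        resolve( stdout );
      }
    };

    child.stdout.setEncoding( 'utf8' );
    child.stdout.on( 'data', data => { stdout += data; } );
    child.on( 'error', error => settle( error, null ) );
    child.on( 'close', code => settle( code === 0 ? null : new Error( `${cmd} exited with code ${code}` ), code ) );
  } );
}
```

Awaiting two calls in sequence should leave maxActive at 1, and any OVERLAP warning means a call was not awaited. Note that a wrapper like this only sees processes it spawns directly, so a git process started inside a grunt child (the culprit here) would not show up in its logs; that is presumably where the system-level tools above were needed.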
Deployed, closing.
From https://github.com/phetsims/perennial/issues/283, @mattpen and I saw git lock problems around creating the README file. Note this problem will be rare after #283 is fixed, but we thought we should mention it.