phetsims / perennial

Maintenance tools that won't change with different versions of chipper checked out
MIT License

During maintenance process, fatal: Unable to create 'acid-base-solutions/.git/index.lock': File exists. #284

Closed samreid closed 1 year ago

samreid commented 1 year ago

From https://github.com/phetsims/perennial/issues/283, @mattpen and I saw git lock problems around creating the README file. Note this problem will be rare after #283 is fixed, but we thought we should mention it.

Deployed: https://phet.colorado.edu/sims/html/acid-base-solutions/latest/acid-base-solutions_en.html
Please wait for the build-server to complete the deployment, and then test!
After testing, let the simulation lead know it has been deployed, so they can edit metadata on the website
Updating master README
>> Detected failure during deploy, reverting to master
Maintenance task failed:
Error: Failure with production deploy for acid-base-solutions to 1.2: Error: git checkout master in ../acid-base-solutions failed with exit code 128
stderr:
fatal: Unable to create '/Users/samreid/apache-document-root/main/acid-base-solutions/.git/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

    at Function.deployProduction (/Users/samreid/apache-document-root/main/perennial/js/common/Maintenance.js:900:17)
Full Error details:
{}
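
The lock file named in the error can be inspected directly before deciding whether to remove it. A minimal sketch, assuming a plain clone where .git is a directory (not a worktree or submodule); the helper name indexLockAgeMs is hypothetical and not part of perennial:

```javascript
const fs = require( 'fs' );
const path = require( 'path' );

// Hypothetical helper: returns the age of <repoPath>/.git/index.lock in
// milliseconds, or null if no lock file exists.
function indexLockAgeMs( repoPath, now = Date.now() ) {
  const lockPath = path.join( repoPath, '.git', 'index.lock' );
  try {
    return now - fs.statSync( lockPath ).mtimeMs;
  }
  catch( e ) {
    if ( e.code === 'ENOENT' ) {
      return null; // no lock present
    }
    throw e; // anything else (e.g. permissions) is unexpected
  }
}
```

A lock that is only a few milliseconds old suggests a git process that is still running, while a lock that is minutes old is more likely left over from a crashed process. Either way, the age is only a hint and not proof.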

jonathanolson commented 1 year ago

Is this something where we were trying to concurrently run multiple copies of git?

samreid commented 1 year ago

I'm not sure, @mattpen is that possible?

mattpen commented 1 year ago

I think this process was running on your machine, Sam. Did you perhaps have a git dialog open in WebStorm while you were running the deploy? Or a separate terminal with a git operation in progress? This was not a problem with the server-side process.

samreid commented 1 year ago

I'm not aware of anything that was running concurrently, but maybe a process had failed silently somewhere? Want to close this issue and come back to it if it recurs?

mattpen commented 1 year ago

This seems difficult to reproduce and of low consequence if it recurs. Closing sounds appropriate.

marlitas commented 1 year ago

This happened again in Mean Share and Balance. https://github.com/phetsims/mean-share-and-balance/issues/127

mattpen commented 1 year ago

It looks like there are some race conditions for git operations in the grunt production task AFTER it makes the http call to the build-server. I took a quick glance at production.js, and my guess is that the first line in that snippet is returning a successful Promise before the git commands that run in generateREADME.js actually complete. Seems like it would be good to just import generateREADME() and call it directly rather than calling await execute( gruntCommand, [ 'published-README' ],..., but that would introduce a dependency on chipper in perennial, which is no good. I'm not sure if generateREADME.js could be moved to perennial, or if there is a better solution. As this problem is very transient, it will be difficult to even confirm that the problem I described is truly the root cause.

This is really outside of my wheelhouse; I own the build-server code in perennial, not the grunt tasks. production.js was written by @jonathanolson and generateREADME.js was written by @pixelzoom. Can either of you help?

pixelzoom commented 1 year ago

Sorry, this is also outside my wheelhouse. I wrote generateREADME.js 7+ years ago, as a standalone grunt task, and I haven't touched it since. And I have no familiarity with the current incarnation of the build-server, or how it uses generateREADME.js.

mattpen commented 1 year ago

> I have no familiarity with the current incarnation of the build-server, or how it uses generateREADME.js.

Just to clarify, the build-server does not use generateREADME.js; it is used by the grunt production task on the developer's machine.

jonathanolson commented 1 year ago

> Seems like it would be good to just import generateREADME() and call it directly rather than calling await execute( gruntCommand, [ 'published-README' ],..., but that would introduce a dependency on chipper in perennial which is no good.

We're potentially calling very old versions of chipper's published-README, so we can't import it from perennial.

jonathanolson commented 1 year ago

This is causing a lot of issues trying to push out production deployments, investigating.

jonathanolson commented 1 year ago

It looks like we're not missing an await; the execute calls are running serially for the entire duration.

jonathanolson commented 1 year ago

It looks like https://github.com/phetsims/chipper/commit/6325d0dfe256cf16dec2f5a8424dce436ad5b6e0 turned generateREADME() into an async function, but did NOT add awaits at two usages in chipper's gruntfile (grunt published-README was affected). This created a race condition whose outcome depended on whether the grunt command finished executing the git add before it exited. The grunt process itself would END (and so our await execute resolved), but a git command would still be running (specifically the git status --porcelain).

Most of the failures we were getting were in master (for the generation on master), but this will also affect release branch deployments for branches after that date. We'll need to patch this in.
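
The missing-await failure mode can be reproduced in isolation. This is a minimal sketch, not chipper's actual code: generateREADMESketch() is a hypothetical stand-in for the async function, and a timer stands in for the git command that is still holding the lock.

```javascript
// Hypothetical stand-in for an async function that runs a slow git command.
// The timer simulates the git command still being in progress.
async function generateREADMESketch( events ) {
  events.push( 'git started' );
  await new Promise( resolve => setTimeout( resolve, 20 ) );
  events.push( 'git finished' );
}

// Buggy task: the async call is not awaited, so the task reports completion
// while the simulated git command is still running.
async function taskWithoutAwait( events ) {
  generateREADMESketch( events ); // missing await
  events.push( 'task done (no await)' );
}

// Fixed task: completion is reported only after the git command finishes.
async function taskWithAwait( events ) {
  await generateREADMESketch( events );
  events.push( 'task done (await)' );
}

// Runs both variants and returns the order of events at the moment each
// task reported completion.
async function runDemo() {
  const buggyEvents = [];
  await taskWithoutAwait( buggyEvents );
  const buggyOrder = buggyEvents.slice(); // snapshot before the stray work ends

  const fixedEvents = [];
  await taskWithAwait( fixedEvents );
  return { buggyOrder, fixedOrder: fixedEvents.slice() };
}
```

In the buggy ordering the task reports completion before 'git finished' is ever recorded. That is the window in which a following git checkout could collide with the lock.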

jonathanolson commented 1 year ago

NOTES for future tracking down of file accesses and processes on macOS:

strace isn't available on macOS, but a combination of the following helped:

  1. Temporarily disable System Integrity Protection: boot into recovery mode (on an M1 Mac, hold down the power button on startup until the options appear), open Terminal, and run csrutil disable
  2. sudo newproc.d helped track down created processes (showed the parent process and PID)
  3. sudo opensnoop -a -c -g -s -t helped track down file accesses where PID was present (confirmed a PID that we didn't launch directly)
  4. fs_usage -w -f filesys helped track down the threads and processes that were hitting the files (but gave only thread IDs). It gives a more fine-grained view of the system calls.
  5. console.log( process.pid ) gave us process IDs to refer to from the other tools
  6. Tracking start/stop of things called by execute() to make sure there were no overlaps.
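
Items 5 and 6 can be combined into a small wrapper. This is a sketch under stated assumptions: createOverlapTracker and track are hypothetical names, and the wrapped function stands in for whatever execute() does in perennial.

```javascript
// Hypothetical helper: wraps async functions so that every call logs its
// start and stop (with this process's PID) and records the label of any call
// that starts while another tracked call is still running.
function createOverlapTracker( log = console.log ) {
  let active = 0;
  const overlaps = [];

  function track( label, asyncFunction ) {
    return async ( ...args ) => {
      if ( active > 0 ) {
        overlaps.push( label ); // started while another call was in flight
      }
      active++;
      log( `[pid ${process.pid}] start ${label}` );
      try {
        return await asyncFunction( ...args );
      }
      finally {
        active--;
        log( `[pid ${process.pid}] stop ${label}` );
      }
    };
  }

  return { track, overlaps };
}
```

This only sees overlaps between calls made from the one Node process. It would not notice a git process left running by a child that has already exited, which is why the OS-level tools above were still needed.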

jonathanolson commented 1 year ago

Patches applied above.

jonathanolson commented 1 year ago

Deployed, closing.