numberscope / backscope

Numberscope's back end: responsible for getting sequences and other data from the On-Line Encyclopedia of Integer Sequences, pre-processing it (factoring, etc.), and storing it.
MIT License

get_commit endpoint doesn't report actual code running if directory has changed since server start #108

Closed · gwhitney closed this issue 7 months ago

gwhitney commented 11 months ago

Although it passes the test we have implemented for it in development, when installed on the production machine and queried via an HTTP GET to https://numberscope.colorado.edu/api/get_commit, this endpoint generates a 500 Internal Server Error response. The information written out to api.log is included below. One guess is that, unlike in the development/testing environment, the production process is not running in the directory in which backscope was checked out. Or there may be some other obstruction, such as the production process running under a user id other than the scope user that owns the clone... hopefully the backtrace below will help in discerning what's going on. We may need to do some joint debugging if it is not clear.

```
Exception on /api/get_commit [GET]
Traceback (most recent call last):
  File "/home/scope/repos/backscope/.venv/lib/python3.9/site-packages/flask/app.py", line 1455, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/scope/repos/backscope/.venv/lib/python3.9/site-packages/flask/app.py", line 869, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/scope/repos/backscope/.venv/lib/python3.9/site-packages/flask/app.py", line 867, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/scope/repos/backscope/.venv/lib/python3.9/site-packages/flask/app.py", line 852, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/scope/repos/backscope/flaskr/nscope/views.py", line 324, in get_git_commit
    short_hash = subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], encoding='utf8')
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'rev-parse', '--short', 'HEAD']' returned non-zero exit status 128.
```
gwhitney commented 11 months ago

OK, the ProductionTesting branch including the 'pwd' endpoint runs in production and produces /home/scope/repos/backscope, so the difficulty is neither that the process is in the wrong directory nor that subprocess.check_output simply doesn't work. I will continue probing; thoughts of things to try are welcome. My next step will be to see if I can capture stderr and display that, too.
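A minimal sketch of capturing stderr in this way (an assumption about the probe, not the actual backscope code; the helper name `run_captured` is hypothetical): switching from `subprocess.check_output` to `subprocess.run` with `capture_output=True` lets the endpoint report git's own diagnostic instead of a bare `CalledProcessError` with exit status 128.

```python
import subprocess

def run_captured(cmd):
    """Run cmd; on failure, return the command's stderr instead of raising."""
    result = subprocess.run(cmd, capture_output=True, encoding='utf8')
    if result.returncode != 0:
        # Surface the diagnostic (e.g. "fatal: detected dubious ownership...")
        return {'CalledProcessError': True,
                'output': result.stderr,
                'returncode': result.returncode}
    return {'output': result.stdout.strip()}

# e.g. run_captured(['git', 'rev-parse', '--short', 'HEAD'])
```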

gwhitney commented 11 months ago

Aha, it is a user id difficulty. I was able to capture standard error, and now the get_commit endpoint in production is producing

```
CalledProcessError  true
output  "fatal: detected dubious ownership in repository at '/home/scope/repos/backscope'\nTo add an exception for this directory, call:\n\n\tgit config --global --add safe.directory /home/scope/repos/backscope\n"
returncode  128
```

So @Vectornaut @katestange thoughts? Should I do the `git config --global` thing it suggests? My impression is that this will allow any user to execute any git command in that directory; could that be a security risk of some sort? Should we instead try to arrange for the repos directory to be owned by the user id that the Green Unicorn process runs under (either by changing the ownership of that directory, or by changing which user id runs Green Unicorn)? Or should we try to empower the Green Unicorn user to su to the scope user?

Oh. Actually, it occurs to me that the implementation of get_commit is simply wrong; I don't know why I didn't see this before. If you start backscope and then update the code in the directory it was run from without restarting the server, the get_commit endpoint will happily report the new commit hash, even though it is actually running the prior code. So we need to change the code to capture the commit hash at startup time, keep it in memory, and report that in-memory value whenever asked. That change should also solve the original problem reported here.

So, my actual question is: thoughts on how to capture that information at startup time? Run `git rev-parse --short HEAD` in the startup script and write the result into the .env file? Or write that result into the database? Or write that result into one of the source files for backscope? (Weird concept, but I am brainstorming...) Or just write it into some conventionally-named file in the backscope repo directory, and then (critically) read that file only at startup time and keep the read value in memory? None of these is shouting to me that it's clearly the correct way to do it, so I will wait to implement anything until I hear some thoughts/feedback. Hopefully one of you will have some better idea...

I guess of the above four top-of-my-head ideas, the one that feels the most comfortable to me is the final one: grabbing the revision in the startup script and writing it into some file like '.commit' in the repo directory, then reading that file during initialization and remembering its contents. Even that sure ain't elegant, though...
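The fourth idea above could be sketched like this (an illustration, not an implemented design; the '.commit' filename and the `RUNNING_COMMIT` name are the hypothetical ones floated above). The startup script would have run `git rev-parse --short HEAD > .commit`, and the app reads that file exactly once at initialization, so later pulls into the working tree cannot change what gets reported:

```python
import pathlib

# Read the commit hash once, at module import / app initialization time.
_COMMIT_FILE = pathlib.Path('.commit')  # written by the startup script

try:
    RUNNING_COMMIT = _COMMIT_FILE.read_text(encoding='utf8').strip()
except FileNotFoundError:
    RUNNING_COMMIT = 'unknown'  # e.g. started outside the startup script

def get_commit():
    # Endpoint handler returns the cached value; it never re-runs git.
    return {'short_commit_hash': RUNNING_COMMIT}
```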

gwhitney commented 11 months ago

We discussed this at the recent meeting and proposed some possible methods for proceeding:

1) Switch to explicit versioning, with the version number in a version-controlled file, as per #99. While this makes reporting the version number easy, it makes the primary desired operation of checking out the exact code running on a server much less convenient. The consensus seemed to be not to rely on this method.

2) Record the hash to a (non-versioned) file every time HEAD of the clone changes. This would be ideal and convenient, but it appears per https://stackoverflow.com/questions/27073690/git-hook-for-whenever-head-changes that there is no non-hackish way to do this.

3) Grab the hash every time the software is invoked. This provides the desired behavior; its main drawback is that we will likely have to use different code to do this in development (a custom alternative to `flask run`, presumably) and in production (a modification to `production.sh`).

So it looks like we will have to settle on option (3) unless someone has a brainstorm. One mechanism discussed for accomplishing this option was to leverage python-dotenv to incorporate the hash into the values being grabbed from the environment. We could have a custom subcommand in `wsgi.py` to inject the hash into the environment (we would just have to make sure that runs before the dotenv module gets the values out of the environment), and `production.sh` could just be modified to put the hash into the environment before Green Unicorn is even invoked.
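A minimal sketch of that mechanism (the function and variable names here are assumptions, not the merged implementation): the hash is placed in the environment exactly once at startup, and in production a pre-existing value exported by `production.sh` takes precedence, so the dotenv-loaded configuration sees the same variable either way.

```python
import os
import subprocess

def inject_commit_hash(env_var='RUNNING_COMMIT'):
    """Put the current short commit hash into the environment, once,
    before dotenv reads configuration values out of it."""
    if env_var not in os.environ:  # production.sh may have exported it already
        result = subprocess.run(
            ['git', 'rev-parse', '--short', 'HEAD'],
            capture_output=True, encoding='utf8'
        )
        os.environ[env_var] = (
            result.stdout.strip() if result.returncode == 0 else 'unknown'
        )
    return os.environ[env_var]
```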

So I think that should be the plan. @katestange @Vectornaut any further ideas/amendments? Anyone able to take this up, please self-assign the issue (if I free up enough to do it, I'll be sure to assign myself). Thanks!

gwhitney commented 7 months ago

OK, attempting to implement option 3 via python-dotenv per the immediately previous post here.

gwhitney commented 7 months ago

The fix as merged actually caused production to crash, see #114. Reopening until hot fix is merged; however, this bug does appear to be fixed in what's already running in production.

gwhitney commented 7 months ago

OK, the hot fix is in main (i.e. #114 is merged), running in production, and appears to work. The next person to install something in production should make sure to hit the get_commit endpoint after the new code has been pulled but before numberscope has been restarted, to verify that the old commit is still reported, and then hit it again after the restart, to verify it advances. I'm closing this issue for now; it (or a new issue) can be (re-)opened if the recommended procedure shows any difficulties at the next install.