cisaacstern closed this 1 year ago
🧠 ⛈️ Brainstorm: Heroku Review Apps + Heroku container stack requires building Dockerfile on Heroku at each push. IIUC, there is no way to cache Docker layers for Heroku Review Apps (open to being shown some feature I'm missing, of course).
Building our current Dockerfile on Heroku seems to take about 10-13 minutes. When you add about 7 minutes for the subsequent Dataflow integration test (based on timing observed from working on https://github.com/pangeo-forge/pangeo-forge-runner/pull/52), that makes the full integration testing cycle 20 minutes. That's a long time for iterating a solution to something like #220! 🙀
What's more, Heroku Container Registry doesn't work with Review Apps:
As a workaround, perhaps we can maintain a `Dockerfile.heroku` which is just:

```Dockerfile
FROM pangeo/forge-orchestrator:${SOME_TAG}  # pushed from `Dockerfile.prebuild`
```

and have `heroku.yml` say:

```yaml
build:
  docker:
    web: Dockerfile.heroku
```

This would make our Review App "build" time just a matter of however long it takes to pull from Docker Hub (presumably much shorter than 10-13 minutes).
I wouldn't want this to introduce another layer of brittleness/toil, requiring manual builds which could fall out of sync with the PR. So this model could follow the example of https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/main/.github/workflows/push-dataflow-image.yaml and automatically build-and-push images when PRs change the Dockerfile.
Good progress here! What works:

- Adding the `build-review-app` label to this PR initiates a build of the Heroku Review App
- Removing the `build-review-app` label deletes the Heroku Review App

What needs to be fixed:

- `head.sha`
- The new `Dockerfile.heroku` definitely seems to speed up Review App builds, but the image it is pulling from Docker Hub was pushed there manually. Add a workflow that automates this to reduce toil/brittleness.
- Terraform issue is fixed (big picture, we probably want to stop running terraform on release, but that's out of scope for this PR).
Now app is not releasing because there's no secrets config for it:
```
2023-01-25T18:26:03.272684+00:00 app[web.1]: PANGEO_FORGE_DEPLOYMENT set to pforge-pr-226
2023-01-25T18:26:03.542892+00:00 app[web.1]: Error: cannot operate on non-existent file
2023-01-25T18:26:03.663857+00:00 heroku[web.1]: Process exited with status 100
2023-01-25T18:26:03.712066+00:00 heroku[web.1]: State changed from starting to crashed
2023-01-25T18:26:29.677578+00:00 heroku[web.1]: State changed from crashed to down
```
In the current developer docs, a toilsome manual process is outlined for adding this config oneself. I now need to figure out a way to inject that config automatically, removing the burden from the developer (it's just too much to mess with for every PR).
Currently the assumption is that the config will always exist as an encrypted file (or set of files) in the repo from which the application is deployed. In #204 the organization of those files changes but the basic assumption holds that some files will exist in the repo which define configuration and secrets for the deployment. Some thoughts:
This option of cycling a defined set of credentials might have more merit if it did not also require re-configuring the corresponding GitHub App installation. Here is the set of apps currently installed in the `pforgetest` organization:
For this 'cycling credentials' approach to work, after a set of creds is assigned to a newly opened PR, the corresponding GitHub App on `pforgetest` would also need to have its webhook URL programmatically updated to reflect the URL the new PR's Review App will be deployed to.
When I first started writing this comment, I expected that I would conclude that cycling credentials would be the easiest thing to do. Now that I've written this out, I think I've convinced myself that it's actually easier to forward-proxy all webhooks destined for Review Apps on `pforgetest` through a single GitHub App. This would mean:

- A single GitHub App installed in `pforgetest` (or maybe two, if we separate out the staging API from Review Apps).
- A repository dispatch workflow in `pangeo-forge-orchestrator`.

Ok I'm going to prototype this repository dispatch workflow and see if I've overlooked anything, or if this will just work.
> Ok I'm going to prototype this repository dispatch workflow and see if I've overlooked anything, or if this will just work.
Already hit a road block with repository dispatch. I'd forgotten that these payloads require special formatting. That won't work because we just want to forward anything GitHub gives us, as-is. And it won't come wrapped in a `client_payload` object.
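For reference, this is roughly the wrapping the repository dispatch REST endpoint forces on a payload (the `event_type` name here is a hypothetical, not anything from this repo). The receiving workflow then only sees the original webhook body under `github.event.client_payload`, not as a raw delivery:

```python
import json

def wrap_for_repository_dispatch(webhook_body: dict) -> str:
    """Sketch: repository_dispatch requires an 'event_type' key plus a
    'client_payload' wrapper around whatever we actually want to forward."""
    return json.dumps({
        "event_type": "forwarded-webhook",  # hypothetical event name
        "client_payload": webhook_body,
    })
```

Which is exactly the problem: the original delivery's headers and signature have no place in this envelope.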
On to another plan: deploy the webhook forwarding service as a GCP Cloud Function. This does feel like a bit of a heavier lift, but ultimately I do think it's the best path forward to reduce developer toil. (And ensure Review Apps + associated integration tests actually get used during the development cycle, instead of skipped over for being too tedious to set up.)
Edit: Looks like handling GitHub Webhooks is actually the first promoted use case in the GCP Cloud Functions docs. 😆
> On to another plan: deploy the webhook forwarding service as a GCP Cloud Function.
This work is now underway in https://github.com/pangeo-forge/github-webhook-forwarder/pull/1
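The core of a forwarder like that can be quite small. A minimal sketch of the idea (function and header names are assumptions, not the actual github-webhook-forwarder implementation): pass the body through byte-for-byte and carry over only the GitHub delivery headers, so signatures and event types survive the hop.

```python
from urllib.request import Request

# Headers the target app needs to authenticate and route the delivery
FORWARD_HEADERS = (
    "X-GitHub-Event",
    "X-GitHub-Delivery",
    "X-Hub-Signature-256",
    "Content-Type",
)

def build_forward_request(target_host: str, path: str, headers: dict, body: bytes) -> Request:
    """Re-POST an incoming webhook, unchanged, to the target deployment."""
    fwd_headers = {k: v for k, v in headers.items() if k in FORWARD_HEADERS}
    return Request(f"https://{target_host}{path}", data=body, headers=fwd_headers, method="POST")
```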
> The new `Dockerfile.heroku` definitely seems to speed up Review App builds, but the image it is pulling from Docker Hub was pushed there manually. Add a workflow that automates this to reduce toil/brittleness.
I'm now realizing that the current Dockerfile fix is not viable as-is, because any changes to app code are not available without a rebuilt container. I think we can save most of the build speedup benefit by pulling an image that includes all of the dependencies pre-installed, and then at Heroku build time, just copying the project directory contents into that prebuilt image. This way, full rebuilds will be required for changes to dependencies (including `requirements.txt` pip deps), but changes to only code or config that don't change the environment can be fast-tracked by pulling a pre-built image.
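A sketch of what that two-stage arrangement might look like (the image tag and paths here are assumptions, for illustration only):

```Dockerfile
# Dockerfile.heroku: start from a prebuilt image with all deps already installed
FROM pangeo/forge-orchestrator:deps-only  # hypothetical tag, pushed by CI

# At Heroku build time, copy only the (possibly changed) app code into the image;
# dependency layers are reused from the pulled image
COPY . /opt/app
WORKDIR /opt/app
```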
🤔 note to self: `postdeploy` script didn't seem to be running with changes in https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/226/commits/18bca7347dfba3af474ce2eb8649a1dbe9dcc1d6? Or maybe I didn't wait long enough? Just called it manually from `heroku run /bin/bash -a pforge-pr-226`.
> `postdeploy` script didn't seem to be running
Oh, this might be because it's only run on the first release following review app creation, and not on rebuilds.
🎉 It works!
- A `dev-app-proxy` GitHub App is installed in the `pforgetest` test org.
- The webhook URL of the `dev-app-proxy` GitHub App is set to https://github-webhook-forwarder.herokuapp.com
- A `fwd:pforge-pr-226.herokuapp.com/github/hooks/` label on https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3 tells `dev-app-proxy` where to forward webhooks originating from that PR.
- `dev-app-proxy` successfully forwarded to the Heroku Review App for this PR:
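The `fwd:` label convention above might be parsed along these lines (a sketch, assuming the label text is exactly `fwd:{host}{path}`; the real proxy's parsing may differ):

```python
from typing import Optional

def forward_target_from_labels(labels: list) -> Optional[str]:
    """Find a 'fwd:{host}{path}' label among a PR's labels and return the
    forwarding target, or None if no such label exists."""
    for label in labels:
        if label.startswith("fwd:"):
            return label[len("fwd:"):]
    return None
```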
The one thing I'm seeing that needs fixing so far is that the logic for customizing the `orchestratorEndpoint` query param needs to be refined to understand the difference between the `dev-app-proxy` webhook URL and the Heroku Review App URL:
The next step is to try to replicate the problem in #220 by deploying a test run from the `pforgetest` PR. To do that, I'll first need to make https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 aware of `issue_comment` events...
> need to make https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 aware of `issue_comment` events...
Done!
Now hitting

```
2023-01-28T01:32:13.917169+00:00 app[web.1]: ValueError: len(job_name) = 83 exceeds max dataflow job name len of 64 chars.
```
Turns out that was a helpful error to put there! 😆
So this is actually a version of the same problem as mentioned above:
> the logic for customizing the `orchestratorEndpoint` query param needs to be refined to understand the difference between the `dev-app-proxy` webhook URL and the Heroku Review App URL
🤔 So we need some other way of knowing where we are currently deployed that is not based on the webhook URL configured on the GitHub App (which, for Heroku Review Apps, is going to be the proxy URL, not the actual deployment). Heroku injects a `HEROKU_APP_NAME` variable, which we can use for this. I'll fix that here next.
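The fix can be as simple as deriving the public URL from that injected variable (a minimal sketch, assuming the standard `{app_name}.herokuapp.com` hostname scheme):

```python
import os

def get_deployment_url() -> str:
    """Derive this deployment's own public URL from Heroku-injected env,
    rather than from the GitHub App webhook URL (which, for Review Apps,
    points at the proxy, not at this deployment)."""
    app_name = os.environ["HEROKU_APP_NAME"]  # e.g. "pforge-pr-226"
    return f"https://{app_name}.herokuapp.com"
```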
🎊 Successful deployment of a job to Dataflow from the Heroku Review App for this PR!
Originating from https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3#issuecomment-1407245080 and routed through the proxy service defined in https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 and deployed to https://github-webhook-forwarder.herokuapp.com.
And successful completion of the job!
Including a comment reporting the success https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3#issuecomment-1407259639.
Next steps:

- Open a `[ci][don't-merge]` PR on, for starters, https://github.com/pforgetest/gpcp-from-gcs-feedstock. This can be the first stage of the integration test for which this PR is named. The test can be triggered when the `deployment_status` event comes back with `state='success'`. If for some reason we want to limit the number of Dataflow jobs submitted, we can add an additional requirement that a `test-dataflow` and/or `no-test-dataflow` label exist on the PR.
- The `gpcp-from-gcs` run above thickens the plot here! Apparently there may be something specific to those recipes which is making them fail to deploy. We can copy-and-paste them into a feedstock on `pforgetest`, and then test them as we've done with `gpcp-from-gcs`.

It's all coming together! Really happy with how this test infrastructure is shaping up, and very curious to leverage it to finally figure out why the recipes discussed in #220 failed to deploy.
The workflow `test-dataflow-integration.yaml` now successfully makes it to the Run test step (while also providing that step with the appropriate env), in response to both of the event paths that should trigger the workflow:

1. A `deployment_status` event occurs with `state == 'success'` (and the `test-dataflow` label already exists on the PR)
2. A `test-dataflow` label is added to a PR which already has a successful deployment associated with the head sha¹

¹ This took a little while to get right because in each case, we need to traverse back into the GitHub API to get the full picture (for `deployment_status`, we need to query PR labels; and for the `pull_request` labeled event, we need to query deployment status).
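Once the missing half of the picture has been fetched from the API, both trigger paths reduce to the same predicate. Roughly (a sketch, not the actual workflow logic):

```python
def should_dispatch_test(deployment_state, pr_labels) -> bool:
    """Dispatch the integration test only when both facts line up.
    For a deployment_status event, pr_labels must be queried from the PR;
    for a pull_request 'labeled' event, deployment_state must be queried
    for the head sha. Either way, the condition is the same."""
    return deployment_state == "success" and "test-dataflow" in pr_labels
```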
Now that the trigger is in place, the next step is to actually dispatch the test (instead of the placeholder `echo foo`). This is the right place to dispatch to pytest, because we now know that the review app is ready to receive requests.
Thinking about what the input for the integration test should be. I believe what makes the most sense is:
- `REVIEW_APP_URL`: the deployment to test against (this is already part of the latest commits)
- `SOURCE_BRANCH`: a branch to duplicate as a PR to `pforgetest/test-staged-recipes`
By making these env vars (rather than hardcoding them into the test fixtures), we have the option of manually running the test on an ad hoc basis against some troublesome PR or another. This would greatly help speed up the debugging cycle when some specific PR(s) start having issues (a very common type of debugging required on Pangeo Forge Cloud). Hardcoding these specific PRs into the test fixtures is less desirable both because it requires committing new code to the repo to run the test, and equally so because typically these problem PRs will be pointing at "real" data sources, which in general we do not want to be requesting data from as a regular practice in CI.
Edit, notes on above:
> `SOURCE_BRANCH`: a branch to duplicate as a PR to `pforgetest/test-staged-recipes`

Presumably we'll need conditional logic to handle the case of PRs to `pforgetest/*-feedstock` repos. But `test-staged-recipes` is a good place to start.
> manually running the test on an ad hoc basis

This can be done via workflow dispatch.
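A `workflow_dispatch` block along these lines would enable those ad hoc runs (a sketch; input names simply mirror the env vars above and are not the committed workflow):

```yaml
on:
  workflow_dispatch:
    inputs:
      review_app_url:
        description: "Deployment to test against"
        required: true
      source_branch:
        description: "Branch to duplicate as a PR to pforgetest/test-staged-recipes"
        required: true
```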
Next steps:
`tests.integration/test_dataflow.py` is nearly complete. As a gut check, it's probably worth running this test manually from my local machine next, to see if the correct state is created on GitHub. There could easily be some small typos in this code, and it will be easier to debug manually than here in CI.

The last commit hides our token in the logs in the case of test failures, e.g.:
```
―――――――――――――――――――――――――――――――――――――――――――――― ERROR at teardown of test_dataflow ――――――――――――――――――――――――――――――――――――――――――――――
gh = <gidgethub.aiohttp.GitHubAPI object at 0x113279940>, gh_token = SecretStr('**********')
gh_kws = {'accept': 'application/vnd.github+json'}, gh_workflow_run_id = '12345'
source_pr = {'pr_number': '22', 'repo_full_name': 'pforgetest/test-staged-recipes'}, base = 'pforgetest/test-staged-recipes'
pr_label = 'fwd:pforge-pr-226.herokuapp.com/github/hooks/'
```
I've now called `tests.integration/test_dataflow.py` manually with:
```shell
$ GH_WORKFLOW_RUN_ID=12345 \
  SOURCE_REPO_FULL_NAME=pforgetest/test-staged-recipes \
  SOURCE_REPO_PR_NUMBER=22 \
  REVIEW_APP_URL=pforge-pr-226.herokuapp.com \
  pytest -vx tests.integration/test_dataflow.py
```
which has successfully created (and subsequently deleted) a test PR:
https://github.com/pforgetest/test-staged-recipes/pull/28
An interesting finding from this experiment:

- The `pull_request.opened` webhook is not forwarded to the specified review app by the `dev-app-proxy` GitHub App, because the PR does not have any labels at open-time.
- For the `synchronize` task to run, then, we'll need to make another commit to the PR after it is opened and labeled.

As shown in the conversation thread on https://github.com/pforgetest/test-staged-recipes/pull/33:
A `/run` comment on the test PR yields:

```
2023-02-08T23:56:27.881666+00:00 app[web.1]: File "/opt/app/pangeo_forge_orchestrator/routers/github_app.py", line 743, in run
2023-02-08T23:56:27.881666+00:00 app[web.1]: message = json.loads(recipe_run.message)
2023-02-08T23:56:27.881666+00:00 app[web.1]: File "/usr/lib/python3.9/json/__init__.py", line 339, in loads
2023-02-08T23:56:27.881666+00:00 app[web.1]: raise TypeError(f'the JSON object must be str, bytes or bytearray, '
2023-02-08T23:56:27.881667+00:00 app[web.1]: TypeError: the JSON object must be str, bytes or bytearray, not NoneType
```
I've corrected what appears to be the problem reflected in that trace in the last commit. Not sure if this is also the problem affecting the blocked PRs in #220, but it's not impossible. (When I get to the point of testing against those PRs, I can temporarily revert this change, to see if it is indeed what makes the difference.)
I'm also quite confused as to why, higher up in this trace, we're seeing a `repo2docker/utils.py` error related to one of the lines in the Dockerfile in this repo. I guess when pangeo-forge-runner is initialized from the root of this repo, it is seeing the Dockerfile and trying to... build something from it? 🤷

I'll remove this recipe_run.message issue from consideration first, and then see if this repo2docker thing persists. Actually, this does seem like it may be significant, because I recall being quite confused when looking at other traces from #220 by Docker-related errors referencing lines from our Dockerfile.
> I'll remove this recipe_run.message issue from consideration first, and then see if this repo2docker thing persists.
I've just manually run this test against the review app built from 75a6d02, which resolved the recipe_run.message issue. The traceback from the `/run` command issued on https://github.com/pforgetest/test-staged-recipes/pull/34 (the automated PR made by my latest manual test run) confirms that the remaining issue is repo2docker-related:
I think what's happening here is that when the subprocess call is made to `pangeo-forge-runner`, the presence of the Dockerfile for our web app in the working directory is throwing off some inner layer of initialization of `pangeo-forge-runner`... this is definitely unexpected and (to me) a highly non-obvious failure mode! I will try removing the Dockerfile from the production container (it's not needed there) and see if that resolves the problem.
Edit: Above theory is totally wrong 😆. I now see the `git reset` command which failed is the one we expect this call to make, which is for resetting the clone of the feedstock repo. So this is not, in fact, some totally non-obvious code path, but actually just a failure of the obvious code path, which is attempting to clone the repo we've asked for. Will try to debug that now...
> Will try to debug that now...
Initial notes on this. Inside a locally running instance of the app container, `git reset` can be called without error both from the bash console and from a Python `subprocess`:
```
root@1f008a222d9f:/opt/app/test-staged-recipes# git reset --hard 246a00f8f0939883c7919a9a9e17880888bf6087
HEAD is now at 246a00f Create meta.yaml
root@1f008a222d9f:/opt/app/test-staged-recipes# python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_output("git reset --hard 246a00f8f0939883c7919a9a9e17880888bf6087".split())
b'HEAD is now at 246a00f Create meta.yaml\n'
>>>
```
so this would not appear to be an issue with a broken `git` installation.
Edit: Confirmed with `heroku run /bin/bash -a pforge-pr-226` that these same commands can run inside the deployed review app container (as opposed to just a local build).
Update: disregard git confusion above! With only a minor modification, the test as written can actually submit a job to dataflow, as observed in my latest manual run, which deployed a job to dataflow from https://github.com/pforgetest/test-staged-recipes/pull/35.
The only change needed (which I have not yet pushed here) is to `time.sleep(60)` after making the `/run` comment. Previously, the fixture cleanup was deleting the test branch before the job could be deployed.
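The ordering issue can be sketched like this (helper names are hypothetical, not the fixture's actual API; the point is only that teardown must not race the job submission):

```python
import time

def run_with_pr(open_pr, make_run_comment, delete_branch, wait_secs=60):
    """Sketch of the fixture lifecycle: open the test PR, make the /run
    comment, then wait before teardown so Dataflow has time to pick up the
    job before the test branch vanishes."""
    pr = open_pr()
    try:
        make_run_comment(pr)
        time.sleep(wait_secs)  # without this, cleanup deleted the branch too early
    finally:
        delete_branch(pr)
```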
Next steps:
Right now the failure point on the test is related to getting the GitHub credentials decrypted with SOPS. So far I've been doing it this way because it allows us to avoid duplicating (semi-verbose) code used in the FastAPI app for authenticating as a GitHub App.
The more I think about it though, relying on that app itself in its own test, even if the part of the app I'm relying on is not specifically what's being tested, smells like a circularity. So rather than trying to solve this SOPS thing, I'm just going to bite the bullet and duplicate some code into the test, thereby dropping this circularity concern.
The last commit was the first to successfully create a PR, here: https://github.com/pforgetest/test-staged-recipes/pull/39
But for some reason the app container restarted after the synchronization task began, leaving it unfinished:
```
2023-02-10T18:41:02.758459+00:00 app[web.1]: 2023-02-10 18:41:02,758 INFO - orchestrator - Synchronizing https://github.com/pforgetest/test-staged-recipes at 0299239b5b30e03fbc11f7572c31ce0d3963242a.
2023-02-10T18:41:02.760749+00:00 heroku[router]: at=info method=POST path="/github/hooks/" host=pforge-pr-226.herokuapp.com request_id=f136954d-0bf9-4921-a2d0-8b9fc78ca1db fwd="52.90.187.124" dyno=web.1 connect=0ms service=204ms status=202 bytes=412 protocol=https
2023-02-10T18:41:03.330084+00:00 app[api]: Release v22 created by user charlesisaacstern@gmail.com
2023-02-10T18:41:03.567705+00:00 heroku[web.1]: Restarting
2023-02-10T18:41:03.571439+00:00 heroku[web.1]: State changed from up to starting
2023-02-10T18:41:04.315253+00:00 heroku[web.1]: Stopping all processes with SIGTERM
```
This looks like we're not waiting quite long enough in the `Is app up?` step of the workflow.
Seems like the failure on the last test run may have just been a fluke, as the latest push resulted in a successful job deployment, from the automated PR https://github.com/pforgetest/test-staged-recipes/pull/40
So the next thing is to build some awareness of the job completion state into the test.
🎉 Calling this test locally now succeeds, with the following command
```shell
$ DEV_APP_PROXY_GITHUB_APP_PRIVATE_KEY='****' \
  GH_WORKFLOW_RUN_ID=12345 \
  SOURCE_REPO_FULL_NAME=pforgetest/test-staged-recipes \
  SOURCE_REPO_PR_NUMBER=22 \
  REVIEW_APP_URL=https://pforge-pr-226.herokuapp.com/ \
  pytest -vxs tests.integration/test_dataflow.py
```
outputting
```
[...pytest boilerplate clipped...]
collecting ...
Yielding completed_pr['head']['sha'] = '6589c0c167d808bc0eba3ad1242d919a800cb6ab' from recipes_pr fixture...
Querying review app database for recipe run id...
Found matching recipe run in review app database with recipe_run_id = 23...
Making comment on test PR with comment_body = '/run gpcp-from-gcs'
Polling review app for dataflow job submission status...
Confirmed dataflow job submitted with job_id = '2023-02-10_16_45_38-13412192724389530754'
Waiting for 5 mins, starting at utc_time = '2023-02-11 00:45:55'
Time elapsed = 300.0161998271942
Current state = 'Running'
Time elapsed = 332.52712869644165
Current state = 'Done'
tests.integration/test_dataflow.py::test_dataflow ✓ 100% ██████████
```
And here is the pr that was created: https://github.com/pforgetest/test-staged-recipes/pull/58
Next steps:
Parametrize source files for test PR and see if we can reproduce errors discussed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220
This is almost complete. Just making a few notes on it here so I remember where to pick up with it tomorrow (which can also be useful for forthcoming documentation).
- The Dataflow integration test now assumes all source PRs are active against `pforgetest/test-staged-recipes`.
- The test uses a `{pr_number}::{recipe_id}` syntax to figure out which PR numbers on `pforgetest/test-staged-recipes` to replicate for the test, and within that replicated PR, which specific recipe to run.
- By default, the test runs against `22::gpcp-from-gcs`.
- Additional test targets can be specified by adding a label to the orchestrator PR (e.g. this current PR) in the format `also-test:{pr_number}::{recipe_id}`.
Currently the jobs are being spawned correctly, but there is a crash occurring where they try to use the same ref name to make their duplicate PRs, causing the test session which gets to the PR creation step second to error out with `Reference already exists`. This is because the duplicated PRs attempt to use the workflow run number for the ref name, so I just need to add one more layer of specificity here, tied to the job name somehow.
One last thought: it's possible certain PRs will require a longer Dataflow timeout, which, if so, I can add as part of the label syntax, making it, e.g., `{pr_number}::{recipe_id}::{timeout}s`.
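The full target syntax, with the optional timeout suffix, might parse like this (a sketch; the regex and return shape are assumptions):

```python
import re

def parse_target(spec: str):
    """Parse '{pr_number}::{recipe_id}' with an optional '::{timeout}s' suffix,
    e.g. '22::gpcp-from-gcs' or '22::gpcp-from-gcs::900s'.
    Returns (pr_number, recipe_id, timeout_seconds_or_None)."""
    m = re.fullmatch(r"(\d+)::([^:]+?)(?:::(\d+)s)?", spec)
    if m is None:
        raise ValueError(f"invalid target spec: {spec!r}")
    pr_number, recipe_id, timeout = m.groups()
    return int(pr_number), recipe_id, (int(timeout) if timeout else None)
```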
While not every single task listed above is complete here, I'm going to merge this now, because the core tests all pass, and also because some enhancements made here to review app builds will be very useful for starting to work on #233.
Towards debugging #220. Closes #225 when complete.
As outlined in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/225#issuecomment-1401212954, the first step towards this testing is gaining manual (PR-label-based) control over when Heroku Review Apps are built. To recap, without this we would either: