cisaacstern closed this 1 year ago
🧠 ⛈️ Brainstorm: Heroku Review Apps + Heroku container stack requires building Dockerfile on Heroku at each push. IIUC, there is no way to cache Docker layers for Heroku Review Apps (open to being shown some feature I'm missing, of course).
Building our current Dockerfile on Heroku seems to take about 10-13 minutes. When you add about 7 minutes for the subsequent Dataflow integration test (based on timing observed from working on https://github.com/pangeo-forge/pangeo-forge-runner/pull/52), that makes the full integration testing cycle 20 minutes. That's a long time for iterating a solution to something like #220! 🙀
What's more, Heroku Container Registry doesn't work with Review Apps:
As a workaround, perhaps we can maintain a `Dockerfile.heroku` which is just:

```Dockerfile
FROM pangeo/forge-orchestrator:${SOME_TAG}  # pushed from `Dockerfile.prebuild`
```

and have `heroku.yml` say:

```yaml
build:
  docker:
    web: Dockerfile.heroku
```

This would make our Review App "build" time just a matter of however long it takes to pull from Docker Hub (presumably much shorter than 10-13 minutes).
I wouldn't want this to introduce another layer of brittleness/toil, requiring manual builds which could fall out of sync with the PR. So this model could follow the example of https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/main/.github/workflows/push-dataflow-image.yaml and automatically build-and-push images when PRs change the Dockerfile.
Good progress here! What works:

- Adding the `build-review-app` label to this PR initiates a build of the Heroku Review App
- Removing the `build-review-app` label deletes the Heroku Review App

What needs to be fixed:

- `head.sha`
- The new `Dockerfile.heroku` definitely seems to speed up Review App builds, but the image it is pulling from Docker Hub was pushed there manually. Add a workflow that automates this to reduce toil/brittleness.
- Terraform issue is fixed (big picture, we probably want to stop running terraform on release, but that's out of scope for this PR).
Now app is not releasing because there's no secrets config for it:
```
2023-01-25T18:26:03.272684+00:00 app[web.1]: PANGEO_FORGE_DEPLOYMENT set to pforge-pr-226
2023-01-25T18:26:03.542892+00:00 app[web.1]: Error: cannot operate on non-existent file
2023-01-25T18:26:03.663857+00:00 heroku[web.1]: Process exited with status 100
2023-01-25T18:26:03.712066+00:00 heroku[web.1]: State changed from starting to crashed
2023-01-25T18:26:29.677578+00:00 heroku[web.1]: State changed from crashed to down
```
In the current developer docs, a toilsome manual process is outlined for adding this config oneself. I now need to figure out a way to inject that config automatically, removing the burden from the developer (it's just too much to mess with for every PR).
Currently the assumption is that the config will always exist as an encrypted file (or set of files) in the repo from which the application is deployed. In #204 the organization of those files changes but the basic assumption holds that some files will exist in the repo which define configuration and secrets for the deployment. Some thoughts:
This option of cycling a defined set of credentials might have more merit if it did not also require re-configuring the corresponding GitHub App installation. Here is the set of apps currently installed in the `pforgetest` organization:
For this 'cycling credentials' approach to work, after a set of creds is assigned to a newly opened PR, the corresponding GitHub App on `pforgetest` would also need to have its webhook URL programmatically updated to reflect the URL the new PR's Review App will be deployed to.
When I first started writing this comment, I expected that I would conclude that cycling credentials would be the easiest thing to do. Now that I've written this out, I think I've convinced myself that it's actually easier to forward-proxy all webhooks destined for Review Apps on `pforgetest` through a single GitHub App. This would mean:

- A single GitHub App installed in `pforgetest` (or maybe two, if we separate out the staging API from Review Apps).
- A repository dispatch workflow in `pangeo-forge-orchestrator`.

Ok I'm going to prototype this repository dispatch workflow and see if I've overlooked anything, or if this will just work.
> Ok I'm going to prototype this repository dispatch workflow and see if I've overlooked anything, or if this will just work.
Already hit a road block with repository dispatch. I'd forgotten that these payloads require special formatting. That won't work because we just want to forward anything GitHub gives us, as-is. And it won't come wrapped in a `client_payload` object.
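For reference, this is roughly the wrapping the repository dispatch REST endpoint forces on a payload (the `event_type` name here is a hypothetical, not anything from this repo). The receiving workflow then only sees the original webhook body under `github.event.client_payload`, not as a raw delivery:

```python
import json

def wrap_for_repository_dispatch(webhook_body: dict) -> str:
    """Sketch: repository_dispatch requires an 'event_type' key plus a
    'client_payload' wrapper around whatever we actually want to forward."""
    return json.dumps({
        "event_type": "forwarded-webhook",  # hypothetical event name
        "client_payload": webhook_body,
    })
```

Which is exactly the problem: the original delivery's headers and signature have no place in this envelope.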
On to another plan: deploy the webhook forwarding service as a GCP Cloud Function. This does feel like a bit of a heavier lift, but ultimately I do think it's the best path forward to reduce developer toil. (And ensure Review Apps + associated integration tests actually get used during the development cycle, instead of skipped over for being too tedious to set up.)
Edit: Looks like handling GitHub Webhooks is actually the first promoted use case in the GCP Cloud Functions docs. 😆
> On to another plan: deploy the webhook forwarding service as a GCP Cloud Function.
This work is now underway in https://github.com/pangeo-forge/github-webhook-forwarder/pull/1
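The core of a forwarder like that can be quite small. A minimal sketch of the idea (function and header names are assumptions, not the actual github-webhook-forwarder implementation): pass the body through byte-for-byte and carry over only the GitHub delivery headers, so signatures and event types survive the hop.

```python
from urllib.request import Request

# Headers the target app needs to authenticate and route the delivery
FORWARD_HEADERS = (
    "X-GitHub-Event",
    "X-GitHub-Delivery",
    "X-Hub-Signature-256",
    "Content-Type",
)

def build_forward_request(target_host: str, path: str, headers: dict, body: bytes) -> Request:
    """Re-POST an incoming webhook, unchanged, to the target deployment."""
    fwd_headers = {k: v for k, v in headers.items() if k in FORWARD_HEADERS}
    return Request(f"https://{target_host}{path}", data=body, headers=fwd_headers, method="POST")
```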
> The new `Dockerfile.heroku` definitely seems to speed up Review App builds, but the image it is pulling from Docker Hub was pushed there manually. Add a workflow that automates this to reduce toil/brittleness.
I'm now realizing that the current Dockerfile fix is not viable as-is, because any changes to app code are not available without a rebuilt container. I think we can save most of the build speedup benefit by pulling an image that includes all of the dependencies pre-installed, and then at Heroku build time, just copying the project directory contents into that prebuilt image. This way, full rebuilds will be required for changes to dependencies (including `requirements.txt` pip deps), but changes to only code or config that don't change the environment can be fast-tracked by pulling a pre-built image.
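A sketch of what that two-stage arrangement might look like (the image tag and paths here are assumptions, for illustration only):

```Dockerfile
# Dockerfile.heroku: start from a prebuilt image with all deps already installed
FROM pangeo/forge-orchestrator:deps-only  # hypothetical tag, pushed by CI

# At Heroku build time, copy only the (possibly changed) app code into the image;
# dependency layers are reused from the pulled image
COPY . /opt/app
WORKDIR /opt/app
```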
🤔 note to self: `postdeploy` script didn't seem to be running with changes in https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/226/commits/18bca7347dfba3af474ce2eb8649a1dbe9dcc1d6? Or maybe I didn't wait long enough? Just called it manually from `heroku run /bin/bash -a pforge-pr-226`.
> `postdeploy` script didn't seem to be running
Oh, this might be because it's only run on the first release following review app creation, and not on rebuilds.
🎉 It works!
- A `dev-app-proxy` GitHub App is installed in the `pforgetest` test org.
- The webhook URL of the `dev-app-proxy` GitHub App is set to https://github-webhook-forwarder.herokuapp.com
- A `fwd:pforge-pr-226.herokuapp.com/github/hooks/` label on https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3 tells `dev-app-proxy` where to forward webhooks originating from that PR.
- `dev-app-proxy` successfully forwarded to the Heroku Review App for this PR:
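The `fwd:` label convention above might be parsed along these lines (a sketch, assuming the label text is exactly `fwd:{host}{path}`; the real proxy's parsing may differ):

```python
from typing import Optional

def forward_target_from_labels(labels: list) -> Optional[str]:
    """Find a 'fwd:{host}{path}' label among a PR's labels and return the
    forwarding target, or None if no such label exists."""
    for label in labels:
        if label.startswith("fwd:"):
            return label[len("fwd:"):]
    return None
```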
The one thing I'm seeing that needs fixing so far is that the logic for customizing the `orchestratorEndpoint` query param needs to be refined to understand the difference between the `dev-app-proxy` webhook URL and the Heroku Review App URL:
The next step is to try to replicate the problem in #220 by deploying a test run from the `pforgetest` PR. To do that, I'll first need to make https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 aware of `issue_comment` events...
> need to make https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 aware of `issue_comment` events...
Done!
Now hitting

```
2023-01-28T01:32:13.917169+00:00 app[web.1]: ValueError: len(job_name) = 83 exceeds max dataflow job name len of 64 chars.
```
Turns out that was a helpful error to put there! 😆
So this is actually a version of the same problem as mentioned above:
> the logic for customizing the `orchestratorEndpoint` query param needs to be refined to understand the difference between the `dev-app-proxy` webhook URL and the Heroku Review App URL
🤔 So we need some other way of knowing where we are currently deployed that is not based on the webhook URL configured on the GitHub App (which, for Heroku Review Apps, is going to be the proxy URL, not the actual deployment). Heroku injects a `HEROKU_APP_NAME` variable, which we can use for this. I'll fix that here next.
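The fix can be as simple as deriving the public URL from that injected variable (a minimal sketch, assuming the standard `{app_name}.herokuapp.com` hostname scheme):

```python
import os

def get_deployment_url() -> str:
    """Derive this deployment's own public URL from Heroku-injected env,
    rather than from the GitHub App webhook URL (which, for Review Apps,
    points at the proxy, not at this deployment)."""
    app_name = os.environ["HEROKU_APP_NAME"]  # e.g. "pforge-pr-226"
    return f"https://{app_name}.herokuapp.com"
```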
🎊 Successful deployment of a job to Dataflow from the Heroku Review App for this PR!
Originating from https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3#issuecomment-1407245080 and routed through the proxy service defined in https://github.com/pangeo-forge/github-webhook-forwarder/pull/1 and deployed to https://github-webhook-forwarder.herokuapp.com.
And successful completion of the job!
Including a comment reporting the success https://github.com/pforgetest/gpcp-from-gcs-feedstock/pull/3#issuecomment-1407259639.
Next steps:

- Open a `[ci][don't-merge]` PR on, for starters, https://github.com/pforgetest/gpcp-from-gcs-feedstock. This can be the first stage of the integration test for which this PR is named. The test can be triggered when the `deployment_status` event comes back with `state='success'`. If for some reason we want to limit the number of Dataflow jobs submitted, we can add an additional requirement that a `test-dataflow` and/or `no-test-dataflow` label exist on the PR.
- The `gpcp-from-gcs` run above thickens the plot here! Apparently there may be something specific to those recipes which is making them fail to deploy. We can copy-and-paste them into a feedstock on `pforgetest`, and then test them as we've done with `gpcp-from-gcs`.

It's all coming together! Really happy with how this test infrastructure is shaping up, and very curious to leverage it to finally figure out why the recipes discussed in #220 failed to deploy.
The workflow `test-dataflow-integration.yaml` now successfully makes it to the Run test step (while also providing that step with the appropriate env), in response to both of the event paths that should trigger the workflow:

1. A `deployment_status` event occurs with `state == 'success'` (and the `test-dataflow` label already exists on the PR)
2. A `test-dataflow` label is added to a PR which already has a successful deployment associated with the head sha¹

¹ This took a little while to get right because in each case, we need to traverse back into the GitHub API to get the full picture (for `deployment_status`, we need to query PR labels; and for the `pull_request` labeled event, we need to query deployment status).
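Once the missing half of the picture has been fetched from the API, both trigger paths reduce to the same predicate. Roughly (a sketch, not the actual workflow logic):

```python
def should_dispatch_test(deployment_state, pr_labels) -> bool:
    """Dispatch the integration test only when both facts line up.
    For a deployment_status event, pr_labels must be queried from the PR;
    for a pull_request 'labeled' event, deployment_state must be queried
    for the head sha. Either way, the condition is the same."""
    return deployment_state == "success" and "test-dataflow" in pr_labels
```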
Now that the trigger is in place, the next step is to actually dispatch the test (instead of the placeholder `echo foo`). This is the right place to dispatch to pytest, because we now know that the review app is ready to receive requests.
Thinking about what the input for the integration test should be. I believe what makes the most sense is:
- `REVIEW_APP_URL`: the deployment to test against (this is already part of the latest commits)
- `SOURCE_BRANCH`: a branch to duplicate as a PR to `pforgetest/test-staged-recipes`
By making these env vars (rather than hardcoding them into the test fixtures), we have the option of manually running the test on an ad hoc basis against some troublesome PR or another. This would greatly help speed up the debugging cycle when some specific PR(s) start having issues (a very common type of debugging required on Pangeo Forge Cloud). Hardcoding these specific PRs into the test fixtures is less desirable both because it requires committing new code to the repo to run the test, and equally so because typically these problem PRs will be pointing at "real" data sources, which in general we do not want to be requesting data from as a regular practice in CI.
Edit, notes on above:
> `SOURCE_BRANCH`: a branch to duplicate as a PR to `pforgetest/test-staged-recipes`

Presumably we'll need conditional logic to handle the case of PRs to `pforgetest/*-feedstock` repos. But `test-staged-recipes` is a good place to start.
> manually running the test on an ad hoc basis

This can be done via workflow dispatch.
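A `workflow_dispatch` block along these lines would enable those ad hoc runs (a sketch; input names simply mirror the env vars above and are not the committed workflow):

```yaml
on:
  workflow_dispatch:
    inputs:
      review_app_url:
        description: "Deployment to test against"
        required: true
      source_branch:
        description: "Branch to duplicate as a PR to pforgetest/test-staged-recipes"
        required: true
```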
Next steps:
`tests.integration/test_dataflow.py` is nearly complete. As a gut check, it's probably worth running this test manually from my local machine next, to see if the correct state is created on GitHub. There could easily be some small typos in this code, and it will be easier to debug manually than here in CI.

The last commit hides our token in the logs in the case of test failures, e.g.:
```
―――――――――――――――――――――――――――――――――――――――――――――― ERROR at teardown of test_dataflow ――――――――――――――――――――――――――――――――――――――――――――――
gh = <gidgethub.aiohttp.GitHubAPI object at 0x113279940>, gh_token = SecretStr('**********')
gh_kws = {'accept': 'application/vnd.github+json'}, gh_workflow_run_id = '12345'
source_pr = {'pr_number': '22', 'repo_full_name': 'pforgetest/test-staged-recipes'}, base = 'pforgetest/test-staged-recipes'
pr_label = 'fwd:pforge-pr-226.herokuapp.com/github/hooks/'
```
I've now called `tests.integration/test_dataflow.py` manually with:
```shell
$ GH_WORKFLOW_RUN_ID=12345 \
  SOURCE_REPO_FULL_NAME=pforgetest/test-staged-recipes \
  SOURCE_REPO_PR_NUMBER=22 \
  REVIEW_APP_URL=pforge-pr-226.herokuapp.com \
  pytest -vx tests.integration/test_dataflow.py
```
which has successfully created (and subsequently deleted) a test PR:
https://github.com/pforgetest/test-staged-recipes/pull/28
An interesting finding from this experiment:

- The `pull_request.opened` webhook is not forwarded to the specified review app by the `dev-app-proxy` GitHub App, because the PR does not have any labels at open-time.
- For the `synchronize` task to run, then, we'll need to make another commit to the PR after it is opened and labeled.

As shown in the conversation thread on https://github.com/pforgetest/test-staged-recipes/pull/33:
A `/run` comment on the test PR yields:

```
2023-02-08T23:56:27.881666+00:00 app[web.1]: File "/opt/app/pangeo_forge_orchestrator/routers/github_app.py", line 743, in run
2023-02-08T23:56:27.881666+00:00 app[web.1]: message = json.loads(recipe_run.message)
2023-02-08T23:56:27.881666+00:00 app[web.1]: File "/usr/lib/python3.9/json/__init__.py", line 339, in loads
2023-02-08T23:56:27.881666+00:00 app[web.1]: raise TypeError(f'the JSON object must be str, bytes or bytearray, '
2023-02-08T23:56:27.881667+00:00 app[web.1]: TypeError: the JSON object must be str, bytes or bytearray, not NoneType
```
I've corrected what appears to be the problem reflected in that trace in the last commit. Not sure if this is also the problem affecting the blocked PRs in #220, but it's not impossible. (When I get to the point of testing against those PRs, I can temporarily revert this change, to see if it is indeed what makes the difference.)
I'm also quite confused as to why, higher up in this trace, we're seeing a `repo2docker/utils.py` error related to one of the lines in the Dockerfile in this repo. I guess when pangeo-forge-runner is initialized from the root of this repo, it is seeing the Dockerfile and trying to... build something from it? 🤷

I'll remove this recipe_run.message issue from consideration first, and then see if this repo2docker thing persists. Actually, this does seem like it may be significant, because I recall being quite confused when looking at other traces from #220 by Docker-related errors referencing lines from our Dockerfile.
> I'll remove this recipe_run.message issue from consideration first, and then see if this repo2docker thing persists.
I've just manually run this test against the review app built from 75a6d02, which resolved the recipe_run.message issue. The traceback from the `/run` command issued on https://github.com/pforgetest/test-staged-recipes/pull/34 (the automated PR made by my latest manual test run) confirms that the remaining issue is repo2docker-related:
I think what's happening here is that when the subprocess call is made to `pangeo-forge-runner`, the presence of the Dockerfile for our web app in the working directory is throwing off some inner layer of initialization of `pangeo-forge-runner`... this is definitely unexpected and (to me) a highly non-obvious failure mode! I will try removing the Dockerfile from the production container (it's not needed there) and see if that resolves the problem.
Edit: Above theory is totally wrong 😆. I now see the `git reset` command which failed is the one we expect this call to make, which is for resetting the clone of the feedstock repo. So this is not, in fact, some totally non-obvious code path, but actually just a failure of the obvious code path, which is attempting to clone the repo we've asked for. Will try to debug that now...
> Will try to debug that now...
Initial notes on this. Inside a locally running instance of the app container, `git reset` can be called without error both from the bash console and from a Python `subprocess`:
```
root@1f008a222d9f:/opt/app/test-staged-recipes# git reset --hard 246a00f8f0939883c7919a9a9e17880888bf6087
HEAD is now at 246a00f Create meta.yaml
root@1f008a222d9f:/opt/app/test-staged-recipes# python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_output("git reset --hard 246a00f8f0939883c7919a9a9e17880888bf6087".split())
b'HEAD is now at 246a00f Create meta.yaml\n'
>>>
```
so this would not appear to be an issue with a broken `git` installation.
Edit: Confirmed with `heroku run /bin/bash -a pforge-pr-226` that these same commands can run inside the deployed review app container (as opposed to just a local build).
Update: disregard git confusion above! With only a minor modification, the test as written can actually submit a job to dataflow, as observed in my latest manual run, which deployed a job to dataflow from https://github.com/pforgetest/test-staged-recipes/pull/35.
The only change needed (which I have not yet pushed here) is to `time.sleep(60)` after making the `/run` comment. Previously, the fixture cleanup was deleting the test branch before the job could be deployed.
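The ordering issue can be sketched like this (helper names are hypothetical, not the fixture's actual API; the point is only that teardown must not race the job submission):

```python
import time

def run_with_pr(open_pr, make_run_comment, delete_branch, wait_secs=60):
    """Sketch of the fixture lifecycle: open the test PR, make the /run
    comment, then wait before teardown so Dataflow has time to pick up the
    job before the test branch vanishes."""
    pr = open_pr()
    try:
        make_run_comment(pr)
        time.sleep(wait_secs)  # without this, cleanup deleted the branch too early
    finally:
        delete_branch(pr)
```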
Next steps:
Right now the failure point on the test is related to getting the GitHub credentials decrypted with SOPS. So far I've been doing it this way because it allows us to avoid duplicating (semi-verbose) code used in the FastAPI app for authenticating as a GitHub App.
The more I think about it though, relying on that app itself in its own test, even if the part of the app I'm relying on is not specifically what's being tested, smells like a circularity. So rather than trying to solve this SOPS thing, I'm just going to bite the bullet and duplicate some code into the test, thereby dropping this circularity concern.
The last commit was the first to successfully create a PR, here: https://github.com/pforgetest/test-staged-recipes/pull/39
But for some reason the app container restarted after the synchronization task began, leaving it unfinished:
```
2023-02-10T18:41:02.758459+00:00 app[web.1]: 2023-02-10 18:41:02,758 INFO - orchestrator - Synchronizing https://github.com/pforgetest/test-staged-recipes at 0299239b5b30e03fbc11f7572c31ce0d3963242a.
2023-02-10T18:41:02.760749+00:00 heroku[router]: at=info method=POST path="/github/hooks/" host=pforge-pr-226.herokuapp.com request_id=f136954d-0bf9-4921-a2d0-8b9fc78ca1db fwd="52.90.187.124" dyno=web.1 connect=0ms service=204ms status=202 bytes=412 protocol=https
2023-02-10T18:41:03.330084+00:00 app[api]: Release v22 created by user charlesisaacstern@gmail.com
2023-02-10T18:41:03.567705+00:00 heroku[web.1]: Restarting
2023-02-10T18:41:03.571439+00:00 heroku[web.1]: State changed from up to starting
2023-02-10T18:41:04.315253+00:00 heroku[web.1]: Stopping all processes with SIGTERM
```
This looks like we're not waiting quite long enough in the `Is app up?` step of the workflow.
Seems like the failure on the last test run may have just been a fluke, as the latest push resulted in a successful job deployment, from the automated PR https://github.com/pforgetest/test-staged-recipes/pull/40
So the next thing is to build some awareness of the job completion state into the test.
🎉 Calling this test locally now succeeds, with the following command
```shell
$ DEV_APP_PROXY_GITHUB_APP_PRIVATE_KEY='****' \
  GH_WORKFLOW_RUN_ID=12345 \
  SOURCE_REPO_FULL_NAME=pforgetest/test-staged-recipes \
  SOURCE_REPO_PR_NUMBER=22 \
  REVIEW_APP_URL=https://pforge-pr-226.herokuapp.com/ \
  pytest -vxs tests.integration/test_dataflow.py
```
outputting
```
[...pytest boilerplate clipped...]
collecting ...
Yielding completed_pr['head']['sha'] = '6589c0c167d808bc0eba3ad1242d919a800cb6ab' from recipes_pr fixture...
Querying review app database for recipe run id...
Found matching recipe run in review app database with recipe_run_id = 23...
Making comment on test PR with comment_body = '/run gpcp-from-gcs'
Polling review app for dataflow job submission status...
Confirmed dataflow job submitted with job_id = '2023-02-10_16_45_38-13412192724389530754'
Waiting for 5 mins, starting at utc_time = '2023-02-11 00:45:55'
Time elapsed = 300.0161998271942
Current state = 'Running'
Time elapsed = 332.52712869644165
Current state = 'Done'
tests.integration/test_dataflow.py::test_dataflow ✓ 100% ██████████
```
And here is the pr that was created: https://github.com/pforgetest/test-staged-recipes/pull/58
Next steps:
Parametrize source files for test PR and see if we can reproduce errors discussed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220
This is almost complete. Just making a few notes on it here so I remember where to pick up with it tomorrow (which can also be useful for forthcoming documentation).
- The Dataflow integration test now assumes all source PRs are active against `pforgetest/test-staged-recipes`.
- The test uses a `{pr_number}::{recipe_id}` syntax to figure out which PR numbers on `pforgetest/test-staged-recipes` to replicate for the test, and within that replicated PR, which specific recipe to run.
- By default, the test runs against `22::gpcp-from-gcs`.
- Additional test targets can be specified by adding a label to the orchestrator PR (e.g. this current PR) in the format `also-test:{pr_number}::{recipe_id}`.
Currently the jobs are being spawned correctly, but there is a crash occurring where they try to use the same ref name to make their duplicate PRs, causing the test session which gets to the PR creation step second to error out with `Reference already exists`. This is because the duplicated PRs attempt to use the workflow run number for the ref name, so I just need to add one more layer of specificity here, tied to the job name somehow.
One last thought: it's possible certain PRs will require a longer Dataflow timeout, which, if so, I can add as part of the label syntax, making it, e.g., `{pr_number}::{recipe_id}::{timeout}s`.
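The full target syntax, with the optional timeout suffix, might parse like this (a sketch; the regex and return shape are assumptions):

```python
import re

def parse_target(spec: str):
    """Parse '{pr_number}::{recipe_id}' with an optional '::{timeout}s' suffix,
    e.g. '22::gpcp-from-gcs' or '22::gpcp-from-gcs::900s'.
    Returns (pr_number, recipe_id, timeout_seconds_or_None)."""
    m = re.fullmatch(r"(\d+)::([^:]+?)(?:::(\d+)s)?", spec)
    if m is None:
        raise ValueError(f"invalid target spec: {spec!r}")
    pr_number, recipe_id, timeout = m.groups()
    return int(pr_number), recipe_id, (int(timeout) if timeout else None)
```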
While not every single task listed above is complete here, I'm going to merge this now, because the core tests all pass, and also because some enhancements made here to review app builds will be very useful for starting to work on #233.
Towards debugging #220. Closes #225 when complete.
As outlined in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/225#issuecomment-1401212954, the first step towards this testing is gaining manual (PR-label-based) control over when Heroku Review Apps are built. To recap, without this we would either: