pangeo-forge / pangeo-forge-orchestrator

Database API and GitHub App backend for Pangeo Forge Cloud.
https://api.pangeo-forge.org/docs

Document how to re-run failed `pangeo-forge-runner` cmds from server logs locally #148

Closed: cisaacstern closed this issue 10 months ago

cisaacstern commented 2 years ago

_Originally posted by @cisaacstern in https://github.com/pangeo-forge/staged-recipes/pull/169#discussion_r983005046, in discussion with @andersy005 about how to debug failed `/run` commands on staged-recipes, when the issue is a `CalledProcessError` on the call to `pangeo-forge-runner`. This important debugging trick should be added to `docs/README.md`._

So what I did was watch the server logs right after issuing `/run` on the test, and I saw:

2022-09-29T02:02:56.875589+00:00 app[web.1]: 2022-09-29 02:02:56,875 DEBUG - orchestrator - Dumping bakery config to json: {'Bake': {'bakery_class': 'pangeo_forge_runner.bakery.dataflow.DataflowBakery', 'job_name': 'a6170692e70616e67656f2d666f7267652e6f7267251191'}, 'TargetStorage': {'fsspec_class': 's3fs.S3FileSystem', 'fsspec_args': {'client_kwargs': {'endpoint_url': 'https://ncsa.osn.xsede.org'}, 'default_cache_type': 'none', 'default_fill_cache': False, 'use_listings_cache': False, 'key': SecretStr('**********'), 'secret': SecretStr('**********')}, 'root_path': 'Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr', 'public_url': 'https://ncsa.osn.xsede.org/{root_path}'}, 'InputCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {'bucket': 'pangeo-forge-prod-cache'}, 'root_path': 'pangeo-forge-prod-cache'}, 'MetadataCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {}, 'root_path': 'pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr'}, 'DataflowBakery': {'temp_gcs_location': 'gs://pangeo-forge-prod-dataflow/temp'}}
2022-09-29T02:02:56.877275+00:00 app[web.1]: 2022-09-29 02:02:56,877 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']
2022-09-29T02:02:58.116682+00:00 heroku[web.1]: Process running mem=647M(120.0%)
2022-09-29T02:02:58.143937+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2022-09-29T02:03:00.449388+00:00 app[web.1]: [2022-09-29 02:03:00 +0000] [60] [ERROR] Exception in ASGI application
2022-09-29T02:03:00.449397+00:00 app[web.1]: Traceback (most recent call last):
2022-09-29T02:03:00.449398+00:00 app[web.1]: File "/opt/app/pangeo_forge_orchestrator/routers/github_app.py", line 621, in run
2022-09-29T02:03:00.449398+00:00 app[web.1]: out = subprocess.check_output(cmd)
2022-09-29T02:03:00.449399+00:00 app[web.1]: File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
2022-09-29T02:03:00.449399+00:00 app[web.1]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2022-09-29T02:03:00.449400+00:00 app[web.1]: File "/usr/lib/python3.9/subprocess.py", line 528, in run
2022-09-29T02:03:00.449400+00:00 app[web.1]: raise CalledProcessError(retcode, process.args,
2022-09-29T02:03:00.449414+00:00 app[web.1]: subprocess.CalledProcessError: Command '['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']' returned non-zero exit status 1.
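As an aside on why this traceback is so terse: `subprocess.check_output` captures the child's stdout onto the exception's `output` attribute, and the `CalledProcessError` message itself only reports the command and exit status, which appears to be why the runner's own error payload never shows up in the server log excerpt above. A minimal sketch (not the orchestrator's actual error handling, and with placeholder arguments) of how that captured output can be surfaced when reproducing a failure like this:

```python
# Minimal sketch, not orchestrator code: on a non-zero exit, check_output raises
# CalledProcessError whose message only names the command and return code; the
# child's captured stdout lives on e.output and has to be printed explicitly.
import subprocess

cmd = ["pangeo-forge-runner", "bake", "--json", "-f=config.json"]  # placeholder args
try:
    print(subprocess.check_output(cmd).decode())
except subprocess.CalledProcessError as e:
    print(f"exit status: {e.returncode}")
    if e.output:
        print(e.output.decode())  # e.g. the runner's JSON error message
```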

But that didn't tell me much about what the `CalledProcessError` actually was. So I copy-and-pasted the config from the first log line

2022-09-29T02:02:56.875589+00:00 app[web.1]: 2022-09-29 02:02:56,875 DEBUG - orchestrator - Dumping bakery config to json: {'Bake': {'bakery_class': 'pangeo_forge_runner.bakery.dataflow.DataflowBakery', 'job_name': 'a6170692e70616e67656f2d666f7267652e6f7267251191'}, 'TargetStorage': {'fsspec_class': 's3fs.S3FileSystem', 'fsspec_args': {'client_kwargs': {'endpoint_url': 'https://ncsa.osn.xsede.org'}, 'default_cache_type': 'none', 'default_fill_cache': False, 'use_listings_cache': False, 'key': SecretStr('**********'), 'secret': SecretStr('**********')}, 'root_path': 'Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr', 'public_url': 'https://ncsa.osn.xsede.org/{root_path}'}, 'InputCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {'bucket': 'pangeo-forge-prod-cache'}, 'root_path': 'pangeo-forge-prod-cache'}, 'MetadataCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {}, 'root_path': 'pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr'}, 'DataflowBakery': {'temp_gcs_location': 'gs://pangeo-forge-prod-dataflow/temp'}}

into a local JSON file on my laptop

{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "a6170692e70616e67656f2d666f7267652e6f7267251191"}, "TargetStorage": {"fsspec_class": "s3fs.S3FileSystem", "fsspec_args": {"client_kwargs": {"endpoint_url": "https://ncsa.osn.xsede.org"}, "default_cache_type": "none", "default_fill_cache": false, "use_listings_cache": false, "key": "**********", "secret": "**********"}, "root_path": "Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr", "public_url": "https://ncsa.osn.xsede.org/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-prod-cache"}, "root_path": "pangeo-forge-prod-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-prod-dataflow/temp"}}

and then I copied the pangeo-forge-runner command from the second log line

2022-09-29T02:02:56.877275+00:00 app[web.1]: 2022-09-29 02:02:56,877 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']

and, replacing `-f=/tmp/tmplol5klds.json` with the path to my local JSON config (`c.json`), ran

$ pangeo-forge-runner bake --repo=https://github.com/norlandrhagen/staged-recipes --ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef --json --prune --Bake.recipe_id=eobs-wind-speed -f=c.json --feedstock-subdir=recipes/EOBS

which gave me a descriptive error:

{"message": "Error during running: 'eobs-tg-tn-tx-rr-hu-pp'", "exc_info": "Traceback (most recent call last):\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/bin/pangeo-forge-runner\", line 8, in <module>\n    sys.exit(main())\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/cli.py\", line 28, in main\n    app.start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/cli.py\", line 23, in start\n    super().start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/traitlets/config/application.py\", line 462, in start\n    return self.subapp.start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/commands/bake.py\", line 112, in start\n    recipes = feedstock.parse_recipes()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py\", line 55, in parse_recipes\n    recipes[r[\"id\"]] = self._import(r[\"object\"])\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py\", line 43, in _import\n    return self._import_cache[module][export]\nKeyError: 'eobs-tg-tn-tx-rr-hu-pp'", "status": "failed"}