Currently using https://github.com/yuvipanda/mybinder-analytics/blob/main/logs.py to debug this... @yuvipanda this tool is amazing, I will definitely add it to #150 once I figure out what's going on here.
I've changed the title of this issue to reflect that it appears the jobs are not even starting. There are no worker logs for these jobs; digging into the `kubelet`, `harness`, etc. layers now.
I have a theory! Possible that a minor version mismatch between the `apache-beam` version in the deployment environment (`2.41`; see below, this dep is unpinned in `pangeo-forge-runner`, which brings it into orchestrator) and the version in the worker image (`2.40`) is to blame?
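One hypothetical way to confirm or rule this out (not something from the original investigation, just a sketch; it assumes Docker is available locally and that `pangeo/forge:8a862dc` is the worker image, per the rest of this thread) is to compare the `apache-beam` version inside the worker image against the one in the deployment environment:

```bash
# Version of apache-beam baked into the worker image
docker run --rm --entrypoint=/bin/bash pangeo/forge:8a862dc \
  -c 'pip show apache-beam | grep ^Version'

# Version of apache-beam in the local deployment environment
pip show apache-beam | grep ^Version
```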
As nice as the `2.41`/`2.40` incompatibility theory is, that doesn't seem to be the issue. I've just:
1. Run the exact worker image locally:

   ```
   docker run -it --entrypoint=/bin/bash pangeo/forge:8a862dc
   ```

2. Installed the minimal extra dependencies needed for `pangeo-forge-runner` to run, namely:

   ```
   pip install pangeo-forge-runner --no-deps
   pip install jupyter-repo2docker
   ```

3. Saved the following `pangeo-forge-runner` config to `c.json`:
{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "cisaacstern-gpcp-from-gcs-test-4"}, "TargetStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-target"}, "root_path": "pangeo-forge-dev-target/cisaacstern-gpcp-from-gcs-test-4", "public_url": "https://storage.googleapis.com/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-cache"}, "root_path": "pangeo-forge-dev-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-dev-cache/metadata/cisaacstern-gpcp-from-gcs-test-4"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-dev-dataflow/temp"}}
4. And run:

   ```
   $ pangeo-forge-runner bake --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 --json -f c.json --DataflowBakery.project_id="pangeo-forge-4967" --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com"
   ```
This job was deployed from the exact same image as is run on the Dataflow workers (with the exception of the two pip installs mentioned above). Despite that, it's still hanging 25 mins after startup.
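For what it's worth, job state can also be checked from the CLI while it hangs. A minimal sketch (the `--region` value is an assumption, not something stated in this thread; the job name comes from the `c.json` above):

```bash
# List the submitted Dataflow job and its current state.
gcloud dataflow jobs list \
  --project=pangeo-forge-4967 \
  --region=us-central1 \
  --filter="name:cisaacstern-gpcp-from-gcs-test-4"
```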
Here is a `pip list` of each of these two containers:
And here is a diff of those two files:
```diff
diff pip-list-worker-img-base.txt pip-list-worker-img-plus-extra-installs-txt
54a55
> chardet 5.0.0
94a96
> escapism 1.0.1
178a181
> iso8601 1.1.0
188a192
> jupyter-repo2docker 2022.2.0
258a263
> pangeo-forge-runner 0.5
352a358
> semver 2.13.0
```
These differences do not strike me as capable of causing the startup issues we're seeing.
By process of elimination, I'm left to conclude that the default `pangeo/forge:8a862dc` container image currently hardcoded in the latest `pangeo-forge-runner==0.5` release is no longer functional on Dataflow. Assuming this is true, 🚨 no recipes deployed from GitHub by Pangeo Forge Cloud will succeed until this is fixed. 🚨 I'll explain my reasoning here.
Following https://cloud.google.com/dataflow/docs/guides/using-custom-containers#use_an_base_image, I wrote:
```Dockerfile
# Dockerfile
FROM apache/beam_python3.9_sdk:2.41.0
RUN pip install pangeo-forge-recipes pandas gcsfs
```
and pushed this image to `gcr.io`:
```
$ export PROJECT=pangeo-forge-4967
$ export REPO=custom-beam-images
$ export TAG=3.9-2.41.0
$ export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
$ gcloud builds submit . --tag $IMAGE_URI
```
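As a sanity check before relying on an image like this, one can verify locally that the Beam boot entrypoint is present. A sketch, assuming Docker is available (`/opt/apache/beam/boot` is where the official Beam SDK images place the harness boot binary):

```bash
# Confirm the Beam harness boot binary exists in the custom image.
docker run --rm --entrypoint=/bin/bash $IMAGE_URI -c 'ls -l /opt/apache/beam/boot'
```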
I then deployed https://github.com/pforgetest/gpcp-from-gcs-feedstock to Dataflow with the same `c.json` config listed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264090513, the only difference being specification of this custom beam image:
```
$ pangeo-forge-runner bake \
    --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git \
    --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 \
    --json \
    -f c.json \
    --DataflowBakery.project_id="pangeo-forge-4967" \
    --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com" \
    --Bake.container_image="gcr.io/pangeo-forge-4967/custom-beam-images:3.9-2.41.0"
```
Note: To do this, I lifted `--Bake.container_image` into a configurable attribute (whereas it is hardcoded in the latest `pangeo-forge-runner==0.5` release). I will make a PR to `pangeo-forge-runner` with this change momentarily.
This was the only job of the dozen or so jobs I've submitted in the last ~24 hrs to succeed:

In fact, it was the only job I've submitted in the last ~24 hrs that started at all, whereas all others hung indefinitely while "waiting for the sdk harness to register", e.g.:
```
[2022-09-30T21:13:52.126125+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:02.868192+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:16.064972+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:27.910805+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:40.922696+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
```
This failure mode suggests to me some problem in the default `pangeo/forge:8a862dc` image with the beam `ENTRYPOINT`.
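(For reference, the harness logs above came from the logs.py tool linked at the top of this thread. Roughly equivalent output can be pulled with `gcloud logging read`; sketched here with an illustrative `JOB_ID` placeholder:)

```bash
# Read the harness logs for a Dataflow job directly from Cloud Logging.
# JOB_ID is a placeholder; substitute the actual Dataflow job id.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND log_id("dataflow.googleapis.com/harness")' \
  --project=pangeo-forge-4967 \
  --limit=20
```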
Once this is fixed upstream in `pangeo-forge-runner` (and maybe https://github.com/pangeo-data/pangeo-docker-images), Dataflow job submission via Pangeo Forge Cloud should presumably be repairable by upgrading `pangeo-forge-runner`, and then following the release cycle by merging the upgraded `requirements.txt` to `main` followed by `prod`.
On further thought, it's possible (likely, even) that no changes to https://github.com/pangeo-data/pangeo-docker-images are necessary, and rather simply re-pinning the tag of `pangeo/forge:8a862dc` to the latest build of https://github.com/pangeo-data/pangeo-docker-images/tree/master/forge would solve this problem. Presumably, a more recent build of this image has brought in whatever change to the official beam `ENTRYPOINT` is necessary to register jobs on Dataflow.
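One hypothetical way to test this theory locally (not something I've run, just a sketch; it assumes a `latest` tag exists for the image) is to compare the `ENTRYPOINT` of the pinned build against a more recent one:

```bash
# Compare ENTRYPOINTs of the pinned vs. latest builds of the forge image.
for tag in 8a862dc latest; do
  docker pull pangeo/forge:$tag
  docker inspect --format='{{json .Config.Entrypoint}}' pangeo/forge:$tag
done
```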
I got tests in https://github.com/pangeo-forge/pangeo-forge-runner/pull/35 to pass, merged it, and made a release. Also made https://github.com/pangeo-data/pangeo-docker-images/pull/379 to bump the beam version in the image itself.
@yuvipanda, for some reason Dataflow workers are not able to pull the new default `pangeo/forge:7c87e6c` image.
Using the same `c.json` from https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264090513 (with just the `job_name` and storage `root_path`s changed):
{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "cisaacstern-gpcp-from-gcs-test-7"}, "TargetStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-target"}, "root_path": "pangeo-forge-dev-target/cisaacstern-gpcp-from-gcs-test-7", "public_url": "https://storage.googleapis.com/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-cache"}, "root_path": "pangeo-forge-dev-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-dev-cache/metadata/cisaacstern-gpcp-from-gcs-test-7"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-dev-dataflow/temp"}}
And the same `pangeo-forge-runner` command as in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264175655 (minus the `--Bake.container_image` option):
```
$ pangeo-forge-runner bake \
    --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git \
    --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 \
    --json \
    -f c.json \
    --DataflowBakery.project_id="pangeo-forge-4967" \
    --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com"
```
The `pangeo-forge-runner` (version `0.6.1`) logging confirms that Beam is picking up the new default image:
```
...
{"message": "Default Python SDK image for environment is apache/beam_python3.9_sdk:2.41.0"}
{"message": "Using provided Python SDK container image: pangeo/forge:7c87e6c"}
{"message": "Python SDK container image set to \"pangeo/forge:7c87e6c\" for Docker environment"}
...
```
Then on Dataflow, workers fail to start with:
```
2022-10-03 10:30:38.775 PDT  A worker was unable to start up. Error: Unable to pull container image due to error: image pull request failed with error: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers). This is likely due to an invalid SDK container image URL. Please verify any provided SDK container image is valid and that Dataflow workers have permissions to pull image.
```
At first I thought that

> This is likely due to an invalid SDK container image URL.

meant I just needed to prepend `docker.io/` to the container image name, but I tried that as well, and it did not appear to work.
Strangely:

- I can `docker pull` this image locally, so it's not that it doesn't exist.
- In `pangeo-forge-runner`, the previously hardcoded image URL was of the same format as this one, and yet did not have any problems.

Is it possible that the prior hardcoded image was on gcr.io, but `pangeo/forge:7c87e6c` is not?
> Is it possible that the prior hardcoded image was on gcr.io, but `pangeo/forge:7c87e6c` is not?
This has been ruled out. All images have always been hosted on Docker Hub.
Here's some relevant logging from the `docker` source of the failed job:
```
[2022-10-03T17:38:44.780940+00:00] [docker:cisaacstern-gpcp-from-gcs-10031028-4ccj-harness-dqt5] Attempting next endpoint for pull after error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[2022-10-03T17:38:44.783500+00:00] [docker:cisaacstern-gpcp-from-gcs-10031028-4ccj-harness-dqt5] Handler for POST /v1.41/images/create returned error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
which suggests some version of a Docker Hub pull issue on the workers.
See https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/157#issuecomment-1266225432 for current workaround to this Docker Hub pull issue, and link to upstream issue.
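I won't duplicate that PR's details here, but for context, a common shape for this kind of workaround is to mirror the image into GCR so workers pull from Google's registry instead of Docker Hub. A hypothetical sketch (registry paths illustrative, not necessarily what that PR actually does):

```bash
# Mirror the image from Docker Hub to GCR, then point Dataflow at the mirror.
docker pull pangeo/forge:7c87e6c
docker tag pangeo/forge:7c87e6c gcr.io/pangeo-forge-4967/forge:7c87e6c
docker push gcr.io/pangeo-forge-4967/forge:7c87e6c
# ...then deploy with --Bake.container_image="gcr.io/pangeo-forge-4967/forge:7c87e6c"
```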
Some basic test jobs (e.g. a pruned version of NOAA OISST) that I started last night from a Heroku Review App have been hanging on `cache_input` with no progress for > 9 hrs.

I did change the GCS buckets that review apps use last night, in https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/152, but if there were a permissions issue with the new cache bucket or something, I'd expect that to show up as an error in `cache_input`.
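To rule the permissions angle out explicitly, one could check the cache bucket's IAM bindings for the Dataflow service account. A sketch (bucket and service account names taken from the config earlier in this thread):

```bash
# Show IAM bindings on the new cache bucket and look for the Dataflow SA.
gsutil iam get gs://pangeo-forge-dev-cache \
  | grep -B2 -A2 pangeo-forge-dataflow
```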
I'm reading through https://medium.com/google-cloud/monitoring-your-dataflow-pipelines-80b9a2849f7a for ideas now, but TBH I have not seen this before and am unsure how to debug it.
Here's the comment that deployed this hanging job: https://github.com/pforgetest/test-staged-recipes/pull/15#issuecomment-1263316338
And here are the server logs for when the job was deployed: