Currently using https://github.com/yuvipanda/mybinder-analytics/blob/main/logs.py to debug this... @yuvipanda this tool is amazing, I will definitely add it to #150 once I figure out what's going on here.
I've changed the title of this issue to reflect that it appears the jobs are not even starting. There are no worker logs for these jobs; digging into the `kubelet`, `harness`, etc. layers now.
I have a theory! Possible that a minor version mismatch between the `apache-beam` version in the deployment environment (`2.41`; see below, this dep is unpinned in `pangeo-forge-runner`, which brings it into orchestrator) and the version in the worker image (`2.40`) is to blame?
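One hypothetical way to confirm or rule this out (not something from the original investigation, just a sketch; it assumes Docker is available locally and that `pangeo/forge:8a862dc` is the worker image, per the rest of this thread) is to compare the `apache-beam` version inside the worker image against the one in the deployment environment:

```bash
# Version of apache-beam baked into the worker image
docker run --rm --entrypoint=/bin/bash pangeo/forge:8a862dc \
  -c 'pip show apache-beam | grep ^Version'

# Version of apache-beam in the local deployment environment
pip show apache-beam | grep ^Version
```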
As nice as the `2.41`/`2.40` incompatibility theory is, that doesn't seem to be the issue. I've just:
1. Run the exact worker image locally:

   ```
   docker run -it --entrypoint=/bin/bash pangeo/forge:8a862dc
   ```

2. Installed the minimal extra dependencies needed for `pangeo-forge-runner` to run, namely:

   ```
   pip install pangeo-forge-runner --no-deps
   pip install jupyter-repo2docker
   ```

3. Saved the following `pangeo-forge-runner` config to `c.json`:
{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "cisaacstern-gpcp-from-gcs-test-4"}, "TargetStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-target"}, "root_path": "pangeo-forge-dev-target/cisaacstern-gpcp-from-gcs-test-4", "public_url": "https://storage.googleapis.com/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-cache"}, "root_path": "pangeo-forge-dev-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-dev-cache/metadata/cisaacstern-gpcp-from-gcs-test-4"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-dev-dataflow/temp"}}
4. And run:

   ```
   $ pangeo-forge-runner bake --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 --json -f c.json --DataflowBakery.project_id="pangeo-forge-4967" --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com"
   ```
This job was deployed from the exact same image as is run on the Dataflow workers (with the exception of the two pip installs mentioned above). Despite that, it's still hanging 25 mins after startup.
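For what it's worth, job state can also be checked from the CLI while it hangs. A minimal sketch (the `--region` value is an assumption, not something stated in this thread; the job name comes from the `c.json` above):

```bash
# List the submitted Dataflow job and its current state.
gcloud dataflow jobs list \
  --project=pangeo-forge-4967 \
  --region=us-central1 \
  --filter="name:cisaacstern-gpcp-from-gcs-test-4"
```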
Here is a `pip list` of each of these two containers:
And here is a diff of those two files:
```diff
diff pip-list-worker-img-base.txt pip-list-worker-img-plus-extra-installs-txt
54a55
> chardet 5.0.0
94a96
> escapism 1.0.1
178a181
> iso8601 1.1.0
188a192
> jupyter-repo2docker 2022.2.0
258a263
> pangeo-forge-runner 0.5
352a358
> semver 2.13.0
```
These differences do not strike me as capable of causing the startup issues we're seeing.
By process of elimination, I'm left to conclude that the default `pangeo/forge:8a862dc` container image currently hardcoded in the latest `pangeo-forge-runner==0.5` release is no longer functional on Dataflow. Assuming this is true, 🚨 no recipes deployed from GitHub by Pangeo Forge Cloud will succeed until this is fixed. 🚨 I'll explain my reasoning here.
Following https://cloud.google.com/dataflow/docs/guides/using-custom-containers#use_an_base_image, I wrote:
```Dockerfile
# Dockerfile
FROM apache/beam_python3.9_sdk:2.41.0
RUN pip install pangeo-forge-recipes pandas gcsfs
```
and pushed this image to `gcr.io`:
```
$ export PROJECT=pangeo-forge-4967
$ export REPO=custom-beam-images
$ export TAG=3.9-2.41.0
$ export IMAGE_URI=gcr.io/$PROJECT/$REPO:$TAG
$ gcloud builds submit . --tag $IMAGE_URI
```
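As a sanity check before relying on an image like this, one can verify locally that the Beam boot entrypoint is present. A sketch, assuming Docker is available (`/opt/apache/beam/boot` is where the official Beam SDK images place the harness boot binary):

```bash
# Confirm the Beam harness boot binary exists in the custom image.
docker run --rm --entrypoint=/bin/bash $IMAGE_URI -c 'ls -l /opt/apache/beam/boot'
```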
I then deployed https://github.com/pforgetest/gpcp-from-gcs-feedstock to Dataflow with the same `c.json` config listed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264090513, the only difference being specification of this custom beam image:
```
$ pangeo-forge-runner bake \
    --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git \
    --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 \
    --json \
    -f c.json \
    --DataflowBakery.project_id="pangeo-forge-4967" \
    --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com" \
    --Bake.container_image="gcr.io/pangeo-forge-4967/custom-beam-images:3.9-2.41.0"
```
Note: To do this, I lifted `--Bake.container_image` into a configurable attribute (whereas it is hardcoded in the latest `pangeo-forge-runner==0.5` release). I will make a PR to `pangeo-forge-runner` with this change momentarily.
This was the only job of the dozen or so jobs I've submitted in the last ~24 hrs to succeed:

In fact, it was the only job I've submitted in the last ~24 hrs that started at all, whereas all others hung indefinitely while "waiting for the sdk harness to register", e.g.:
```
[2022-09-30T21:13:52.126125+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:02.868192+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:16.064972+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:27.910805+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
[2022-09-30T21:14:40.922696+00:00] [harness:cisaacstern-gpcp-from-gcs-09301405-sajs-harness-8pxv] Waiting for 2 SDK Harnesses to register.
```
This failure mode suggests to me some problem in the default `pangeo/forge:8a862dc` image with the beam `ENTRYPOINT`.
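(For reference, the harness logs above came from the logs.py tool linked at the top of this thread. Roughly equivalent output can be pulled with `gcloud logging read`; sketched here with an illustrative `JOB_ID` placeholder:)

```bash
# Read the harness logs for a Dataflow job directly from Cloud Logging.
# JOB_ID is a placeholder; substitute the actual Dataflow job id.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND log_id("dataflow.googleapis.com/harness")' \
  --project=pangeo-forge-4967 \
  --limit=20
```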
Once this is fixed upstream in `pangeo-forge-runner` (and maybe https://github.com/pangeo-data/pangeo-docker-images), Dataflow job submission via Pangeo Forge Cloud should presumably be repairable by upgrading `pangeo-forge-runner`, and then following the release cycle by merging the upgraded `requirements.txt` to `main` followed by `prod`.
On further thought, it's possible (likely, even) that no changes to https://github.com/pangeo-data/pangeo-docker-images are necessary, and rather simply re-pinning the tag of `pangeo/forge:8a862dc` to the latest build of https://github.com/pangeo-data/pangeo-docker-images/tree/master/forge would solve this problem. Presumably, a more recent build of this image has brought in whatever change to the official beam `ENTRYPOINT` is necessary to register jobs on Dataflow.
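One hypothetical way to test this theory locally (not something I've run, just a sketch; it assumes a `latest` tag exists for the image) is to compare the `ENTRYPOINT` of the pinned build against a more recent one:

```bash
# Compare ENTRYPOINTs of the pinned vs. latest builds of the forge image.
for tag in 8a862dc latest; do
  docker pull pangeo/forge:$tag
  docker inspect --format='{{json .Config.Entrypoint}}' pangeo/forge:$tag
done
```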
I got tests in https://github.com/pangeo-forge/pangeo-forge-runner/pull/35 to pass, merged it, and made a release. Also made https://github.com/pangeo-data/pangeo-docker-images/pull/379 to bump the beam version in the image itself.
@yuvipanda, for some reason Dataflow workers are not able to pull the new default `pangeo/forge:7c87e6c` image.
Using the same `c.json` from https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264090513 (with just the `job_name` and storage `root_path`s changed):
{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "cisaacstern-gpcp-from-gcs-test-7"}, "TargetStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-target"}, "root_path": "pangeo-forge-dev-target/cisaacstern-gpcp-from-gcs-test-7", "public_url": "https://storage.googleapis.com/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-dev-cache"}, "root_path": "pangeo-forge-dev-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-dev-cache/metadata/cisaacstern-gpcp-from-gcs-test-7"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-dev-dataflow/temp"}}
And the same `pangeo-forge-runner` command as in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/156#issuecomment-1264175655 (minus the `--Bake.container_image` option):
```
$ pangeo-forge-runner bake \
    --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock.git \
    --ref 4f41e02512b2078c8bdb286368a1a9d878b5cec2 \
    --json \
    -f c.json \
    --DataflowBakery.project_id="pangeo-forge-4967" \
    --DataflowBakery.service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com"
```
The `pangeo-forge-runner` (version `0.6.1`) logging confirms that Beam is picking up the new default image:
```
...
{"message": "Default Python SDK image for environment is apache/beam_python3.9_sdk:2.41.0"}
{"message": "Using provided Python SDK container image: pangeo/forge:7c87e6c"}
{"message": "Python SDK container image set to \"pangeo/forge:7c87e6c\" for Docker environment"}
...
```
Then on Dataflow, workers fail to start with:
```
2022-10-03 10:30:38.775 PDT  A worker was unable to start up. Error: Unable to pull container image due to error: image pull request failed with error: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers). This is likely due to an invalid SDK container image URL. Please verify any provided SDK container image is valid and that Dataflow workers have permissions to pull image.
```
At first I thought that

> This is likely due to an invalid SDK container image URL.

meant I just needed to prepend `docker.io/` to the container image name, but I tried that as well, and it did not appear to work.
Strangely:

- I can `docker pull` this image locally, so it's not that it doesn't exist.
- In `pangeo-forge-runner`, the previously hardcoded image URL was of the same format as this one, and yet did not have any problems.

Is it possible that the prior hardcoded image was on gcr.io, but `pangeo/forge:7c87e6c` is not?
> Is it possible that the prior hardcoded image was on gcr.io, but `pangeo/forge:7c87e6c` is not?
This has been ruled out. All images have always been hosted on Docker Hub.
Here's some relevant logging from the `docker` source of the failed job:
```
[2022-10-03T17:38:44.780940+00:00] [docker:cisaacstern-gpcp-from-gcs-10031028-4ccj-harness-dqt5] Attempting next endpoint for pull after error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[2022-10-03T17:38:44.783500+00:00] [docker:cisaacstern-gpcp-from-gcs-10031028-4ccj-harness-dqt5] Handler for POST /v1.41/images/create returned error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
which suggests some version of a Docker Hub pull issue on the workers.
See https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/157#issuecomment-1266225432 for current workaround to this Docker Hub pull issue, and link to upstream issue.
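I won't duplicate that PR's details here, but for context, a common shape for this kind of workaround is to mirror the image into GCR so workers pull from Google's registry instead of Docker Hub. A hypothetical sketch (registry paths illustrative, not necessarily what that PR actually does):

```bash
# Mirror the image from Docker Hub to GCR, then point Dataflow at the mirror.
docker pull pangeo/forge:7c87e6c
docker tag pangeo/forge:7c87e6c gcr.io/pangeo-forge-4967/forge:7c87e6c
docker push gcr.io/pangeo-forge-4967/forge:7c87e6c
# ...then deploy with --Bake.container_image="gcr.io/pangeo-forge-4967/forge:7c87e6c"
```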
Some basic test jobs (e.g. a pruned version of NOAA OISST) that I started last night from a Heroku Review App have been hanging on `cache_input` with no progress for > 9 hrs.

I did change the GCS buckets that review apps use last night, in https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/152, but if there were a permissions issue with the new cache bucket or something, I'd expect that to show up as an error in `cache_input`.
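To rule the permissions angle out explicitly, one could check the cache bucket's IAM bindings for the Dataflow service account. A sketch (bucket and service account names taken from the config earlier in this thread):

```bash
# Show IAM bindings on the new cache bucket and look for the Dataflow SA.
gsutil iam get gs://pangeo-forge-dev-cache \
  | grep -B2 -A2 pangeo-forge-dataflow
```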
I'm reading through https://medium.com/google-cloud/monitoring-your-dataflow-pipelines-80b9a2849f7a for ideas now, but TBH I have not seen this before and am unsure how to debug it.
Here's the comment that deployed this hanging job: https://github.com/pforgetest/test-staged-recipes/pull/15#issuecomment-1263316338
And here are the server logs for when the job was deployed: