Open thenaomig opened 1 year ago
@thenaomig and I have been working on this recipe at the post AMS Pangeo meeting :)
pre-commit.ci autofix
pre-commit.ci autofix
/run cmip6-wrf-wus
@andersy005 unfortunately, the backend service is currently broken, following my failed attempt to upgrade the pangeo-forge-recipes version used there. I am working on getting it fixed and will ping you here once that's the case. (Currently, jobs will submit, but they will fail because of version mismatching between the backend service client and Dataflow workers.)
thank you for the heads up, @cisaacstern! yeah, we can definitely wait until the issue is resolved. i just wanted to make sure @thenaomig was able to submit the recipe for the end of the workshop.
Thanks for the contribution, @thenaomig! We'll have this all working again shortly. 🙏
pre-commit.ci autofix
@cisaacstern, we are getting
TypeError: object of type FilePattern not serializable
do you happen to know why this is happening now (as far as i can tell, this issue wasn't there until we switched to a dict of recipes)
never mind.. @thenaomig found the problem.
/run cesm2_r11i1p1f1_ssp370
@andersy005 unfortunately job submission failed.
I've opened https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220 to track down why.
thank you for the update, @cisaacstern! i'll take a look at the issue you link to see if i can help diagnose it.
The test failed, but I'm sure we can find out why!
Pangeo Forge maintainers are working diligently to provide public logs for contributors. That feature is not quite ready yet, however, so please reach out on this thread to a maintainer, and they'll help you diagnose the problem.
So there are now two concurrent issues going on here:
The last message from the pangeo-forge app reporting a test run failure is the result of my manually deploying this job (as part of debugging problem 1). The backend logs show this error:
RuntimeError: botocore.exceptions.NoCredentialsError: Unable to locate credentials [while running 'Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/scan_file/Execute-ptransform-56']
which I believe is related to the fact that the data for this recipe are being pulled from an s3:// url.
Regarding problem 2, is there an HTTP endpoint for this data?
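For reference, a minimal sketch of how an s3:// url maps to an HTTPS one, in case that helps answer the question. It assumes the bucket allows anonymous public reads and is reachable through the generic virtual-hosted-style S3 endpoint; the bucket name and key below are hypothetical placeholders, not the actual location of this recipe's data, and keys with special characters would additionally need URL-encoding.

```python
from urllib.parse import urlparse


def s3_to_https(s3_url: str) -> str:
    """Map s3://bucket/key to a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is publicly readable. A private bucket will
    still require credentials, whatever the URL scheme.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # netloc is the bucket name; path is the object key with a leading "/"
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


# Hypothetical bucket and key, for illustration only
print(s3_to_https("s3://example-bucket/path/to/file.nc"))
```

Whether this avoids the NoCredentialsError depends on the bucket's access policy, so it is something to try rather than a confirmed fix.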
Hmm, I don't suppose this helps?
Apparently our logs hosting service (Papertrail) is currently down? Can't catch a break! 🤦
I'll check why the CI synchronize task is hanging as soon as it becomes available.
sorry @cisaacstern :(. thank you for the updates
If necessary, I would be happy to assist in debugging this tomorrow
Wow, looks like https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/221 did solve the hanging CI.
I'll try re-triggering the test run of the recipe now. 🤞
/run cesm2_r11i1p1f1_ssp370
The last test run submission failed as well. I've done a bit of digging on this, and discovered that this recipe appears to be generating an unusually large Beam pipeline artifact.
Brief background: when a job is submitted to Dataflow, the recipe module is compiled to an Apache Beam pipeline object, which is then serialized (pickled) and uploaded (cached) to Google Cloud Storage (GCS). When Dataflow starts up, it grabs this serialized artifact from GCS, de-serializes (un-pickles) it, and uses it to start the pipeline.
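To make the pickling step above concrete, here is a rough sketch of how the size of a serialized object can be measured locally. It uses the standard-library pickle as a stand-in; the serializer Beam actually uses may differ, and the dictionaries below are hypothetical placeholders rather than real recipe or pipeline objects.

```python
import pickle


def pickled_size_mb(obj) -> float:
    """Return the size of the pickled object in MB (1 MB = 1e6 bytes)."""
    return len(pickle.dumps(obj)) / 1e6


# Hypothetical stand-ins: an object that embeds few vs. many file URLs
small = {"urls": [f"s3://example-bucket/file_{i}.nc" for i in range(100)]}
large = {"urls": [f"s3://example-bucket/file_{i}.nc" for i in range(100_000)]}

print(f"small: {pickled_size_mb(small):.3f} MB")
print(f"large: {pickled_size_mb(large):.3f} MB")
```

The point is only that whatever gets embedded in the pickled object (long lists of inputs, for example) travels with the artifact, so an unusually large artifact suggests the recipe object is carrying a lot of state.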
Currently, we have around 150 serialized pipeline artifacts stored in GCS from recent Pangeo Forge recipe runs. The majority of these artifacts are in the range of 0.15-0.30 MB (150-300 KB).
The one job which has been run from this PR is the job which I mentioned having manually deployed during the course of debugging. This was the job associated with recipe run 1486. (That link doesn't make this fact too obvious, but you'll note that the Git SHA there is https://github.com/pangeo-forge/staged-recipes/pull/247/commits/34a4f3f9f38499ddbcd118a2b59bf4108d62d42a, which is part of this PR.)
The pipeline artifact for (the manually deployed) recipe run 1486 is 4.39 MB (I've removed the other x tick labels for clarity):
Though I can't say I know why the pipeline artifact is so large for this recipe, its size may be a clue as to why this particular recipe is causing worker timeout / OOM conditions, which I've also just documented a bit further in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220#issuecomment-1399072193.
This is a test with one of many simulations from CMIP6 downscaled with WRF at UCLA.