pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Cmip6 wrf wus #247

Open thenaomig opened 1 year ago

thenaomig commented 1 year ago

This is a test with one of many simulations from CMIP6 downscaled with WRF at UCLA.

andersy005 commented 1 year ago

@thenaomig and I have been working on this recipe at the post AMS Pangeo meeting :)

andersy005 commented 1 year ago

pre-commit.ci autofix

andersy005 commented 1 year ago

pre-commit.ci autofix

andersy005 commented 1 year ago

/run cmip6-wrf-wus

cisaacstern commented 1 year ago

@andersy005 unfortunately, the backend service is currently broken, following my failed attempt to upgrade the pangeo-forge-recipes version used there. I am working on getting it fixed and will ping you here once that's the case. (Currently, jobs will submit, but they will fail because of version mismatching between the backend service client and Dataflow workers.)

andersy005 commented 1 year ago

thank you for the heads up, @cisaacstern! yeah, we can definitely wait until the issue is resolved. i just wanted to make sure @thenaomig was able to submit the recipe for the end of the workshop.

cisaacstern commented 1 year ago

Thanks for the contribution, @thenaomig! We'll have this all working again shortly. 🙏

andersy005 commented 1 year ago

pre-commit.ci autofix

andersy005 commented 1 year ago

@cisaacstern, we are getting

TypeError: object of type FilePattern not serializable

do you happen to know why this is happening now (as far as i can tell, this issue wasn't there until we switched to a dict of recipes)

andersy005 commented 1 year ago

> @cisaacstern, we are getting
>
> TypeError: object of type FilePattern not serializable
>
> do you happen to know why this is happening now (as far as i can tell, this issue wasn't there until we switched to a dict of recipes)

never mind.. @thenaomig found the problem.

andersy005 commented 1 year ago

/run cesm2_r11i1p1f1_ssp370

cisaacstern commented 1 year ago

@andersy005 unfortunately job submission failed.

I've opened https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220 to track down why.

andersy005 commented 1 year ago

thank you for the update, @cisaacstern! i'll take a look at the issue you linked to see if i can help diagnose it.

pangeo-forge[bot] commented 1 year ago

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors. That feature is not quite ready yet, however, so please reach out on this thread to a maintainer, and they'll help you diagnose the problem.

cisaacstern commented 1 year ago

So there are now two concurrent issues going on here:

  1. The production deployment still appears to have a bug related to job submission, as discussed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220.
  2. The last message from the pangeo-forge app reporting a test run failure is the result of my manually deploying this job (as part of debugging problem 1). The backend logs show this error:

    RuntimeError: botocore.exceptions.NoCredentialsError: Unable to locate credentials [while running 'Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/scan_file/Execute-ptransform-56']

    which I believe is related to the fact that the data for this recipe are being pulled from an s3:// url.

Regarding problem 2, is there an HTTP endpoint for this data?
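
For context on what an HTTP endpoint could look like here: objects in a publicly readable S3 bucket can usually also be fetched over HTTPS using the virtual-hosted-style URL form. The sketch below only rewrites the URL string; the bucket name, key, and region are hypothetical placeholders (not the ones used by this recipe), and the resulting URL will only work if the bucket actually permits anonymous reads.

```python
from typing import Optional
from urllib.parse import urlparse


def s3_to_https(s3_url: str, region: Optional[str] = None) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    This is pure string manipulation: it does not check that the object
    exists or that the bucket allows unauthenticated access.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    bucket = parsed.netloc
    # Include the region in the hostname when it is known.
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    return f"https://{host}{parsed.path}"


# Hypothetical bucket, key, and region, for illustration only.
print(s3_to_https("s3://example-bucket/path/to/file.nc", region="us-west-2"))
```

If the bucket is not public, an HTTPS URL of this form will fail with an access error just as the s3:// URL does without credentials.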

thenaomig commented 1 year ago

> Regarding problem 2, is there an HTTP endpoint for this data?

Hmm, I don't suppose this helps?

cisaacstern commented 1 year ago

Apparently our logs hosting service (Papertrail) is currently down? Can't catch a break! 🤦

I'll check why the CI synchronize task is hanging as soon as it becomes available.

andersy005 commented 1 year ago

> Apparently our logs hosting service (Papertrail) is currently down? Can't catch a break! 🤦

sorry @cisaacstern :(. thank you for the updates

If necessary, I would be happy to assist in debugging this tomorrow

cisaacstern commented 1 year ago

Wow looks like https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/221 did solve the hanging CI.

I'll try re-triggering the test run of the recipe now. 🤞

cisaacstern commented 1 year ago

/run cesm2_r11i1p1f1_ssp370

cisaacstern commented 1 year ago

The last test run submission failed as well. I've done a bit of digging on this, and discovered that this recipe appears to be generating an unusually large Beam pipeline artifact.

Brief background: when a job is submitted to Dataflow, the recipe module is compiled to an Apache Beam pipeline object, which is then serialized (pickled) and uploaded (cached) to Google Cloud Storage (GCS). When Dataflow starts up, it grabs this serialized artifact from GCS, de-serializes (un-pickles) it, and uses it to start the pipeline.
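
To illustrate how the size of a serialized artifact tracks whatever state gets pickled along with it, here is a minimal sketch using the standard-library `pickle` module as a stand-in. The dictionaries and URLs are hypothetical, Beam's own serialization may differ in detail, and this is not a claim about what is inflating the artifact for this particular recipe.

```python
import pickle


def pickled_size_mb(obj) -> float:
    """Return the length of pickle.dumps(obj) in megabytes (1 MB = 1e6 bytes)."""
    return len(pickle.dumps(obj)) / 1e6


# Hypothetical stand-ins for state carried inside a pipeline object:
# a short list of input URLs versus a very long one.
small_state = {"urls": [f"s3://example-bucket/file_{i:05d}.nc" for i in range(100)]}
large_state = {"urls": [f"s3://example-bucket/file_{i:05d}.nc" for i in range(100_000)]}

print(f"small: {pickled_size_mb(small_state):.3f} MB")
print(f"large: {pickled_size_mb(large_state):.3f} MB")
```

The same kind of measurement could be applied to pieces of a recipe (for example its file pattern) to see which part dominates the serialized size.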

Currently, we have around 150 serialized pipeline artifacts stored in GCS from recent Pangeo Forge recipe runs. The majority of these artifacts are in the range of 0.15-0.30 MB (150-300 KB).

The one job which has been run from this PR is the job which I mentioned having manually deployed during the course of debugging. This was the job associated with recipe run 1486. (That link doesn't make this fact too obvious, but you'll note that the Git SHA there is https://github.com/pangeo-forge/staged-recipes/pull/247/commits/34a4f3f9f38499ddbcd118a2b59bf4108d62d42a, which is part of this PR.)

The pipeline artifact for (the manually deployed) recipe run 1486 is 4.39 MB (I've removed the other x tick labels for clarity):

[Image: Screen Shot 2023-01-20 at 2 08 32 PM (plot of pipeline artifact sizes)]

Though I can't say why the pipeline artifact is so large for this recipe, its size may be a clue as to why this particular recipe is causing worker timeout / OOM conditions, which I've also just documented a bit further in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/220#issuecomment-1399072193.
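
For a sense of scale, here is a quick arithmetic check using only the figures quoted in this comment (4.39 MB for this recipe, against a typical range of 0.15-0.30 MB):

```python
# Sizes in MB, taken from the figures quoted above.
artifact_mb = 4.39
typical_low_mb, typical_high_mb = 0.15, 0.30

# Compare against both ends of the typical range.
ratio_vs_high = artifact_mb / typical_high_mb
ratio_vs_low = artifact_mb / typical_low_mb

print(f"{ratio_vs_high:.1f}x to {ratio_vs_low:.1f}x the typical artifact size")
```

In other words, this artifact is roughly 15 to 30 times the size of a typical one.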