pangeo-forge / gpcp-feedstock

A Pangeo Forge Feedstock for gpcp.
Apache License 2.0

Tracking first production deployment on dataflow #4

Closed cisaacstern closed 2 years ago

cisaacstern commented 2 years ago

Following the (probable) scheduler memory problem identified in #2, a test of this recipe on Dataflow deployed from #3 succeeded. A production run of the recipe is now running on Dataflow. For those with access to the Pangeo Forge GCP console, its progress can be tracked here.

Note that the recipe run linked from the in progress deployment on this feedstock's deployment page is actually incorrect. This link points to recipe run 58 in the production database, but because we are using the experimental Dataflow feature, the actual recipe run for this production deployment is in the staging database here: https://api-staging.pangeo-forge.org/recipe_runs/58.

cc @rabernat

cisaacstern commented 2 years ago

This production recipe succeeded (and in only 16min 22sec!) on Dataflow. 🎈 The dataset public url is available via https://api-staging.pangeo-forge.org/recipe_runs/58, but not listed on pangeo-forge.org, because this was done on the experimental Dataflow backend, which is set up to write recipe run metadata out to the staging API. Here's the dataset repr:

Note: An earlier version of this comment had the wrong dataset url below. It's now corrected.


```python
import xarray as xr

ds = xr.open_dataset(
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr",
    engine="zarr",
    chunks={},
)
print(ds)
```

> **Note**: An earlier version of this comment had the wrong repr! It's now corrected.

```
Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 9226)
Coordinates:
    lat_bounds   (latitude, nv) float32 dask.array
  * latitude     (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
    lon_bounds   (longitude, nv) float32 dask.array
  * longitude    (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
  * time         (time) datetime64[ns] 1996-10-01 1996-10-02 ... 2021-12-31
    time_bounds  (time, nv) datetime64[ns] dask.array
Dimensions without coordinates: nv
Data variables:
    precip       (time, latitude, longitude) float32 dask.array
Attributes: (12/45)
    Conventions:               CF-1.6, ACDD 1.3
    Metadata_Conventions:      CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
    acknowledgment:            This project was supported in part by a grant...
    cdm_data_type:             Grid
    cdr_program:               NOAA Climate Data Record Program for satellit...
    cdr_variable:              precipitation
    ...                        ...
    standard_name_vocabulary:  CF Standard Name Table (v41, 22 February 2017)
    summary:                   Global Precipitation Climatology Project (GPC...
    time_coverage_duration:    P1D
    time_coverage_end:         1996-10-01T23:59:59Z
    time_coverage_start:       1996-10-01T00:00:00Z
    title:                     Global Precipitation Climatatology Project (G...
```
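For readers who want to work with a dataset shaped like the one above without pulling from the OSN bucket, here is a minimal sketch that builds a small synthetic `xarray.Dataset` with the same dimension and variable names (`precip`, `time`, `latitude`, `longitude`) taken from the repr; the sizes are shrunk for illustration, and the random data is of course not real GPCP precipitation. Against the real zarr store, the same reduction stays lazy (dask) until `.compute()` is called.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in mirroring the gpcp.zarr structure shown in the repr above.
# Only 10 daily time steps here; the real store has 9226.
time = pd.date_range("1996-10-01", periods=10, freq="D")
ds = xr.Dataset(
    {
        "precip": (
            ("time", "latitude", "longitude"),
            np.random.default_rng(0)
            .random((10, 180, 360))
            .astype("float32"),
        )
    },
    coords={
        "time": time,
        "latitude": np.arange(-90.0, 90.0, 1.0, dtype="float32"),
        "longitude": np.arange(0.0, 360.0, 1.0, dtype="float32"),
    },
)

# Area-naive global mean precipitation per day (no cos(latitude) weighting).
daily_mean = ds["precip"].mean(dim=("latitude", "longitude"))
print(dict(daily_mean.sizes))
```

Swapping the synthetic `ds` for the `xr.open_dataset(..., engine="zarr", chunks={})` call above should yield the same reduction over the full 1996–2021 record.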
cisaacstern commented 2 years ago

@rabernat, following up on our offline chat:

  1. This successful production run was initially recorded in Pangeo Forge's staging API (because it was running on the staging/dev backend). I've manually copied that recipe run over into our production API, which means that the successfully completed dataset is now discoverable here: https://pangeo-forge.org/dashboard/feedstock/42
  2. As noted above, I discovered a bug in the url the registrar provides to the GitHub deployments API: it doesn't differentiate between the staging and production APIs. I've manually patched that problem for this recipe run, so the Deployed link for the successful production run on the following page now points to the right place: https://github.com/pangeo-forge/gpcp-feedstock/deployments/activity_log?environment=production

With these two changes, this successful production run on Dataflow should now be indistinguishable, from an outside user's perspective, from a recipe run on our default backend. I believe that wraps up the usefulness of this issue, so I will close it now.

P.S. - @alxmrs, this is an example of Dataflow already shining 🌞 in production for us, using your existing to_beam compiler!