pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0
126 stars 54 forks source link

MUR SST Kerchunk recipe #647

Open abarciauskas-bgse opened 1 year ago

sbquinlan commented 12 months ago

This issue should probably be on pangeo-forge/staged-recipes, but I've been working on a new branch this week for what I thought would be a relatively straightforward change to the existing MUR SST recipe branch and found that the lat and lon dimensions of the source dataset aren't working with Kerchunk.

Current recipe:

identical_dims = ['lat', 'lon']
selected_vars = ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
recipe = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(
        remote_protocol='s3' if selected_rel == S3_REL else 'https',
        file_type=pattern.file_type,
        # inline_threshold=10000,
        storage_options=pattern.fsspec_open_kwargs
    )
    | FilterVars(
        keep={*pattern.concat_dims, *identical_dims, *selected_vars}
    )
    | WriteCombinedReference(
        store_name='mursst',
        concat_dims=pattern.concat_dims,
        identical_dims=identical_dims,
    )
)

The key line here is inline_threshold=10000 I found that the lat and lon dataarrays are about 5kb. So if I pick an inline_threshold larger than that, kerchunk copies the dataarrays into the references.

Without inline:

>>> ds
<xarray.Dataset>
Dimensions:           (time: 2, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 nan nan nan nan nan ... nan nan nan nan nan
  * lon               (lon) float32 nan nan nan nan nan ... nan nan nan nan nan
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 2002-06-02T09...
Data variables:
    analysed_sst      (time, lat, lon) float32 ...
    analysis_error    (time, lat, lon) float32 ...
    mask              (time, lat, lon) float32 ...
    sea_ice_fraction  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:                CF-1.5
    Metadata_Conventions:       Unidata Observation Dataset v1.0
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Reolution"
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20020602T210000Z
    time_coverage_start:        20020601T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180.0

With inline:

>>> ds
<xarray.Dataset>
Dimensions:           (time: 2, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 2002-06-02T09...
Data variables:
    analysed_sst      (time, lat, lon) float32 ...
    analysis_error    (time, lat, lon) float32 ...
    mask              (time, lat, lon) float32 ...
    sea_ice_fraction  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:                CF-1.5
    Metadata_Conventions:       Unidata Observation Dataset v1.0
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Reolution"
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20020601T210000Z
    time_coverage_start:        20020531T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180.0

I'm not entirely sure what the best practice is to handle this scenario or how to properly detect that identical_dims are not in the desired format, but I'll ask around.

Separately, the resulting metadata isn't consolidated which I also don't entirely understand and maybe related. I'm working through this and hope to resolve this next week.

sbquinlan commented 12 months ago

I was doing it wrong, forgot that I needed to forward along the storage_options to remote_options when creating a reference fs.

sbquinlan commented 12 months ago

https://github.com/pangeo-forge/staged-recipes/pull/259

abarciauskas-bgse commented 10 months ago

I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b

A few questions / next steps:

ranchodeluxe commented 10 months ago

I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b

Nice 🥳 Can you run it locally full on without --prune? Have you tried that? That's probably the next step to proceed the ones below. But I imagine we can make a new PR with the changes from the gist

A few questions / next steps:

* [ ]  I see that a number of transforms are defined in https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/mursst/recipe.py. Should those be added to pangeo-forge-recipes transforms?

I think all these custom ones have already been merged or are about to by Raphael. Let me review what needs to change and comment/suggest on your new PR when you have it up

* [ ]  I think we need to convert to using a file pattern rather than relying on GranuleQuery, although I'm not sure it's absolutely necessary - @ranchodeluxe have you experienced failures due to using GranuleQuery rather than a file pattern?

I think forcing the lookup to happen on the workers instead of the host is ideal yes

* [ ]  Run on https://github.com/NASA-IMPACT/veda-pforge-job-runner, perhaps replacing or updating [Failing: MUR SST NASA-IMPACT/veda-pforge-job-runner#15](https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/15) and the linked code

Yeah, let me run this later today as a prune=True and prune=False to see what happens