pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Example pipeline for gridMET #4

Open jhamman opened 4 years ago

jhamman commented 4 years ago
## Source Dataset

gridMET is a dataset of 4 km daily surface meteorological data covering the CONUS domain from 1979 to yesterday.

## Transformation / Alignment / Merging

Files should be concatenated along the time dimension and merged along the variable dimension.
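A minimal sketch of that layout, assuming the source provides one NetCDF file per variable per year. The URL template, the variable short names, and the helper names `build_file_grid` and `group_by_variable` are assumptions for illustration and should be checked against the gridMET download page.

```python
from itertools import product

# Assumed URL template and variable subset; verify both against the provider.
URL_TEMPLATE = "https://www.northwestknowledge.net/metdata/data/{variable}_{year}.nc"
VARIABLES = ["pr", "tmmn", "tmmx"]  # illustrative subset, not the full list


def build_file_grid(variables, years, template=URL_TEMPLATE):
    """Map each (variable, year) pair to a source URL.

    Years index the concat dimension (time); variables index the merge dimension.
    """
    return {
        (variable, year): template.format(variable=variable, year=year)
        for variable, year in product(variables, years)
    }


def group_by_variable(grid):
    """Group URLs per variable, ordered by year, ready to concatenate along time."""
    grouped = {}
    for (variable, _year), url in sorted(grid.items()):
        grouped.setdefault(variable, []).append(url)
    return grouped


grid = build_file_grid(VARIABLES, range(1979, 1982))
per_variable = group_by_variable(grid)
```

Each list in `per_variable` would be concatenated along time first, and the resulting per-variable datasets would then be merged into a single dataset.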

## Output Dataset

1 Zarr store - chunks oriented for both time series and spatial analysis.
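One way to reason about chunks that serve both access patterns is to count how many chunks each kind of read touches. The sketch below is only arithmetic, not a recommendation: the grid shape (585 x 1386), the 40-year record length, the float32 item size, and the candidate chunk shapes are all assumptions.

```python
from math import ceil, prod

# Assumed array shape (time, lat, lon) and item size; check against the source files.
ARRAY_SHAPE = (365 * 40, 585, 1386)
ITEMSIZE = 4  # bytes per value, assuming float32


def chunk_megabytes(chunk_shape, itemsize=ITEMSIZE):
    """Uncompressed size of one chunk in MB (10**6 bytes)."""
    return prod(chunk_shape) * itemsize / 1e6


def chunks_touched(chunk_shape, selection_shape):
    """Number of chunks read by a chunk-aligned selection of the given shape."""
    return prod(ceil(s / c) for s, c in zip(selection_shape, chunk_shape))


# Hypothetical candidates: one map per chunk versus a balanced compromise.
candidates = {
    "spatial": (1, 585, 1386),
    "balanced": (365, 117, 126),
}

n_time, n_lat, n_lon = ARRAY_SHAPE
report = {
    name: {
        "chunk_mb": chunk_megabytes(shape),
        "pixel_time_series": chunks_touched(shape, (n_time, 1, 1)),
        "single_day_map": chunks_touched(shape, (1, n_lat, n_lon)),
    }
    for name, shape in candidates.items()
}
```

Under these assumptions the one-map-per-chunk layout reads a single chunk for a daily map but one chunk per day for a pixel time series, while the balanced layout keeps both counts in the tens.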

rabernat commented 3 years ago

This recipe requires us to resolve https://github.com/pangeo-forge/pangeo-forge/issues/50 and https://github.com/pangeo-forge/pangeo-forge/issues/39.

rabernat commented 3 years ago

This recipe is now ready to be implemented.

sharkinsspatial commented 3 years ago

I just began researching a gridMET recipe targeting the MS Planetary Computer bakery. Reviewing the Terraclimate example raised a few questions for me concerning pre-processing. Forgive my lack of experience in this area (I am decidedly not a scientist), but is there some documentation or literature that outlines the thresholds for invalid data used in the mask_opts? Is there an equivalent reference that should be used for gridMET pre-processing? cc @rabernat and @cisaacstern. Cheers.
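For readers unfamiliar with this kind of masking, here is a minimal sketch of threshold-based cleaning. The `MASK_OPTS` structure, the variable name, and the threshold values are hypothetical placeholders; they are not taken from the Terraclimate recipe or from gridMET documentation, which is exactly the gap the question is about.

```python
import math

# Hypothetical per-variable valid ranges. Placeholder values only:
# real thresholds must come from someone who knows the dataset.
MASK_OPTS = {
    "pr": {"valid_min": 0.0, "valid_max": 1000.0},
}


def mask_invalid(values, valid_min, valid_max):
    """Replace values outside [valid_min, valid_max] with NaN."""
    return [v if valid_min <= v <= valid_max else math.nan for v in values]


# A fill value such as 32767 and a negative precipitation value both get masked.
masked = mask_invalid([0.0, 5.0, -1.0, 32767.0], **MASK_OPTS["pr"])
```

The mechanics are simple; choosing the thresholds is the part that needs domain knowledge.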

rabernat commented 3 years ago

That recipe comes from @jhamman. I'll let him answer the specific question.

But the general answer would be the following: this sort of bespoke quality control is the very definition of domain-specific expertise! Only folks who know the data intimately can set those sorts of parameters. And this is precisely the point of Pangeo Forge: to engage those scientists and get them to share their expertise with the community to produce a common pool of ARCO data. You @sharkinsspatial as an engineer are not expected to know those details, just like the domain scientist is not expected to know the details of Kubernetes. The recipes will hopefully be full of these kinds of details, which will make the data we produce more useful at the end of the pipeline.

Make sense?

rabernat commented 3 years ago

> Is there an equivalent reference which should be used for gridMET pre-processing?

I did not answer this question in my previous comment... But basically you would need to get someone who really knows the data to answer it. Ideally, data coming from data providers would be totally clean and could be copied as-is, without extensive preprocessing / cleaning steps. But that is often not the case. I can't speak to gridMET specifically.

sharkinsspatial commented 3 years ago

Thanks @rabernat 👍 . I'll try to follow up with @jhamman and see if he can add any insights as well as attempting to reach out to the community of gridMET users and elicit some recommendations.

sharkinsspatial commented 3 years ago

@jhamman It looks like you have done the majority of the early work on Terraclimate recipes https://github.com/pangeo-forge/terraclimate-feedstock-archive/blob/master/recipe/pipeline.py 🙇 . I haven't done a deep investigation into the development methodology of gridMET and Terraclimate; I just quickly noted the overlap in variable types.

If possible, can you provide a bit of background on how you developed the pre- and post-processing techniques for Terraclimate cleaning? And if you also have experience working with the gridMET data, would you or @norlandrhagen have any interest in collaborating / assisting with recommending preprocessing approaches for that as we develop the recipe? Cheers.

norlandrhagen commented 3 years ago

Hey there @sharkinsspatial, definitely interested in working on the gridMET recipe. Can probably take a stab at it in the coming week.

sharkinsspatial commented 3 years ago

@norlandrhagen I have a rough example of a gridMET recipe I am testing which handles some of the file pattern irregularities. I'll try to make a PR to staged-recipes tomorrow so I can elicit some feedback. I still have some domain expert questions for @jhamman related to his masking functions from the Terraclimate example and if there are any related insights for the gridMET data.