pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipes for ISIMIP #158

Open larsbuntemeyer opened 2 years ago

larsbuntemeyer commented 2 years ago

Source Dataset

ISIMIP provides CMIP6 bias adjusted datasets.

Quote:

The Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) offers a framework for consistently projecting the impacts of climate change across affected sectors and spatial scales. An international network of climate-impact modellers contribute to a comprehensive and consistent picture of the world under different climate-change scenarios.

Transformation / Alignment / Merging

The files should be concatenated along the time dimension. The structure is similar to ESGF; in fact, the data used to be available on ESGF as well, but that is no longer the case.

Output Dataset

zarr output format.

chuckwondo commented 2 years ago

@larsbuntemeyer, I've started to take a stab at this, but have some questions about the required dimensions.

First, I want to understand how many recipes make sense for the data:

  1. In looking at https://data.isimip.org/, it appears that you would want at least 3 recipes: (a) climate forcing, (b) socioeconomic forcing, and (c) static geographic information. Are those indeed the only top-level collections?
  2. Within each of those 3, there are 4 simulation rounds. Do you want each of those simulations to be separate recipes, thus resulting in 3 x 4 = 12 recipes?

Further, in looking at the climate forcing data, it appears that the following dimensions might make sense:

  1. year range (e.g., 1901-1910)
  2. climate forcing (e.g., GSWP3-EWEMBI)
  3. climate variable (e.g., huss)

Please let me know if that makes sense, or if I'm off base (please know that I'm a noob to all of this, so I'm not familiar with all of the domain-specific terminology).

So far, not including "climate forcing" as a dimension (i.e., only year range and climate variable dimensions), I have this:

from typing import Tuple
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

def make_url(variable: str, year_range: Tuple[int, int]):
    start_year, end_year = year_range
    template = (
        "https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/"
        "GSWP3-EWEMBI/{variable}_gswp3-ewembi_{start_year}_{end_year}.nc4"
    )

    return template.format(
        variable=variable,
        start_year=start_year,
        end_year=end_year
    )

year_ranges = (
    *((start_year, start_year + 9) for start_year in range(1901, 2011, 10)),
    (2011, 2016)
)
variables = (
    "huss",
    "pr",
    "ps",
    "rhs",
    "rlds",
    "rsds",
    "tas",
    "tasmax",
    "tasmin",
    "wind",
)

pattern = FilePattern(
    make_url,
    MergeDim(name="variable", keys=variables),
    ConcatDim(name="year_range", keys=year_ranges, nitems_per_file=10),
)

The URLs of the items generated by the pattern look like this:

https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/huss_gswp3-ewembi_1901_1910.nc4
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/huss_gswp3-ewembi_1911_1920.nc4
...
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/huss_gswp3-ewembi_2011_2016.nc4
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/pr_gswp3-ewembi_1901_1910.nc4
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/pr_gswp3-ewembi_1911_1920.nc4
...
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/wind_gswp3-ewembi_2001_2010.nc4
https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/GSWP3-EWEMBI/wind_gswp3-ewembi_2011_2016.nc4
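
A plain-Python sketch of the same enumeration can be handy for checking the generated names before building the pattern, since adjacent string literals in Python concatenate silently when a comma is missing. It does not use pangeo-forge at all; `list_urls` is a hypothetical helper introduced only for this illustration, and the expected count assumes 10 variables and 12 year ranges as listed above.

```python
from itertools import product

# Same URL template as in make_url above.
TEMPLATE = (
    "https://files.isimip.org/ISIMIP2a/InputData/climate_co2/climate/HistObs/"
    "GSWP3-EWEMBI/{variable}_gswp3-ewembi_{start_year}_{end_year}.nc4"
)

# Eleven 10-year ranges (1901-1910 ... 2001-2010) plus the shorter final range.
year_ranges = [(start, start + 9) for start in range(1901, 2011, 10)] + [(2011, 2016)]
variables = ["huss", "pr", "ps", "rhs", "rlds", "rsds", "tas", "tasmax", "tasmin", "wind"]


def list_urls(variables, year_ranges):
    """Hypothetical helper: expand every (variable, year range) pair into a URL."""
    return [
        TEMPLATE.format(variable=variable, start_year=start, end_year=end)
        for variable, (start, end) in product(variables, year_ranges)
    ]


urls = list_urls(variables, year_ranges)

# A silently merged name such as "tasminwind" would make this count come out wrong.
assert len(urls) == 10 * 12, f"expected 120 URLs, got {len(urls)}"

print(urls[0])
print(urls[-1])
```

Comparing the count against the number of variables and year ranges you expect is a cheap check to run before handing the keys to `FilePattern`.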

I've managed to bend Ryan Abernathey's (@rabernat) ear about this during the ESIP Summer 2022 meeting, so I'm looking to get some traction while it's still fresh in my mind.

chuckwondo commented 2 years ago

@rabernat, each file contains 10 years of data, except for the last file in each group, which contains only 6 years (2011-2016). Does specifying nitems_per_file=10 cause a problem for the files that don't span 10 years?
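
For context on this question: as far as I understand the `ConcatDim` documentation, `nitems_per_file` counts items along the concat dimension (time steps), not years, and it is optional. The sketch below estimates how many time steps each file would hold, under the assumption (not stated in this thread) that the files contain daily data on a standard Gregorian calendar. `days_in_file` is a hypothetical helper for illustration only.

```python
from datetime import date


def days_in_file(start_year: int, end_year: int) -> int:
    """Hypothetical helper: number of days from Jan 1 of start_year
    through Dec 31 of end_year, assuming a standard Gregorian calendar."""
    return (date(end_year + 1, 1, 1) - date(start_year, 1, 1)).days


# Same year ranges as in the pattern above.
year_ranges = [(start, start + 9) for start in range(1901, 2011, 10)] + [(2011, 2016)]

# Assumed daily time steps per file, keyed by year range.
items_per_file = {yr: days_in_file(*yr) for yr in year_ranges}

for (start, end), n_days in items_per_file.items():
    print(f"{start}-{end}: {n_days} daily time steps")

# Leap years make the counts differ even among the 10-year files.
print(sorted(set(items_per_file.values())))
```

If these assumptions hold, no single integer describes every file, so leaving `nitems_per_file` unset may be the safer option; that reading of the API should be confirmed against the pangeo-forge-recipes documentation.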

larsbuntemeyer commented 1 year ago

@chuckwondo, thanks for picking this up so quickly. I haven't really been able to wrap my head around those questions yet. I just wanted to drop the recipe idea here since we have some PhD things coming up and I wanted to avoid everybody starting to download those datasets, urgh... :weary: As you mentioned, it surely makes sense to split the recipes up, at least by simulation round. For more details, I first need to gain more experience with those datasets...

I think building up URLs from dataset attributes makes total sense for ISIMIP, but I also wanted to draw attention to the search API, which might give some more control.