pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

test out CMIP6 recipe with surface variables for SSP585 #47

Open dgergel opened 3 years ago

dgergel commented 3 years ago

Following up on our CMIP6-in-the-cloud collaboration meeting last week, I wanted to include some specs that it would be useful to test out the CMIP6 recipe with:

member_id: r1i1p1f1 if available, otherwise analogous ensemble member (e.g. r2i1p1f1)
experiment_id: ssp585
variable_id: tasmax, tasmin, pr
table_id: day
activity_id: ScenarioMIP

Models: all available with the above specs
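The member_id rule above ("r1i1p1f1 if available, otherwise an analogous member") could be sketched roughly as follows. This is only an illustration: the `SPECS` dict and the `choose_member` helper are hypothetical names, and "analogous" is assumed here to mean "same initialization/physics/forcing indices where possible, then the lowest realization number", which is one plausible reading rather than an agreed definition.

```python
import re

# Search facets taken from the specs listed above.
SPECS = {
    "activity_id": "ScenarioMIP",
    "experiment_id": "ssp585",
    "table_id": "day",
    "variable_id": ["tasmax", "tasmin", "pr"],
}

PREFERRED_MEMBER = "r1i1p1f1"
_MEMBER_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def choose_member(available, preferred=PREFERRED_MEMBER):
    """Return `preferred` if present, else the closest analogue, else None.

    Assumption: the closest analogue has the lowest (i, p, f) indices,
    with ties broken by the lowest realization index r.
    """
    if preferred in available:
        return preferred
    candidates = []
    for member in available:
        match = _MEMBER_RE.match(member)
        if match:
            r, i, p, f = (int(g) for g in match.groups())
            # Compare indices numerically so that r2 sorts before r10.
            candidates.append(((i, p, f, r), member))
    return min(candidates)[1] if candidates else None


print(choose_member(["r3i1p1f1", "r2i1p1f1", "r1i1p1f2"]))
```

A model that publishes only r2i1p1f1 and r1i1p1f2 would resolve to r2i1p1f1 under this ordering, matching the example given in the specs.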

cc @cisaacstern @naomi-henderson

cisaacstern commented 3 years ago

Thanks @dgergel! I'll take a look at this before our next meeting.

cisaacstern commented 3 years ago

@dgergel, as a follow-up to our conversation yesterday, noting here the inputs available on S3 which match your specified criteria. https://github.com/pangeo-forge/cmip6-pipeline/pull/18 includes the utility class CMIPS3Search, which I've used to retrieve these matches.

The 18 matches on S3 are collectively 82 GB in size:

variables = ["tasmax", "tasmin", "pr"]
datasets = [f".ssp585.r1i1p1f1.day.{v}." for v in variables]

ssp585 = CMIPS3Search(datasets, variables)
ssp585.print_sizes()
Input size details:

```
GFDL-CM4.ssp585.r1i1p1f1.day.tasmax.gr1: 3.73 GB, 5 source files.
GFDL-CM4.ssp585.r1i1p1f1.day.tasmax.gr2: 0.99 GB, 5 source files.
GFDL-ESM4.ssp585.r1i1p1f1.day.tasmax.gr1: 3.73 GB, 5 source files.
GFDL-CM4.ssp585.r1i1p1f1.day.tasmin.gr1: 3.76 GB, 5 source files.
GFDL-CM4.ssp585.r1i1p1f1.day.tasmin.gr2: 1.0 GB, 5 source files.
GFDL-ESM4.ssp585.r1i1p1f1.day.tasmin.gr1: 3.77 GB, 5 source files.
CanESM5.ssp585.r1i1p1f1.day.pr.gn: 0.93 GB, 1 source files.
EC-Earth3-Veg.ssp585.r1i1p1f1.day.pr.gr: 12.21 GB, 86 source files.
IPSL-CM6A-LR.ssp585.r1i1p1f1.day.pr.gr: 1.84 GB, 1 source files.
MPI-ESM1-2-LR.ssp585.r1i1p1f1.day.pr.gn: 1.6 GB, 5 source files.
MRI-ESM2-0.ssp585.r1i1p1f1.day.pr.gn: 17.98 GB, 6 source files.
CESM2-WACCM.ssp585.r1i1p1f1.day.pr.gn: 5.7 GB, 9 source files.
CESM2.ssp585.r1i1p1f1.day.pr.gn: 5.71 GB, 9 source files.
NorESM2-LM.ssp585.r1i1p1f1.day.pr.gn: 1.47 GB, 9 source files.
NorESM2-MM.ssp585.r1i1p1f1.day.pr.gn: 5.66 GB, 9 source files.
GFDL-CM4.ssp585.r1i1p1f1.day.pr.gr1: 5.3 GB, 5 source files.
GFDL-CM4.ssp585.r1i1p1f1.day.pr.gr2: 1.39 GB, 5 source files.
GFDL-ESM4.ssp585.r1i1p1f1.day.pr.gr1: 5.3 GB, 5 source files.
total_size: 82.05
```

The return_inputs method of CMIPS3Search instances returns a dictionary which maps each dataset's 6-tuple identifier to a list of its source URLs. CMIPS3Search instances also have a tuples attribute, which is a list of all the matching 6-tuple identifiers on S3:

inputs = ssp585.return_inputs()
inputs[ssp585.tuples[0]]
['s3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/tasmax/gr1/v20180701/tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20150101-20341231.nc',
 's3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/tasmax/gr1/v20180701/tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20350101-20541231.nc',
 's3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/tasmax/gr1/v20180701/tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20550101-20741231.nc',
 's3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/tasmax/gr1/v20180701/tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20750101-20941231.nc',
 's3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/tasmax/gr1/v20180701/tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20950101-21001231.nc']
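For readers without the PR checked out, the shape of that mapping can be approximated from the URLs alone. The sketch below is not the CMIPS3Search implementation; `identifier_from_url` and `group_urls` are hypothetical helpers, and the path layout (activity, institution, source, experiment, member, table, variable, grid, version, file) is assumed from the URLs shown above.

```python
from collections import defaultdict


def identifier_from_url(url):
    """Derive a 6-tuple identifier from an esgf-world style URL.

    Assumed layout after the "CMIP6" path segment:
    activity/institution/source/experiment/member/table/variable/grid/version/file
    """
    parts = url.split("/")
    start = parts.index("CMIP6")
    # Skip activity and institution; take the next six path segments.
    return tuple(parts[start + 3 : start + 9])


def group_urls(urls):
    """Map each 6-tuple identifier to a sorted list of its source URLs."""
    grouped = defaultdict(list)
    for url in urls:
        grouped[identifier_from_url(url)].append(url)
    return {key: sorted(value) for key, value in grouped.items()}


base = (
    "s3://esgf-world/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/"
    "r1i1p1f1/day/tasmax/gr1/v20180701/"
)
urls = [
    base + "tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20350101-20541231.nc",
    base + "tasmax_day_GFDL-CM4_ssp585_r1i1p1f1_gr1_20150101-20341231.nc",
]
inputs = group_urls(urls)
print(list(inputs))
```

Sorting the URLs within each group keeps the files in chronological order, since the date range is the only part of the filename that varies within a dataset.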

Once you link to your code for crawling the full ESGF catalog, I can incorporate those inputs into this evolving recipe as well.

dgergel commented 3 years ago

@cisaacstern this looks great. I just created a PR in pangeo-forge/cmip6-pipeline with my refactored code. I haven't worked on this since winter, so some of it may be a bit out of date. I noted this in the PR as well, and I'm hoping @naomi-henderson can look it over to see whether any functions should be deprecated and replaced with newer ones.

I'm envisioning that the functions under cmip6-cloud/esgf.py could be used to create a similar CMIPESGFSearch utility class, perhaps with flags for skipping an ESGF node that is known to be down. Or perhaps there could be a "priority" node, with the other nodes attempted only if that one is down. Naomi might have thoughts on this too.
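To make the "priority node with fallback" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `search_with_fallback`, the `known_down` argument, the placeholder node names, and the `fake_search` stand-in are illustrations only, and a real CMIPESGFSearch would need to wrap whatever search function esgf.py ends up exposing.

```python
def search_with_fallback(nodes, search, known_down=()):
    """Try nodes in priority order and return (node, results) from the first success.

    `nodes` is ordered with the priority node first. Nodes listed in
    `known_down` are skipped without being contacted. `search` is any
    callable that takes a node name and raises OSError (for example
    ConnectionError or TimeoutError) when the node is unreachable.
    """
    failed = []
    for node in nodes:
        if node in known_down:
            continue
        try:
            return node, search(node)
        except OSError:
            failed.append(node)
    raise RuntimeError(
        f"no ESGF node responded (skipped: {sorted(known_down)}, failed: {failed})"
    )


# Stand-in for a real ESGF query, used only to exercise the fallback logic.
def fake_search(node):
    if node == "priority-node.example":
        raise ConnectionError("node unreachable")
    return [f"{node}/dataset-1"]


NODES = ["priority-node.example", "backup-node-1.example", "backup-node-2.example"]
node, results = search_with_fallback(NODES, fake_search)
print(node, results)
```

Passing the search function in as an argument keeps the fallback logic independent of how any individual node is queried, so the two suggestions (flags for down nodes, and a priority ordering) can coexist in one place.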