jbusecke / pangeo-forge-esgf

Using queries to the ESGF API to generate URLs and keyword arguments for recipe generation in pangeo-forge
Apache License 2.0

Refactor include replicas #1

Closed jbusecke closed 2 years ago

jbusecke commented 2 years ago

This PR implements another major refactor.

The main feature is that I now consider both replicas and original datasets, which requires a bunch of new filtering logic.

On the upside, this will prefer datasets from preferred data nodes, which leads to fewer errors during recipe creation and generally better performance (less time spent waiting for inactive nodes to time out, etc.).
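As a rough illustration of the idea of preferring certain data nodes when both originals and replicas are candidates, here is a minimal sketch. It is not the filtering logic implemented in this PR: the record layout (`instance_id`, `data_node`, `replica`), the helper names, and the node names in `PREFERRED_NODES` are all hypothetical placeholders chosen for the example.

```python
from typing import Dict, List

# Hypothetical preference order: earlier entries are preferred.
# These node names are placeholders, not the list used in this PR.
PREFERRED_NODES = ["node-a.example.org", "node-b.example.org"]


def node_rank(data_node: str) -> int:
    """Return the position in PREFERRED_NODES; unlisted nodes rank last."""
    try:
        return PREFERRED_NODES.index(data_node)
    except ValueError:
        return len(PREFERRED_NODES)


def sort_key(rec: dict) -> tuple:
    """Lower is better: preferred node first, then original before replica."""
    return (node_rank(rec["data_node"]), rec["replica"])


def pick_preferred(records: List[dict]) -> Dict[str, dict]:
    """For each instance id, keep the record with the best sort key."""
    best: Dict[str, dict] = {}
    for rec in records:
        iid = rec["instance_id"]
        if iid not in best or sort_key(rec) < sort_key(best[iid]):
            best[iid] = rec
    return best


# Made-up records: each dataset is available from two places.
records = [
    {"instance_id": "A.v1", "data_node": "node-z.example.org", "replica": False},
    {"instance_id": "A.v1", "data_node": "node-a.example.org", "replica": True},
    {"instance_id": "B.v1", "data_node": "node-b.example.org", "replica": True},
    {"instance_id": "B.v1", "data_node": "node-b.example.org", "replica": False},
]
chosen = pick_preferred(records)
```

In this sketch a replica on a preferred node wins over an original on an unlisted node, and the original only wins when both candidates sit on equally ranked nodes. Whether that tie-breaking matches the real implementation is an assumption.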

I will test this on Pangeo Cloud and merge once that succeeds.

jbusecke commented 2 years ago

I ran an example set of iids with this:

# List of instance ids (iids) to process
# Note that the version is ignored for now (the latest is always chosen). TODO: See if we can make this specific to the version
import asyncio

from pangeo_forge_esgf import generate_recipe_inputs_from_iids

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

iids = [
    'CMIP6.PMIP.MIROC.MIROC-ES2L.past1000.r1i1p1f2.Amon.tas.gn.v20200318',
    'CMIP6.PMIP.MRI.MRI-ESM2-0.past1000.r1i1p1f1.Amon.tas.gn.v20200120',
    'CMIP6.PMIP.MPI-M.MPI-ESM1-2-LR.past2k.r1i1p1f1.Amon.tas.gn.v20210714',
    'CMIP6.CMIP.NCC.NorESM2-LM.historical.r1i1p1f1.Omon.vmo.gr.v20190815',
    'CMIP6.PMIP.MIROC.MIROC-ES2L.past1000.r1i1p1f2.Amon.tas.gn.v20200318',
    'CMIP6.PMIP.MRI.MRI-ESM2-0.past1000.r1i1p1f1.Amon.tas.gn.v20200120',
    'CMIP6.PMIP.MPI-M.MPI-ESM1-2-LR.past2k.r1i1p1f1.Amon.tas.gn.v20210714',
    'CMIP6.CMIP.FIO-QLNM.FIO-ESM-2-0.piControl.r1i1p1f1.Omon.vsf.gn',# this one should not be available. This changes daily. Check the data nodes which are down to find examples.
]

# Top-level `await` only works in a notebook/IPython session. In a plain script, use:
# recipe_inputs = asyncio.run(generate_recipe_inputs_from_iids(iids))
recipe_inputs = await generate_recipe_inputs_from_iids(iids)

recipes = {}

for iid, recipe_input in recipe_inputs.items():
    urls = recipe_input.get("urls", None)
    pattern_kwargs = recipe_input.get("pattern_kwargs", {})
    recipe_kwargs = recipe_input.get("recipe_kwargs", {})

    # Only build a pattern and recipe when urls were found; iids without urls
    # are reported as failed below.
    if urls is not None:
        pattern = pattern_from_file_sequence(urls, "time", **pattern_kwargs)
        recipes[iid] = XarrayZarrRecipe(
            pattern, xarray_concat_kwargs={"join": "exact"}, **recipe_kwargs
        )
print('+++Failed iids+++')
print(list(set(iids)-set(recipes.keys())))
print('+++Successful iids+++')
print(list(recipes.keys()))

and this

from pangeo_forge_recipes.recipes import setup_logging
setup_logging('DEBUG')
for iid, recipe in recipes.items():
    print('\n\n\n|||||||||||| ', iid)
    # Run a pruned (reduced-size) copy of each recipe as a quick test
    recipe.copy_pruned().to_function()()

and this seems to work really nicely! I'll merge this and issue a new bugfix version (in the hope that these changes make it onto the pangeo-forge docker image, cc @cisaacstern).