pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Support crawling of existing data catalogs and automatic generation of FilePatterns #410

Open rabernat opened 2 years ago

rabernat commented 2 years ago

There are many existing data catalogs out there. We currently require users to create a FilePattern from either a list of URLs or a formatting function plus a set of keys. However, if the data are already in a catalog, these steps should be unnecessary. Instead, we should be able to generate a file pattern directly from a simple query (e.g. dataset_id="NOAA_GPCP", version=3.0, etc.).

Examples of catalog formats we might want to crawl are:

Here are some different ways we could achieve this:

Bespoke code in each recipe

This is possible today. You can write code to crawl anything you want, build a list of files, and then call pattern_from_file_sequence. This is what I do in the GPCP recipe.

Pros: simple and flexible.
Cons: hard to scale, lots of redundant code, only supports 1D FilePatterns.
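
For illustration, a minimal sketch of this bespoke approach (the URL template below is made up, not the real GPCP endpoint):

from pangeo_forge_recipes.patterns import pattern_from_file_sequence

# crawl or template a list of file URLs however you like...
urls = [f"https://example.org/gpcp/v3.0/gpcp_v3.0_{year}.nc" for year in range(1983, 2021)]
# ...then wrap them into a 1D FilePattern along a single concat dimension
pattern = pattern_from_file_sequence(urls, concat_dim="time")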

Functions within pangeo forge recipes package

We could imagine creating some class methods on FilePattern that enable code like this

pattern = FilePattern.from_CMR(**query)

Pros: tightly integrated with pangeo-forge.
Cons: potentially grows the scope of pangeo-forge-recipes a lot, with lots of messy, format-specific code.

Plugin architecture

Or instead, we could use some sort of plugin architecture that allows third-party packages to provide file-pattern constructors. Then the logic for each weird catalog format could live in a standalone repo, maintained by people who understand that format, while still integrating tightly with pangeo-forge.

Some different plugin approaches we could use

Pros: clean separation of custom logic into separate repos; supports the creation of private, org-specific plugins.
Cons: more complex software engineering, potential challenges with testing.
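
One lightweight option, sketched below, is standard Python entry points; the entry-point group name and function are hypothetical, and nothing like this exists in pangeo-forge-recipes today:

from importlib.metadata import entry_points

def load_pattern_plugins():
    """Discover FilePattern constructors that third-party packages have registered
    under a hypothetical 'pangeo_forge_recipes.file_patterns' entry-point group."""
    constructors = {}
    # selection-by-group API requires Python 3.10+ (or the importlib_metadata backport)
    for ep in entry_points(group="pangeo_forge_recipes.file_patterns"):
        constructors[ep.name] = ep.load()  # e.g. {"cmr": <callable>, "esgf": <callable>}
    return constructors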

cc @briannapagan, whose work with NASA CMR inspired this idea

martindurant commented 2 years ago

I know of a package that implements a plugin system designed to make various catalogue providers appear under a consistent API. It even has a plugin system for catalogue types and individual entries.

https://intake.readthedocs.io/en/latest/
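
For reference, a minimal Intake usage sketch (the catalog URL and entry name are placeholders):

import intake

# a YAML catalog presents heterogeneous sources behind one API
cat = intake.open_catalog("https://example.org/catalog.yaml")
print(list(cat))                  # names of the catalog entries
ds = cat["some_entry"].to_dask()  # each entry's driver handles the format-specific loading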

By the way, as previously trailed (and still unofficial), Anaconda is finally getting behind Intake and will be using its spec as the basis for dataset/catalogue exchange on anaconda.cloud. This work is scheduled for Q4.

rabernat commented 2 years ago

Yes of course intake, thanks for the reminder.

jbusecke commented 2 years ago

@cisaacstern encouraged me to chime in on this issue. I think that a use case I am particularly interested in might also fit into the scope of a plugin.

I have been working for a while now on migrating the Pangeo CMIP6 cloud holdings to a less labor-intensive workflow using pangeo-forge.

The basic idea is to generate a large dictionary of recipes, one for each dataset (each itself possibly combined from several files).

The challenges for these particular datasets are twofold:

  1. I need to extract a set of URLs from the ESGF API given a unique identifier (instance_id). This seems very similar to what is described above (a rough sketch follows after this list).
  2. I need to dynamically determine certain keyword arguments, like the number of output chunks (aiming to maintain a similar chunk size across datasets that may have vastly different resolution), and detect the netCDF version of certain files.
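
A minimal sketch of 1., assuming the public ESGF search REST API; the node URL, the trailing wildcard on instance_id, and the link filtering are assumptions here, and pangeo-forge-esgf handles all of this more carefully:

import requests

ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"

def esgf_urls(instance_id):
    """Return HTTP download URLs for the files that make up one dataset."""
    params = {
        "type": "File",
        "instance_id": f"{instance_id}*",  # file ids begin with the dataset's instance_id
        "format": "application/solr+json",
        "limit": 10000,
    }
    docs = requests.get(ESGF_SEARCH, params=params).json()["response"]["docs"]
    # each file's "url" field holds strings like "<url>|<mime type>|HTTPServer"
    return [u.split("|")[0] for d in docs for u in d["url"] if u.endswith("HTTPServer")]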

I have some initial solutions for both of these issues implemented as 'a ton of extra logic' in this feedstock, but as mentioned above this is somewhat cumbersome to maintain.

Given the scale of the CMIP6 archive it seems likely that we will eventually have to split it into several feedstocks. Having custom code duplicated across many feedstocks/recipes is not ideal.

I have started to refactor some of the logic out into a stand-alone package pangeo-forge-esgf, but this external dependency currently blocks execution on pangeo-forge cloud.

I think that 1. above could be a very nice test case for a plug-in architecture?

But even beyond that, case 2. might be another slightly different and interesting use case. I am currently deriving all of the keyword arguments based on many range requests and imprecise size estimates before creating the recipe. As discussed here, this could actually be done much more precisely and quickly once the data has been cached. So I guess my question ultimately is whether the proposed plugin structure could be general enough to 'attach' during different stages of the recipe.

I am very keen to help anywhere I can to drive this effort forward, since it seems it might unblock my CMIP6 efforts along the way.

cisaacstern commented 2 years ago

A few notes re: plugins from my chat with @jbusecke this morning.

For generating patterns based on ESGF queries, we thought it would be nice to be able to construct FilePatterns something like this:

from pangeo_forge_recipes.patterns import FilePattern

esgf_instance_id_with_wildcards = "CMIP6.PMIP.*.*.lgm.*.*.uo.*.*"
pattern = FilePattern(esgf_instance_id_with_wildcards, plugin="esgf")

...so the ESGF plugin overloads FilePattern with its own plugin-specific signature, to allow construction of a pattern as is currently implemented in https://github.com/jbusecke/pangeo-forge-esgf.

Then, following on Julius's mention of plugin-specific recipe kwargs, it would be great to be able to do something like:

# recipe `plugin` could be passed explicitly, or inferred from `pattern.plugin`
recipe = XarrayZarrRecipe(pattern, plugin="esgf")

At the XarrayZarrRecipe level, we imagined the plugin could potentially overwrite stages of the default recipe pipeline with plugin-specific stages. With default transforms referenced from https://github.com/pangeo-forge/pangeo-forge-recipes/issues/376, in pseudocode:

from copy import deepcopy
from dataclasses import dataclass
from typing import Optional

from pangeo_forge_recipes.plugins import registered_plugins

default_transforms = {
    "open_with_fsspec": OpenWithFSSpec,
    "open_with_xarray": OpenWithXarray,
    "infer_xarray_schema": InferXarraySchema,
    "prepare_zarr_target": PrepareZarrTarget,
    ...
}

@dataclass
class XarrayZarrRecipe:

    file_pattern_source: FilePatternSource
    plugin: Optional[str] = None

    def __post_init__(self):
        if self.plugin and self.plugin not in registered_plugins:
            raise ValueError(f"Plugin '{self.plugin}' specified but not installed")

    def to_beam(self):
        transforms = deepcopy(default_transforms)
        if self.plugin:
            # `registered_plugins[self.plugin]` would be a dict in which the plugin optionally
            # defines overrides for any of the default transforms. here, we apply any overrides
            # the plugin has defined.
            overrides = registered_plugins[self.plugin]
            transforms = {k: overrides.get(k, v) for k, v in transforms.items()}
        chained_transform = (
            self.file_pattern_source
            | transforms["open_with_fsspec"]
            | transforms["open_with_xarray"]
            | transforms["infer_xarray_schema"]
            | transforms["prepare_zarr_target"]
            ...
        )
        return chained_transform

cisaacstern commented 2 years ago

Also cc'ing @yuvipanda & @sharkinsspatial, who have interest + expertise here and it looks like haven't been tagged yet.

cisaacstern commented 2 years ago

Functions within pangeo forge recipes package

We could imagine creating some class methods on FilePattern that enable code like this

pattern = FilePattern.from_CMR(**query)

IMO this class method approach has a nicer UI than overloading FilePattern (as I suggested above). I agree that it's impractical to maintain these methods in pangeo-forge-recipes, but I believe it's possible to have a plugin register them on the class.
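
For example, a minimal sketch of how a plugin could register such a constructor at import time; from_CMR and the crawler below are hypothetical, and only pattern_from_file_sequence exists in pangeo-forge-recipes today:

from pangeo_forge_recipes.patterns import FilePattern, pattern_from_file_sequence

def _from_cmr(cls, **query):
    """Hypothetical constructor: query NASA CMR and build a 1D pattern from the results."""
    urls = _crawl_cmr(**query)  # the plugin's own CMR crawler
    return pattern_from_file_sequence(urls, concat_dim="time")

def _crawl_cmr(**query):
    # placeholder for the plugin's real CMR crawling logic
    raise NotImplementedError

# the registration itself, done when the plugin package is imported (or via an entry point)
FilePattern.from_CMR = classmethod(_from_cmr)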

yuvipanda commented 2 years ago

@cisaacstern, @briannapagan, @jbusecke, and I had a quick call today to discuss this. We decided on a very specific solution to a specific problem here. I'm going to use CMR as the example, but this should apply to other catalogs too.

Someone writing a recipe for a dataset that is coming out of CMR should be able to use their existing mental model of how CMR works, and use just that to write the recipe. The easiest way to do that is to make a package like pangeo-forge-recipes-cmr that lets users specify CMR-related properties in their recipe.py file, and have that package be a wrapper around pangeo-forge-recipes so that it produces a pangeo_forge_recipes recipe at the end. For example, the recipe.py file could look like this:

from pangeo_forge_recipes_cmr import CMRRecipe

recipe = CMRRecipe(short_name="GPM_3IMERGHHL") # pass additional params here if needed

And it's the responsibility of the CMRRecipe object to translate this and make sure it actually provides a pangeo_forge_recipes Recipe object (a rough sketch of what that could look like follows below).
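
For concreteness, a minimal sketch of what such a wrapper might do internally; CMRRecipe and pangeo-forge-recipes-cmr don't exist yet, and the CMR query parameters and link filtering here are illustrative:

import requests
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def CMRRecipe(short_name, **recipe_kwargs):
    """Query CMR for a collection's granules and return an ordinary pangeo-forge recipe."""
    resp = requests.get(
        "https://cmr.earthdata.nasa.gov/search/granules.json",
        params={"short_name": short_name, "page_size": 2000},
    )
    entries = resp.json()["feed"]["entry"]
    # keep direct-download netCDF links; real granule metadata may need more careful filtering
    urls = sorted(
        link["href"]
        for e in entries
        for link in e.get("links", [])
        if link["href"].startswith("https") and link["href"].endswith(".nc")
    )
    pattern = pattern_from_file_sequence(urls, concat_dim="time")
    # the wrapper's only job: hand back an ordinary pangeo_forge_recipes recipe object
    return XarrayZarrRecipe(pattern, **recipe_kwargs)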

This has several advantages:

  1. It is super simple, as it's a traditional wrapper library.
  2. It requires zero changes to pangeo_forge_recipes.
  3. It means the API of the wrapper package can be tuned specifically to match what that catalog's API needs, without complicating the pangeo_forge_recipes code.

I think there was general agreement that we needed some sort of plugin API as well, but this would already cover a lot of use cases with minimal fuss in a long-term sustainable way.

The only feature really missing here is the ability to install arbitrary packages for use by recipe.py. Thanks to pangeo-forge-runner, we can already have multiple .py files in feedstock/: recipe.py is now executed as a normal Python file, so you can have additional Python files there and import things from them. We'll need to add functionality to pangeo-forge-runner to allow installation of arbitrary extra packages only for recipe parse time (not execution time, as that's a lot more complex). I think this is a useful feature we can easily add.

With the end-of-September demo in mind, the next action items we decided on are:

  1. Figure out how to give pangeo-forge-recipes a list of files rather than a pattern right now (@briannapagan)
  2. Write docs on how you can test a recipe locally exactly in the same way it'll be run (@yuvipanda)
  3. Figure out the API for use in recipe.py that would be nicely demoable to an audience of people who know CMR but not pangeo-forge (@briannapagan @yuvipanda)
  4. Add feature to pangeo-forge-runner to install arbitrary packages, and provide an allow_list so we actually restrict them for now (until more isolation features land in the orchestrator) (@yuvipanda)

@briannapagan and I have a meeting scheduled for Monday at 2pm Pacific to move forward here.

@jbusecke @cisaacstern what can we do re: CMIP6 here?

I'm also sure I missed some points of the discussion, others feel free to chime in.

yuvipanda commented 2 years ago

I just want to reiterate that we haven't discounted any plugin systems; it's just that the one feature we need for plugins (arbitrary extra packages at parse time) already unlocks something that will solve many use cases (regular wrapper libraries), so we're pursuing that first.

cisaacstern commented 2 years ago

Agree 💯 with this path @yuvipanda, thanks for proposing it, and summarizing it so clearly.

Re: cmip6 use cases, this wrapper approach will be plug-and-play with https://github.com/jbusecke/pangeo-forge-esgf. 👍

Once the beam refactor is merged, this would even allow us to start experimenting with the sort of custom pipeline definitions I was brainstorming about in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/410#issuecomment-1242261173: the wrapper package could simply compose those custom pipelines itself.

Looking forward to seeing this in action! Please let me know if/how/when I can help.