pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Support for COG outputs/provide functionality for conversion to COG #63

Open ciaransweet opened 3 years ago

ciaransweet commented 3 years ago

We have a few use cases where it would be useful to have support for creating COGs through recipes and pangeo-forge.

Whilst formats like Zarr are great for analysis, our use cases also often involve displaying global-sized raster datasets. Having first-class support for COG (Cloud Optimized GeoTIFF) would be awesome!

For example, there is already the recipe NetCDFtoZarrSequentialRecipe; it would be handy to have something like NetCDFtoCOGRecipe, whether this converts every variable in a NetCDF file to a COG or allows the required variables to be selected.

I've done some playing around with 'generic' conversion of NetCDF to COG using rasterio, xarray, rioxarray, and rio_cogeo. If useful, I can post some of the functions here, though they're pretty 'experimental'.

Ideally such functionality would enable some parameterisation of the conversion (CRS, nodata, compression type etc.).
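
To make the parameterisation idea concrete, here is a minimal sketch of what such options could look like. Everything named here (`COGOptions`, `select_variables`, `netcdf_to_cogs`) is hypothetical and not part of pangeo-forge-recipes. The conversion step assumes rioxarray's `rio.write_crs`, `rio.write_nodata` and `rio.to_raster` together with GDAL's COG driver (GDAL 3.1 or later) rather than rio_cogeo, and assumes each selected variable is a 2-D raster whose spatial dimensions rioxarray can recognise.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class COGOptions:
    """Hypothetical container for the conversion parameters (CRS, nodata, compression)."""

    crs: str = "EPSG:4326"
    nodata: Optional[float] = None
    compress: str = "DEFLATE"
    blocksize: int = 512

    def creation_options(self) -> Dict[str, Any]:
        # Keyword arguments forwarded to rasterio/GDAL when writing with the COG driver.
        return {"driver": "COG", "compress": self.compress, "blocksize": self.blocksize}


def select_variables(available: Iterable[str], wanted: Optional[Iterable[str]] = None) -> List[str]:
    """Return every variable when `wanted` is None, otherwise only the requested ones."""
    available = list(available)
    if wanted is None:
        return available
    wanted = list(wanted)
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"variables not found in dataset: {missing}")
    return wanted


def netcdf_to_cogs(path, out_dir, options=COGOptions(), variables=None) -> List[str]:
    """Write one COG per selected variable (sketch only, assumes 2-D georeferenced variables)."""
    # Imported lazily so the option handling above works without the geospatial stack.
    import os

    import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)
    import xarray as xr

    written = []
    with xr.open_dataset(path) as ds:
        for name in select_variables(ds.data_vars, variables):
            da = ds[name].rio.write_crs(options.crs)
            if options.nodata is not None:
                da = da.rio.write_nodata(options.nodata)
            target = os.path.join(out_dir, f"{name}.tif")
            da.rio.to_raster(target, **options.creation_options())
            written.append(target)
    return written
```

Passing `variables=None` covers the "convert everything" case, while a list such as `variables=["sst"]` covers the "select the required variables" case.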


Sorry for the 🧠 dump! Happy to flesh out ideas and try to answer questions! I just know we'd love to make some COGs en-masse 😅

rabernat commented 3 years ago

Thanks a lot for the issue @ciaranevans! I think we are all in agreement about the need to support COG as an output format.

One general question I have about a recipe is whether there is always a 1:1 mapping between a set of inputs and an output dataset? Or could we have a recipe that reads a set of NetCDF files and produces BOTH a Zarr and a set of COGs? I think the long term answer is yes, but for the short term, I don't think it's an important feature to support right now.

Instead let's focus on the 1:1 case for now and add COGs as a supported output.

The question of how to implement this raises some software design questions for which we would welcome your input. As discussed in #27, our idea is to slowly refactor the monster NetCDFtoZarrSequentialRecipe into a series of classes and mixins that can be re-composed to easily define new recipe classes. (In fact, this is how we originally tried to set up the code, but we abandoned it because it just felt like premature complexification.) For example, we could imagine refactoring to define this class as

class NetCDFtoZarrSequentialRecipe(SequentialRecipe, NetCDFInputMixin, ZarrOutputMixin):
    pass

and then simply being able to write

class NetCDFtoCOGSequentialRecipe(SequentialRecipe, NetCDFInputMixin, COGOutputMixin):
    pass

We would welcome a PR from you and the DevSeed team to implement COG output support following this sort of pattern.
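
For anyone picking this up, the composition pattern described above can be sketched in a runnable form. This is only an illustration under stated assumptions: the method names (`open_input`, `prepare_target`, `store`, `finalize_target`) and the dictionary payloads are hypothetical stand-ins, not the actual pangeo-forge-recipes interface, and no real NetCDF, Zarr or COG I/O is performed.

```python
class SequentialRecipe:
    """Hypothetical base class: owns the loop over inputs, nothing format-specific."""

    def __init__(self, inputs):
        self.inputs = list(inputs)

    def run(self):
        self.prepare_target()
        for item in self.inputs:
            self.store(self.open_input(item))
        return self.finalize_target()


class NetCDFInputMixin:
    def open_input(self, item):
        # A real implementation would open the NetCDF file here.
        return {"source": item, "format": "netcdf"}


class ZarrOutputMixin:
    def prepare_target(self):
        self.chunks = []

    def store(self, data):
        # Stand-in for appending a chunk to a single Zarr store.
        self.chunks.append(data["source"])

    def finalize_target(self):
        return {"zarr_store": self.chunks}


class COGOutputMixin:
    def prepare_target(self):
        self.files = []

    def store(self, data):
        # Stand-in for writing one COG per input.
        self.files.append(data["source"].replace(".nc", ".tif"))

    def finalize_target(self):
        return {"cogs": self.files}


class NetCDFtoZarrSequentialRecipe(SequentialRecipe, NetCDFInputMixin, ZarrOutputMixin):
    pass


class NetCDFtoCOGSequentialRecipe(SequentialRecipe, NetCDFInputMixin, COGOutputMixin):
    pass


zarr_result = NetCDFtoZarrSequentialRecipe(["a.nc", "b.nc"]).run()
cog_result = NetCDFtoCOGSequentialRecipe(["a.nc", "b.nc"]).run()
```

Because the base class only calls methods that the mixins supply, swapping `ZarrOutputMixin` for `COGOutputMixin` changes the output format without touching the input handling or the execution loop.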

davidbrochart commented 3 years ago

> One general question I have about a recipe is whether there is always a 1:1 mapping between a set of inputs and an output dataset? Or could we have a recipe that reads a set of NetCDF files and produces BOTH a Zarr and a set of COGs? I think the long term answer is yes, but for the short term, I don't think it's an important feature to support right now.

I also think that in the long term we want to support multiple outputs, just like in conda-forge we have recipes that produce e.g. a shared library and a static library, or a binary library (for using in C) and its Python bindings. Not only does it save resources (the download of the original files, the chunking...), it also ensures that all the outputs originate from the same reference.
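
As a rough illustration of the resource-saving argument, a multi-output run could fetch each input once and hand the same object to every writer. The function `run_multi_output` and the fake fetch and writer callables below are hypothetical and exist only to show the fan-out; they are not part of any existing recipe class.

```python
from typing import Callable, Dict, Iterable, List


def run_multi_output(
    inputs: Iterable[str],
    fetch: Callable[[str], dict],
    writers: Dict[str, Callable[[dict], str]],
) -> Dict[str, List[str]]:
    """Fetch each input once and pass the same object to every output writer."""
    results: Dict[str, List[str]] = {name: [] for name in writers}
    for item in inputs:
        data = fetch(item)  # single download/open per input
        for name, write in writers.items():
            results[name].append(write(data))
    return results


fetch_calls = []


def fake_fetch(item):
    # Records each call so the number of downloads can be checked.
    fetch_calls.append(item)
    return {"source": item}


outputs = run_multi_output(
    ["a.nc", "b.nc"],
    fake_fetch,
    {
        "zarr": lambda d: f"store.zarr/{d['source']}",
        "cog": lambda d: d["source"].replace(".nc", ".tif"),
    },
)
```

Here two inputs feed two outputs with only two fetches, and both outputs are derived from exactly the same in-memory data.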