pangeo-data / bids2023_codesprint

Repository for the joint OSGEO and Pangeo code sprint at ESA BIDS in November 2023
https://wiki.osgeo.org/wiki/OSGeo_Community_Sprint_2023
MIT License

xarray-dggs package? #3

Open benbovy opened 1 year ago

benbovy commented 1 year ago

Cross-posting here what I've suggested in the Pangeo Discourse thread.

Xarray-DGGS

I think that a good and reasonable goal for the sprint would be to come up with an xarray-dggs package that would provide an xarray-compatible interface to various DGGS features exposed in 3rd-party Python libraries (e.g., healpy, pys2index, spherely, h3-py, dggrid4py, etc.) through a very basic set of features:

  1. A few Xarray custom indexes that could be built from lat/lon data (or directly from DGGS cell indices) and that would enable data selection using .sel()
  2. Xarray Dataset and/or DataArray accessors for DGGS-specific API (set new DGGS Xarray indexes from lat/lon coordinates, get DGGS cell indices as a new coordinate, etc.)
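
To make the two items above a bit more concrete, here is a minimal sketch of what the accessor side could look like. Everything in it is an assumption made for illustration: the accessor name `dggs_sketch`, the method `assign_cell_ids`, and the helper `toy_cell_ids` are hypothetical, and the cell-id function is a toy equal-angle binning, not a real DGGS (a real package would call healpy, h3-py, etc. at that point). The default pandas-backed Xarray index stands in for a dedicated DGGS index, which is enough to show `.sel()` working on cell ids.

```python
import numpy as np
import xarray as xr


def toy_cell_ids(lat, lon, level):
    """Toy equal-angle binning; a placeholder for a real DGGS backend call."""
    nlat, nlon = 2 * 2**level, 4 * 2**level
    row = np.clip(np.floor((np.asarray(lat) + 90.0) / 180.0 * nlat), 0, nlat - 1)
    col = np.floor((np.asarray(lon) % 360.0) / 360.0 * nlon) % nlon
    return (row * nlon + col).astype("int64")


@xr.register_dataset_accessor("dggs_sketch")  # hypothetical accessor name
class DGGSSketchAccessor:
    def __init__(self, ds):
        self._ds = ds

    def assign_cell_ids(self, level, lat="lat", lon="lon", name="cell_ids"):
        """Add cell ids as a new 1-D coordinate and index it for .sel()."""
        ds = self._ds
        ids = toy_cell_ids(ds[lat].values, ds[lon].values, level)
        ds = ds.assign_coords({name: (ds[lat].dims, ids)})
        # Default (pandas) index here; a DGGS-aware custom index would go instead.
        return ds.set_xindex(name)


ds = xr.Dataset(
    {"temperature": ("points", [10.0, 12.5, 9.0])},
    coords={
        "lat": ("points", [48.4, -33.9, 0.0]),
        "lon": ("points", [-4.5, 151.2, 0.0]),
    },
)
indexed = ds.dggs_sketch.assign_cell_ids(level=3)

# Select the point that falls in the same cell as the first point.
first_cell = indexed.cell_ids.values[0]
subset = indexed.sel(cell_ids=first_cell)
```

The sketch assumes 1-D lat/lon coordinates sharing one dimension; gridded (2-D) inputs would need to be stacked first or handled by a custom index.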

I think that DGGS grids have enough in common to expose the functionality for all of them in a common xarray-dggs package, maybe with optional dependencies for each backend (healpy, pys2index, h3-python, DGGRID, etc.).

This proposal builds on top of a few suggestions found in the README of this repository, e.g., H3 or rHEALPIx + Xoak + Xarray, H3 or rHEALPIx + Xoak + Xarray + Xvec?. While both xoak and xvec can be good sources of inspiration for xarray-dggs, those packages have slightly different scopes: Xoak provides generic tree-based indexes (not only geospatial) and Xvec currently works only with shapely (planar geometries). Xoak has a nice API for nearest-neighbors point-wise indexing that leverages Xarray advanced indexing (i.e., using xarray.DataArray objects) but it still has to be refactored so it builds on top of Xarray custom indexes. Xvec is one of the few (the only?) released Xarray extensions that provide an Xarray custom index.

The sub-topics and (open) questions listed below are not exhaustive. Please feel free to suggest in the comments below any important topic or question that is missing.

Data model

An Xarray index must relate to one or more coordinates with arbitrary dimensions. In the case of DGGS, what should be the coordinates and their dimension(s)?

Do we need to have a fixed data model for all DGGS? It can be flexible, i.e., an Xarray Index subclass may support different data models (build options, flexible inputs).

Should we restrict the index and/or coordinates to a fixed level / zoom / resolution of the discrete global grid?

I guess we need some sort of CRS and/or additional metadata for certain kinds of grids (custom parameters)? Some grid parameters could perhaps be hidden as internal attributes of the index?
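
As one possible answer to the questions above (a sketch only, not a proposal for the final layout): a single `cells` dimension, one `cell_ids` coordinate at a fixed level, and the grid parameters stored as coordinate attributes. The attribute names `grid_name`, `level` and `indexing_scheme`, and the variable name `sst`, are made up for this example. The only hard fact used is that a HEALPix grid has 12 * nside**2 cells, with nside = 2**level.

```python
import numpy as np
import xarray as xr

level = 2
nside = 2**level
n_cells = 12 * nside**2  # number of HEALPix cells at this resolution

ds = xr.Dataset(
    # One value per cell, along a single "cells" dimension.
    {"sst": ("cells", np.zeros(n_cells))},
    coords={
        "cell_ids": (
            "cells",
            np.arange(n_cells),
            # Hypothetical metadata keys describing the grid.
            {"grid_name": "healpix", "level": level, "indexing_scheme": "nested"},
        )
    },
)

grid_params = dict(ds.cell_ids.attrs)
```

An index built from `cell_ids` could read these parameters once and keep them as internal attributes, so users would not have to pass them again at selection time.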

Data selection API (.sel)

There are a lot of possibilities regarding how to select data on a discrete global grid. What kind of indexer object(s) could we pass to xarray .sel()?

How to detect the kind of indexer? We could look at the type of the indexers (scalar, slice, list, array, custom object), the value type, etc. Note: currently it is not possible to pass custom options to .sel https://github.com/pydata/xarray/issues/7099.
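
A rough sketch of the type-based dispatch mentioned above, in plain Python. The `BoundingBox` class and the `classify_indexer` function are hypothetical and only illustrate how an index's selection method might branch on what it receives; the returned labels are placeholders for the actual selection logic.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from numbers import Integral


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical custom indexer object (not part of any library)."""

    lon_min: float
    lat_min: float
    lon_max: float
    lat_max: float


def classify_indexer(indexer):
    """Return a label for the kind of selection implied by the indexer type."""
    if isinstance(indexer, BoundingBox):
        return "bounding box"
    if isinstance(indexer, slice):
        return "cell id range"
    if isinstance(indexer, Integral) and not isinstance(indexer, bool):
        return "single cell id"
    if isinstance(indexer, Sequence) and not isinstance(indexer, (str, bytes)):
        return "list of cell ids"
    if hasattr(indexer, "__array__"):  # duck-typed array (NumPy, DataArray, ...)
        return "array of cell ids"
    raise TypeError(f"unsupported indexer type: {type(indexer).__name__}")


kinds = [
    classify_indexer(42),
    classify_indexer(slice(0, 10)),
    classify_indexer([1, 2, 3]),
    classify_indexer(BoundingBox(-5.0, 47.0, -3.0, 49.0)),
]
```

Dispatching on a custom object like `BoundingBox` is one way around the current lack of custom options in `.sel`, since the extra information travels inside the indexer itself.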

Assessing the capabilities of the DGGS Python libraries

There are some important requirements for reusing those libraries efficiently with Xarray:

Perhaps not all of the libraries mentioned above meet those requirements. Which ones should we focus our efforts on? Which of the kinds of data selection listed above should we focus on, considering a common set of core features available in all libraries?

tinaok commented 1 year ago

Hello, @keewis is gathering the work we did on healpix integration for our IAOCEA project here: https://github.com/IAOCEA/xarray-healpy

We will try to update an example notebook with a real data projection before Monday.

Note that we are not implementing rhealpix but healpix itself, through the healpy package.

Our final objective is that using property of Xarray-DGGS, we can

benbovy commented 1 year ago

That looks great @tinaok and @keewis!

Your objectives already look quite specific and "high-level". I wonder if during the sprint it would be best to first

  1. discuss everyone's use-cases / user stories with DGGS
  2. see how we can break them down into smaller, generic tasks
  3. check, for each grid (implementation available in Python), whether those tasks are supported
  4. see if/how those tasks may be easily implemented using the Xarray API (.sel, etc., possibly with a dggs Xarray index) or if they would require custom API in an Xarray accessor.

before getting our hands dirty with the code.

(3-4 are more specific to Python/Xarray but 1-2 may be interesting for anyone)

This might better structure the sprint and this would greatly help in having a better idea on whether an xarray-dggs extension (or any other package) makes sense for supporting common tasks across different global grids (healpix, s2, h3, etc.). At least for me as I don't have much experience in using DGGS for practical applications :)

rabernat commented 1 year ago

I'm excited to participate in a sprint on this topic!

tinaok commented 1 year ago

@benbovy I am happy to share our use case through the example we just added.

I can show how we convert data, and the challenges we have today. With the same notebook, I can show the same model data at 2 different resolutions, which we hope to somehow 'connect' using a DGGS convention.

I'm also very much interested in learning from DGGS specialist @allixender (?) how DGGS is used for routing.

If anyone from the EERIE project or the nextGEMS Cycle 3 ICON project is around at BIDS23 (https://github.com/eerie-project/EERIE_hackathon_2023/ ? https://github.com/nextGEMS/nextGEMS_Cycle3 ? @koldunovn ? https://easy.gems.dkrz.de/Processing/healpix/healpix_starter.html), I would love to hear their user stories with healpix, and also how they will make their data available (DestinE?).

koldunovn commented 1 year ago

Wow, nice ideas! I haven't heard that anyone I know from EERIE, DestinE or nextGEMS plans to participate in this code sprint. Also, we will have our EERIE Hackathon this week.

From nextGEMS and EERIE, notebooks with examples of how we use unstructured data are available, and the ICON data for the last nextGEMS cycle are all in HEALPix. Access to the data is currently restricted if you don't have a DKRZ account, but if there is interest we can provide a subset. In EERIE we are also trying to expose data through xpublish, but it's at an early stage.

Those kinds of projects would be great to see at the nextGEMS Hackathon, which will be held 4-8 March somewhere around Hamburg. Let me know if there is interest and I will get you in contact with nextGEMS people :)

benbovy commented 1 year ago

Great to hear from you @koldunovn! The notebook examples will be helpful. I've created an account on DKRZ so I'm now able to ask to join a project there if needed.

koldunovn commented 1 year ago

Great! If you are interested in HEALPix I would start from this one, and explore the rest of the collection: https://easy.gems.dkrz.de/Processing/healpix/healpix_starter.html

Unstructured (FESOM2) and semi-structured data are covered here: https://github.com/nextGEMS/nextGEMS_Cycle3

We are also currently developing EERIE notebooks, but there are a lot of examples using nextGEMS data as well: https://github.com/eerie-project/EERIE_hackathon_2023/

If you are looking for something more concrete, let me know.

benbovy commented 1 year ago

We started a shared document on HackMD for the sprint: https://hackmd.io/UBM5L6YNRlG73e3eVo6vOg

benbovy commented 1 year ago

Xarray DGGS extension library in development here: https://github.com/benbovy/xdggs