let's enumerate all the ways to represent a "grid" in python

rabernat commented 6 years ago

As discussed in the Pangeo meeting, there are currently many different models for a grid in python. Some of these are explicit, some are implicit in how packages work. I am talking about both structured and unstructured grids.

Let's try to put together a list of what these are and how they differ. For now, I will just throw out some links:

What else should be on this list?

cc @dopplershift, @bekozi, @rsignell-usgs

rabernat commented 6 years ago

https://github.com/NCAR/wrf-python

fmaussion commented 6 years ago

I'll add "my" Grid object to the list. Note that this is not an endorsement ;-)

http://salem.readthedocs.io/en/latest/gis.html

mnlevy1981 commented 6 years ago

It sounds like you guys want a python class that has some method objects to act on the grid rather than simply an xarray schema to use, so this may not be the most helpful... but CESM wants grids in the SCRIP format: that's a netcdf file with very specific fields. From a very old version of the User Guide (Figure 2.2.2 on page 8)

netcdf remap_grid_T42 {
dimensions:
        grid_size = 8192 ;
        grid_corners = 4 ;
        grid_rank = 2 ;
variables:
        long grid_dims(grid_rank) ;
        double grid_center_lat(grid_size) ;
                grid_center_lat:units = "radians" ;
        double grid_center_lon(grid_size) ;
                grid_center_lon:units = "radians" ;
        long grid_imask(grid_size) ;
                grid_imask:units = "unitless" ;
        double grid_corner_lat(grid_size, grid_corners) ;
                grid_corner_lat:units = "radians" ;
        double grid_corner_lon(grid_size, grid_corners) ;
                grid_corner_lon:units = "radians" ;
// global attributes:
                :title = "T42 Gaussian Grid" ;
}

At some point a grid_area field was added -- ESMF mapping tools can be used with SCRIP grid files, and the area field is necessary for conservative remapping. Anyway, I would posit that xr.open_dataset([SCRIP FILE]) is a way to represent a grid in python.

Interestingly enough, SCRIP was originally a full suite of Fortran-based grid tools (SCRIP stands for Spherical Coordinate Remapping and Interpolation Package) though the package itself is no longer supported... presumably because ESMF does everything it used to do.

fmaussion commented 6 years ago

https://github.com/pytroll/pyresample

cc @djhoese

djhoese commented 6 years ago

@fmaussion Thanks for the mention. So many good projects listed here that I didn't know about. For everyone else, I am one of the maintainers of the pyresample package which is very similar to some of these other libraries and focuses on resampling large satellite data arrays to new projections (used by the satpy library.

Pyresample has a couple different geometry objects and some are subclasses of others so I'll just list the two most important and most used:

SwathDefinition(lons, lats): Unstructured (non-uniform spacing) lon/lat arrays to represent satellite swath data like those from polar-orbitting satellite instruments (VIIRS, MODIS, etc).
AreaDefinition (name, proj4_dict, rows, cols, lower left coordinate, upper right coordinate): A projected and uniformly spaced grid of pixels. The datum and projection information is all specified by a dictionary of PROJ.4 parameters. The location of the grid is specified by the extents (our edges of the out edge pixels) and the resolution is determined by how many pixels are specified with the rows and cols parameter. There are other parameters but they aren't useful here.

I should also point out a similar issue to this on the pyresample repository: https://github.com/pytroll/pyresample/issues/57 Where it was discussed how users might want to specify their grids/areas. I have an undergrad student employee working on this currently for in-python specification and on-disk specification in a yaml file. This includes things like specifying them with a center point and a radius for scientific users who are used to specifying regions of interest.

For anyone who hasn't been linked to it already this type of information may be useful to the geoxarray project I'm starting (https://github.com/pydata/xarray/issues/2288, https://github.com/geoxarray/geoxarray). I had been considering duplicating some of the work done by pyresample's AreaDefinition and SwathDefinition objects in geoxarray, but don't have anything implemented yet.

So...what is the end goal of this issue?

djhoese commented 6 years ago

cc @mraspaud and @leouieda

rabernat commented 6 years ago

So...what is the end goal of this issue?

My goals for now is just to understand the extent of duplication of effort in developing grid models and to start a discussion on whether a more unified approach would be worthwhile. Some of the downsides of having many different implementations for very similar concepts are:

wasted developer time
incomplete feature sets (because each package currently implements only the things it needs for the application it has in mind)
reduced interoperability between packages

As a concrete example of the lack of interoperability, xgcm, my package, has its own representation for finite volume grid cells (cell center, cell face, etc.). @JiaweiZhuang's xesmf package can do conservative regridding, for which it needs to know about the cell extents. Lacking a universally accepted way to describe these grids, Jiawei invented a [completely reasonable] new convention for specifying the grid cell bounds. Consequently, despite ostensibly knowing about the same type of object, xgcm and xesmf are not interoperable. (I am not trying to pick on Jiawei--xesmf is amazing! Examples could have been drawn from any combination of two packages from the above list.)

Reviewing the docs from the list, I think there are two basic categories of things called "grids":

grid as a geographical projection: this is kind of a GIS-based view. The point of the grid objects is to allow for georeferencing / resampling. Focused on remote sensing. proj4 usually involved. [salem, pyresample, geoxarray (?), metpy]
finite volume grid: this a model (GCM) based view. The grid has "cells" which are polygons (rectangles for structured grids) with finite geographic extents. Variables may live at different points on these cells (e.g. face, center, or corner). [esmf, pycomodo, xesmf, xgcm, ugrid, sgrid, metpy, mpas-analysis]

Just as numeric and numarray decided to merge into numpy, it could be beneficial for our [much smaller] software ecosystem to merge some of the above efforts.

rabernat commented 6 years ago

More:

rabernat commented 6 years ago

p.s. Worth noting that definition 2 (finite volume grid) doesn't have to live on the Earth. This type of grid is useful also for models run in cartesian coordinates.

fmaussion commented 6 years ago

there are two basic categories of things called "grids":

Right, sorry for spaming this issue if you were just talking about category 2 in the first place.

pbranson commented 6 years ago

Right, sorry for spaming this issue if you were just talking about category 2 in the first place.

Whilst there may be two different 'world views' on grids (one from the modellers perspective and one from the earth observation perspective) I feel like some of the important scientific questions will be more easily answered if both didnt also succumb to the interoperability issue that @rabernat outlined (which I think is the main point of this issue)

finite volume grid: this a model (GCM) based view. The grid has "cells" which are polygons (rectangles for structured grids) with finite geographic extents. Variables may live at different points on these cells (e.g. face, center, or corner). [esmf, pycomodo, xesmf, xgcm, ugrid, sgrid, metpy, mpas-analysis]

This could be generalised even further to say that the variables may also be defined on grids, adding additional dimensions beyond space and time (i.e. wavelengths of light, surface wave direction and frequency energy bins etc etc), where a lot of the same properties relating the conservative up- and down-sampling still also apply

So ideally the specification and implementation allows for a hierarchical definition of a grid from 1D through to nD?

...maybe too far down the rabbit hole and reminds me of some of the discussions I have read around xarray and the fundamental challenge dealing with nD-data resulting from the distinction between dimensions (array indicies) and coordinates

djhoese commented 6 years ago

@rabernat The xgcm versus xesmf gridded is interesting because looking at the two packages' documentation on grids I wouldn't immediately assume they were solving the same problem. Looking at the xesmf documentation the examples for gridding seem to be based on lon/lat grids. The xgcm seems like it handles a lot more cases but being unfamiliar with the package it isn't clear to me how to apply it to my remote sensing problems. I like the idea of describing the grids as DataArray objects, however, in my data cases I usually deal with 1000s x 1000s pixel grids (22k x 22k on the larger end) which isn't always represented well with xarray coordinates (not sure if this is considered large to the other things discussed here). In my satpy library (which uses pyresample) we use dask underneath xarray to help with the memory and performance issues that come up with processing these larger arrays. If we store things in .coords then xarray currently has to compute these which can take up memory and time to load information from the data files, depending on what the data is.

With the "geoxarray" package I'm starting I was going to add an xarray accessor (http://xarray.pydata.org/en/stable/internals.html#extending-xarray) to provide a simple way to access the coordinate reference system (CRS). It sounds like anything that comes out of this issue could work well with "geoxarray" where you could have a grid description from some gridding package and add CRS information to it in a way that geoxarray could support.

_{The brainstorm is starting...}

martindurant commented 6 years ago

to provide a simple way to access the coordinate reference system (CRS)

I would request that you also keep in mind astronomical World Coordinate Systems (WCSs), normally stored in the headers of FITS files. I don't know how similar the types of coordinate mappings might be; for astronomy FITS, it is often affine, but could be higher-order polynomials.

rabernat commented 6 years ago

he xgcm versus xesmf gridded is interesting because looking at the two packages' documentation on grids I wouldn't immediately assume they were solving the same problem. Looking at the xesmf documentation the examples for gridding seem to be based on lon/lat grids. The xgcm seems like it handles a lot more cases

This is kind of my point. I implemented the parts of a "grid" that I cared about immediately for my applications (focused on finite differencing and related operations on the native grid space), but have not yet implemented the more "geographical" aspects of grids.

rabernat commented 6 years ago

Oh we should also add all the ways that geoviews handles grids:

http://geo.holoviews.org/user_guide/index.html

cc @philippjfr

djhoese commented 6 years ago

@martindurant I won't exclude the idea, but I'm also not sure I know the differences needed by the astronomy community. I read a little bit of astropy's wcs documentation, but I'm not sure I have a great grasp on the concepts quite yet.

@rabernat Most of that looks like 2D lon/lat grids with values for each pixel.

I also wanted to point out that in the remote sensing field we typically deal with data as an "area" where one data point represents an area. The other common one is a single point. It is assumed that the coordinate for a pixel represents the center of the pixel/area/cell.

martindurant commented 6 years ago

The astro WCS is basically an object encoding an (arbitrary) function for transforming pixel coords to logical coords and usually also the inverse. In the simplest case, there will be a few keys in the file metadata like reference logical x/y, reference pix x/y, scale, but xarray doesn't need to know how things are actually stored. As was discussed at scipy, the problem comes when the coords can't be expressed as a simple grid of values at each pix location; I expect you are facing exactly the same problem.

djhoese commented 6 years ago

Was this discussed at a BOF? I missed one I wanted to go to the first day because I had to represent vispy at the visualization BOF. When you say "can't be expressed" do you mean the coordinates have to be specified for each pixel (non-uniform spacing)?

bekozi commented 6 years ago

This has been a very interesting discussion so far. The ESMF mesh/grid model can serve as a useful reference point having been built up over years of use cases. This is not to say it is the best user-facing grid model! I expect ESMPy will keep pretty close to ESMF internals with xESMF being the best way to interact with xarray packages. With ESMF, we are working to integrate MOAB as our mesh backend.

With the "geoxarray" package I'm starting I was going to add an xarray accessor to provide a simple way to access the coordinate reference system (CRS).

@djhoese Nice. Are you standardizing on CF convention with PROJ.4 or something? At the workshop, I talked with @dopplershift and @niallrobinson about creating packages similar to cftime addressing the CRS issue and general metadata interpretation. Are you aiming for something similar?

Looking at the xesmf documentation the examples for gridding seem to be based on lon/lat grids.

@djhoese Currently xESMF supports a subset of the features available in ESMPy...

maybe too far down the rabbit hole

@pbranson I don't think so at least. Time-varying grid/meshes have applications in sea level rise modeling for example. It usually the edge cases that break conventions!

xgcm and xesmf are not interoperable [for conservative regridding]

@rabernat & @JiaweiZhuang, is the xESMF/xGCM conservative regridding compatiblity issue related to how corners are handled? I expect xESMF has corners with a (lat+1, lon+1) dimension similar to ESMPy while xGCM maybe does (lat, lon, corner_count). The ESMF corners method was chosen to reduce coordinate duplication, but it is annoying to use. I have some Python functions to move between the two representations that might be helpful. Would xGCM be the right place to add this? Maybe this has long since been addressed...

djhoese commented 6 years ago

@bekozi It is probably best to skim through the related xarray issue here: https://github.com/pydata/xarray/issues/2288

But yes, the idea would be that you could do my_data_arr.geo.crs and get a CRS object that is dynamically created from the contents of your DataArray/Dataset object. My hope is that it would handle multiple common ways of describing projections including the CF standard way (grid_mapping, etc) and my usual use case of PROJ.4 strings/dicts. With the optional dependency of rasterio I'm sure it could also convert to WKT and other formats. The library would provide CF NetCDF writing too on top of the existing to_netcdf provided by xarray, at least that's the idea. All the other common grid parameters we would need for describing a data grid are expected in the DataArray metadata (x/y pixel center coordinates and array shape).

I'm not sure how much PROJ.4 and CF projection works with astronomy WCS stuff (@martindurant).

Now if only I had the time to write it...

JiaweiZhuang commented 6 years ago

is the xESMF/xGCM conservative regridding compatiblity issue related to how corners are handled? I expect xESMF has corners with a (lat+1, lon+1) dimension similar to ESMPy while xGCM maybe does (lat, lon, corner_count).

Not sure how xgcm represents corners internally, but (lat+1, lon+1) vs (lat, lon, 4) is indeed an annoying issue (pydata/xarray#1475).

The "incompatibility" is largely because the totally different purposes of the two packages. xgcm has a strong focus on finitely-difference calculations, which requires switching between staggering locations and the connectivity information between (cubed-sphere) panels. It does store corner information nicely, but most of other features are an overkill for xESMF. xESMF just needs a plain (lat+1, lon+1) numpy array to compute cell overlapping for conservative regridding, not much else.

I want to keep the absolutely simple user interface for xESMF and minimize the dependency on advanced data structure (pure numpy arrays should always work). I got many users migrated from MATLAB/NCL/IDL and they just barely know xarray and even numpy. Asking them to learn other fancier packages like xgcm will blow their mind... Maybe the interoperability with other packages can be built on top of the current interface?

JiaweiZhuang commented 6 years ago

Given the diversity of user needs and the huge cultural gap between different research communities, I feel that it is almost impossible to invent a one-for-all representation of "grid".

The xgcm vs xESMF issue is a rather small cultural gap: Many xESMF users are from the environment & air quality modeling community (my field of study); they only care about conservatively regridding chemical concentration and emissions fields, and never worry about taking spatial derivatives (vorticity, divergence). So they will have little incentive to learn the additional features that xgcm provides.

A much bigger cultural gap is between GIS and numerical modeling. pydata/xarray#2288 is indeed a great proposal from a GIS perspective, but the modeling people mostly just think a "grid" as simple 1D/2D arrays of latitude&longitude. GIS terms like "datum", "geodetic" and even simply "coordinate reference system" can feel quite exotic to atmospheric modeling people. The major reason is that atmospheric models always assume a perfect sphere, so people never worry about the choice of geodetic datum as in GIS (although sometimes they probably should; see Monaghan et al. 2013 and Cao et al. 2017).

So, for the comment:

Looking at the xesmf documentation the examples for gridding seem to be based on lon/lat grids.

Exactly, because xESMF (recall the name Earth System Modeling Framework) is written more for the numerical modeling people. On the contrary, GIS people will probably find xESMF inconvenient, because the grid has to be expressed in true lat/lon values on a perfect sphere.

rabernat commented 6 years ago

Asking them to learn other fancier packages like xgcm will blow their mind

@JiaweiZhuang -- thanks for your response. I guess it was not sufficiently clear from my message that I am not proposing to merge any existing packages tomorrow or force your users to adopt an unfamiliar new tool. xesmf is an amazing piece of software which has already developed a wide following. The last thing I would want would be to undermine this success in any way.

I am mostly concerned here with the internals of these packages, i.e. the code that the developers have to write to represent a grid. So yes, while xgcm and xesmf currently have very different user-facing goals, internally, they both had to create very similar data structures related to the representation of finite volume grid cells. (And so did many of the other packages on my list.)

I agree that distinction between "GIS grids" and "GCM grids" an important one. I think this is a useful outcome of this discussion.

JiaweiZhuang commented 6 years ago

@rabernat Your clarification is clear and I think we are on the same page (Also, thanks very much for the kind words about xESMF!)

I am mostly concerned here with the internals of these packages

A useful thing would be identifying small, individual, commonly-used functionalities for grid operations, and adding them to xarray's core or a very small extension module. For example:

1) Validate the continuity of grid coordinate values

Right now it is easy to create an illegal coordinate with discontinuous/non-monotonic values:

dr = xr.DataArray([1,1,1,1], coords={'x': [1,2,4,3]}, dims='x')

It will only throw an error at plotting, not at creation.

Seems like xgcm Grid() can also accept this kind of object.

In [21]: ds  # x_c is not monotonic
Out[21]:
<xarray.Dataset>
Dimensions:  (x_c: 9, x_g: 9)
Coordinates:
  * x_c      (x_c) int64 1 2 3 9 8 7 6 5 4
  * x_g      (x_g) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
Data variables:
    *empty*

In [22]: Grid(ds)
Out[22]: <xgcm.Grid>

Discontinuity in the grid coordinates once led to a bug in xESMF (JiaweiZhuang/xESMF#16). It would be nice to have something like xr.assert_valid_coords().

2) Validate the consistency between cell centers and corners

For a grid object passed to xESMF, its center coordinates are totally decoupled from its corner coordinates (they are just separate numpy arrays). A consistency checking ("is the mean of the cell corners equal to the cell centers?") is quite useful for preventing data mismatch and for inferring cell boundaries (https://github.com/JiaweiZhuang/xESMF/issues/5#issuecomment-337761540). Not sure how xgcm ensures data consistency.

Actually, xarray already has _is_monotonic and _infer_interval_breaks in xarray/plot/plot.py that kind of do what I want, although they still need to consider special cases such as curvilinear mesh over the pole. Maybe these internal functions can be polished and made more explicit, like xr.apply_ufunc() that is used both internally and externally.

Those functionalities are pretty general for all grid objects, regardless of actual data presentation. This doesn't address the issue of diverging data structures, but should avoid some duplicated efforts.

fmaussion commented 6 years ago

A much bigger cultural gap is between GIS and numerical modeling. pydata/xarray#2288 is indeed a great proposal from a GIS perspective, but the modeling people mostly just think a "grid" as simple 1D/2D arrays of latitude&longitude. GIS terms like "datum", "geodetic" and even simply "coordinate reference system" can feel quite exotic to atmospheric modeling people. The major reason is that atmospheric models always assume a perfect sphere, so people never worry about the choice of geodetic datum as in GIS (although sometimes they probably should; see Monaghan et al. 2013 and Cao et al. 2017).

I have to comment on this, because I think this is only partly true. Yes, atmospheric modeling people often do not know what a map projection is, or a "coordinate reference system". But this is causing them a lot of trouble: most (all?) limited area models like WRF do think of the earth as a sphere indeed, but they use a map projection to represent their grid, which is cartesian in eastings-northings (the map projection reference frame) and irregular in lon-lat. People non-aware with CRS will deal with the issue as following:

use the 2D lons and lats to plot them on a cartopy map (which is already slightly wrong because it the requires to use the _infer_interval_breaks trick from xarray, which is only an approximation)
regrid them to a regular lon-lat grid (maybe using tools like xesmf) in order to make comparisons with other gridded datasets

The much better way to deal with this would be to stay in the original map projection and regrid the other datasets you want to compare your model to (e.g. satellite products) into the original model projection.

I was surprised to see that a blog post I wrote about WRF map projections is the most visited entry on my webpage. I think that many people in the python/xarray ecosystem would benefit from a unified handling of CRS in a tool which will hide all this complexity from them. Combined with a tool like pyresample or xesmf they would have all they need to work in cartesian grids instead of relying on the approximations of lon/lat grids.

JiaweiZhuang commented 6 years ago

@fmaussion Thanks, I really enjoy your blog post. But I have a somewhat opposite opinion on this.

But this is causing them a lot of trouble: most (all?) limited area models like WRF do think of the earth as a sphere indeed, but they use a map projection to represent their grid, which is cartesian in eastings-northings (the map projection reference frame) and irregular in lon-lat.

The much better way to deal with this would be to stay in the original map projection and regrid the other datasets you want to compare your model to (e.g. satellite products) into the original model projection.

Whether to use true lat/lon or the projected coordinates largely depends on the problem a user wants to solve. For common questions like "what's the temperature at this location", using lat/lon is way more convenient/intuitive because the location of a city or a observation site is mostly expressed in lat/lon. More concretely, the xarray.Dataset should have coordinates in true lat-lon values, so users can simply call ds.sel(lat=a, lon=b). For the trouble with WRF, I'd rather blame WRF's confusing output format than blame users.

Directly using projected coordinates seems quite rare in atmospheric modeling. The only use case I am aware of is deriving numerical schemes on curvilinear grids. For example, instead of using lat/lon to describe a cubed-sphere panel, you can use local, orthogonal variables ranging from -1 to 1, to make the math much cleaner (ref: Section 3.1. Gnomonic projection in Finite-volume transport on various cubed-sphere grids). But this notation is never exposed to users -- cubed-sphere models like FV3 write output data with true lat/lon coordinates.

For conservative regridding, it is important to stay in the true lat/lon space. Most projections, except equal-area ones, do not preserve the cell area. Conformal projection preserves local shape but distorts the area ratio between cells. Doing conservative regridding in the projected space does not preserve the area-weighed sum on the true sphere.

use the 2D lons and lats to plot them on a cartopy map (which is already slightly wrong because it the requires to use the _infer_interval_breaks trick from xarray, which is only an approximation)

This plotting issue is because plt.pcolormesh's plotting mechanism uses cell boundaries (https://github.com/xgcm/xmitgcm/issues/15#issuecomment-314326358), not because of the use of lat/lon instead of projected coordinates. Even on orthogonal, regular grid, you still need to pass boundaries to plt.pcolormesh.

I think that many people in the python/xarray ecosystem would benefit from a unified handling of CRS in a tool which will hide all this complexity from them.

Instead of a "unified handling" between GIS and modeling, I'd prefer having a convenient converter between the two, while allowing each community to keep their existing culture.

djhoese commented 6 years ago

Instead of a "unified handling" between GIS and modeling, I'd prefer having a convenient converter between the two, while allowing each community to keep their existing culture.

I see these as the same thing and that is at least the initial goal for geoxarray. You have lon/lat coordinates, great use them. You have projection information following the CF standard, go for it. You have a PROJ.4 string and X/Y coordinates, you can use that too. So the data is in the format the user is used to, but libraries can use a unified tool for handling the conversion and handling the data.

rabernat commented 6 years ago

cc @adcroft, who may be interested in the discussion

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

fmaussion commented 5 years ago

@JiaweiZhuang , sorry for my late follow-up on this old thread. You wrote:

For conservative regridding, it is important to stay in the true lat/lon space. Most projections, except equal-area ones, do not preserve the cell area. Conformal projection preserves local shape but distorts the area ratio between cells. Doing conservative regridding in the projected space does not preserve the area-weighed sum on the true sphere.

Here I understand the position you are taking (that of a user wanting to preserve physical quantities on the true sphere), but in limited area models like WRF the conservation equations are solved on the grid in the projection space, aren't they?

rabernat commented 5 years ago

I am reopening this issue in light of recent discussions. Also tagging @jbednar, @philippjfr, @NCAS-CMS, @davidhassell, @bnlawrence.

I have recently discovered some new grid-related software:

cf-python (bitbucket repo) - "The python cf package implements the CF data model for the reading, writing and processing of data and its metadata." Very similar to xarray / iris.
cfdm (github repo)- "The Python cfdm package is a complete implementation of the CF data model." Still trying to learn how this one is related to cf-python.

It looks like the folks at @NCAS-CMS are building a completely new stack of python based software for analysis of climate data, including something dask-like called LAMA. I opened an issue to try to learn more.

At our last pangeo meeting, we discussed how the py-viz team at Anaconda might be able to help adapt datashader to geospatial regridding tasks. We are also eager for an update from @JiaweiZhuang on the status and roadmap for xESMF. Also curious to hear from @bekozi about what's going on within ESMF itself.

I would like to propose we organize a call to coordinate the various people working on different aspects of this problem. Please give a 👍 to this comment if you would be interested in joining such a call.

martindurant commented 5 years ago

Perhaps work mentioning healpix in this space, a way of subgridding non-cartesian (usually spherical) geometry data.

andersy005 commented 5 years ago

Ccing @matt-long as you may be interested in this discussion.

JiaweiZhuang commented 5 years ago

We are also eager for an update from @JiaweiZhuang on the status and roadmap for xESMF.

Thanks for following up on this! Some planned updates are mostly about API and computation, such as fixing the ugly offline weight files (JiaweiZhuang/xESMF#11) and incorporating dask and distributed. I don't have plans to change xESMF's grid representation, partly because I am not sure how to improve it :) I would be interested in hearing from other communities.

rabernat commented 5 years ago

Sorry for the delay scheduling this. I have created a doodle poll: https://doodle.com/poll/cpsh6a2n6zy4sxmc

I have listed options at 11AM and 12PM EST, which seems to be the only feasible times that can accommodate both US west coasters and Europeans. Sorry for the inconvenience.

If you're interested in joining, please fill out the poll. Hopefully we can find a time that works for everyone.

ChrisBarker-NOAA commented 5 years ago

Wow! lots of discussion here -- good stuff! A couple thoughts. I see these in the original list:

These are related, but with a different focus: The first two are about conventions for storing (and sharing) grids (unstructured and staggered) in netcdf. Which also implies a data model.

gridded is a Python package for working with grids that follow the data model of these these conventions (and theoretically others -- it does support rectangular grids as well.

They key idea is that it presents an API where users can think of the data as an abstract field of values, and the code takes care of the interpolation and all that for you. -- i.e. you can ask the question: what is the value of this variable at this lat-long-time point?, and you can ask that the same way for all supported grid types.

It would be nice for it to be more general -- but it does currently assume things relevant to oceanographic models -- horizontal coords are lat-lon, (thought it does not handle the round earth) vertical is structured, etc.

Anyway, in terms of this effort -- we probably need to be clear about the distinction between DATa formats / data models and code/algorithms.

Also -- general purpose tools like xarray are built around the concept of an array -- which represents logically rectangular grids well, but breaks down with unstructured grids. So it's a real challenge to integrate these two concepts.

Currently, gridded requires data to be numpy-like array object, so theoreticall you could plug in xarray and dask, but that has not been tested. I've also spoken with Stephan Hoyer about how xarray could better support unstructured grids -- he's open to providing some kind of hooks to do that, but no details have been worked out.

jhamman commented 5 years ago

cc @pelson and @DPeterK - would be great to get someone from the Iris team to join this call.

jbednar commented 5 years ago

Just for a bit of background on Anaconda's role, @philippjfr and I will be representing the various PyViz projects (GeoViews, hvPlot, Datashader). These are relevant to the discussion in a few different ways:

GeoViews provides a unified high-level interface for regularly gridded, irregularly gridded, and non-gridded data in different coordinate systems, and so it needs to accept data of various types (xarray DataArray, Iris cube, etc.), convert between them, sample or slice from them, and visualize them (reprojecting as needed). Ideally GeoViews will just be a convenient API for functionality implemented by lower-level libraries, currently including xESMF or Datashader for regridding and Cartopy/proj.4 for reprojection.
hvPlot attaches .hvplot methods directly onto various data libraries (Xarray, GeoPandas, Dask, Intake), returning GeoViews objects. hvPlot is the interface we expect most people to start with in practice for GeoViews (and for HoloViews, the non-geographic equivalent).
Datashader is a backend library for rasterizing anything you give it, turning it into a regular grid represented as an Xarray DataArray. Most of the Datashader examples are rasterizing into screen coordinates for visualization purposes, but it is actually a general-purpose gridding tool, taking point data, irregular trimesh grids, regular raster grids, polylines, and soon quadmeshes and polygons, returning a DataArray that can be used to compare and combine various types of data whether or not you are visualizing it. Datashader is not meant to be a geospatial-specific library; it doesn't handle reprojections and it does gridding and regridding only in a single shared coordinate space at any one time. Datashader also tends to accept grid and mesh data in the most convenient and efficient format for what it's doing, which is not necessarily an appropriate simulation format or interchange format.
Intake is a PyViz-adjacent Anaconda project for data cataloging that is also relevant, in that it covers various gridded data formats we will discuss and supports associating data files with hvPlot/GeoViews/Datashader-based default visualizations. (We're also working on an ncview-like explorer for data files, using Intake and hvPlot.)

Anaconda is funded by Pangeo and we are considering whether we can help solve gridding-related issues by extending these tools or by incorporating interfaces to other tools into higher-level APIs like GeoViews. E.g. if we can figure out a way to express the coordinate transformations involved in regridding in a generic way, we could extend Datashader to cover that case. But note that Datashader will not wrap around any domain-specific lower-level libraries; it implements all algorithms from scratch using Dask and Numba for performance and scalability, and because it has extensive usage outside of geoscience it cannot be tied to the geo stack (GDAL, proj.4, etc.). We're definitely interested in brainstorming how we could make it more useful for geoscience-related regridding, though!

pbranson commented 5 years ago

I won't be able to contribute much but would really like to be able to listen to the discussion. Any chance it could be recorded?

From: James A. Bednar notifications@github.com Sent: Thursday, February 28, 2019 12:42:09 AM To: pangeo-data/pangeo Cc: Paul Branson; Mention Subject: Re: [pangeo-data/pangeo] let's enumerate all the ways to represent a "grid" in python (#356)

Just for a bit of background on Anaconda's role, @philippjfrhttps://github.com/philippjfr and I will be representing the various PyVizhttp://pyviz.org projects (GeoViews, hvPlot, Datashader). These are relevant to the discussion in a few different ways:

GeoViewshttp://geoviews.org provides a unified high-level interface for regularly gridded, irregularly gridded, and non-gridded data in different coordinate systems, and so it needs to accept data of various types (xarray DataArray, Iris cube, etc.), convert between them, sample or slice from them, and visualize them (reprojecting as needed). Ideally GeoViews will just be a convenient API for functionality implemented by lower-level libraries, currently including xESMF or Datashader for regridding and Cartopy/proj.4 for reprojection.
hvPlothttp://hvplot.pyviz.org attaches .hvplot methods directly onto various data libraries (Xarray, GeoPandas, Dask, Intake), returning GeoViews objects. hvPlot is the interface we expect most people to start with in practice for GeoViews (and for HoloViews, the non-geographic equivalent).
Datashaderhttp://datashader.org is a backend library for rasterizing anything you give it, turning it into a regular grid represented as an Xarrayhttp://xarray.pydata.org DataArray. Most of the Datashader examples are rasterizing into screen coordinates for visualization purposes, but it is actually a general-purpose gridding tool, taking point data, irregular trimesh grids, regular raster grids, polylines, and soon quadmeshes and polygons, returning a DataArray that can be used to compare and combine various types of data whether or not you are visualizing it. Datashader is not meant to be a geospatial-specific library; it doesn't handle reprojections and it does gridding and regridding only in a single shared coordinate space at any one time. Datashader also tends to accept grid and mesh data in the most convenient and efficient format for what it's doing, which is not necessarily an appropriate simulation format or interchange format.
Intakehttps://intake.readthedocs.io is a PyViz-adjacent Anaconda project for data cataloging that is also relevant, in that it covers various gridded data formats we will discuss and supports associating data files with hvPlot/GeoViews/Datashader-based default visualizations. (We're also working on an ncview-like explorer for data files, using Intake and hvPlot.)

Anaconda is funded by Pangeo and we are considering whether we can help solve gridding-related issues by extending these tools or by incorporating interfaces to other tools into higher-level APIs like GeoViews. E.g. if we can figure out a way to express the coordinate transformations involved in regridding in a generic way, we could extend Datashader to cover that case. But note that Datashader will not wrap around any domain-specific lower-level libraries; it implements all algorithms from scratch using Daskhttp://dask.pydata.org and Numbahttp://numba.pydata.org for performance and scalability, and because it has extensive usage outside of geoscience it cannot be tied to the geo stack (GDAL, proj.4, etc.). We're definitely interested in brainstorming how we could make it more useful for geoscience-related regridding, though!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/pangeo-data/pangeo/issues/356#issuecomment-467865832, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AM3bQJTYBo0WruOCzxBctYXwjz7rbT82ks5vRosxgaJpZM4V8xGV.

JiaweiZhuang commented 5 years ago

@jbednar Thanks for the detailed updates. I use geoviews quite heavily; if any potential features in xESMF can help geoviews development please tell me know.

I am aware of geoview's resampling example. xESMF's annoying clean_weight_files() should be fixed by JiaweiZhuang/xESMF#11. Unstructured grid will be a headache though; the way datashader represents unstructured meshes/polygons might be a useful reference for implementing an unstructured grid data model...

ChrisBarker-NOAA commented 5 years ago

Honestly, I haven't looked deeply into most of these, but an approach we've tried to take with gridded is bottom-up, that is:

Starting with a data model(s)
Adding low-level algorithms
- cell location
- interpolation

Then building higher-level functionality on top of those:

particle tracking
visualization
regridding etc.

Where we have struggled with "uptake" is that we have only built the higher-level functionality that we actually need. Whereas most users come with a problem to solve (a particular analysis, regridding, ...), and if gridded can't already solve it for them right away, then they go looking for another package that does.

I may be wrong, but many of these other packages seem to take a more top-down approach: starting with a problem to solve (regridding) and then solving it. And maybe even a specific enough problem (a particular model, grid type, etc) that it really isn't applicable to other problems (at least not the documented / exposed API). And they may be part of a seemingly heavy-weight framework.

So this is a great effort to pursue -- we are all re-implementing many of the same things, but in different contexts / frameworks.

Ideally, what I would love to see is all of us building on top of a low-level framework -- this could be similar to gridded, or maybe look quite different, but the goal of a bottom up toolbox of code would be there.

NOTE: Another key issue with gridded uptake is that it re-implements part of xarray (while not re-implementing most of its functionality). But a lot of folks have built xarray into their workflow, and do not (reasonably enough) want to abandon all that.

The reason gridded is not built on xarray is twofold:

1) when we started (early pyugrid development), xarray was brand new (and called xray), not ready for operational use, and Stephan was quite focused on his core use-cases -- "xray will not use pyugrid". So we headed out on our own.

2) xarray really is all about arrays -- that is, logically rectangular chunks of data, and I'm still not sure how to mingle two quite different data models. I think it could probably be done, but we haven't had the bandwidth to pursue it.

Stephan has expressed interest in extending the xarray API to accommodate a more abstract model for interpolation, etc -- it would be great to pursue that.

rabernat commented 5 years ago

@ChrisBarker-NOAA - thanks for all your detailed comments!

Ideally, what I would love to see is all of us building on top of a low-level framework

From my (admittedly biased) point of view, this low-level framework is xarray. Datashader and xESMF already consume and produce xarray data structures.

2. xarray really is all about arrays -- that is, logically rectangular chunks of data, and I'm still not sure how to mingle two quite different data models. I think it could probably be done, but we haven't had the bandwidth to pursue it.

We've had this conversation before, but I'll state my view on this again. Xarray is all about data the conforms to the common data model, i.e. data that can be stored in a netCDF files. Unstructured conforms to this data model and can be stored in netCDF file. Ipso facto, xarray can work with such data. From xarray's point of view, datasets from unstructured model output netCDF files would have dimensions time, depth, cell_index (3D), rather than from a regular lat lon grid time, depth, lat, lon (4D).

The challenges which arise are primarily:

visualizing the data in geographic coordinates
aggregating over geographic coordinates (e.g. zonal mean)
regridding
performing operations involving neighboring gridpoints (e.g. finite differences)

It is crucial to recognize that challenges 1 and 2 also arise with curvilinear orthogonal grids, which many of us are using every day with xarray (see this binder for examples). But there is lots of room for improvement, which is one of the aims of this discussion.

For 3, ESGF can handle regridding of unstructured data. But this capability is not exposed in xESMF.

As for 4, that's why I created xgcm, which is currently aimed at structured grids. But the xgcm approach could easily be extended to unstructured grids via the graph adjacency matrix.

My overall points are that

unstructured data shares many of the same problems with regular structured curvilinear grids
xarray is a reasonable foundation for any data that can be stored in netCDF files
to optimize the sustainability of all our work, we should try to specialize as late as possible, pushing as much common functionality as we can into the lower layers of the stack (e.g. numpy, xarray), which have more users and maintainers. We should only build from scratch the "last mile" that is truly unique to our specific sub-problem.

dopplershift commented 5 years ago

I agree that xarray is a great framework to start with. But one fundamental challenge xarray needs to deal with before it can be a universal solution for both structured and unstructured grids is that it needs to cope with 2D coordinates—i.e. lat/lon are allowed to be functions of multiple dimensions but still serve as fundamental coordinates for other variables. This is something supported by the CDM (and output by THREDDS on a regular basis) and causes me no shortage of problems when applying xarray in a generic fashion to our datasets.

rabernat commented 5 years ago

I agree that xarray is a great framework to start with. But one fundamental challenge xarray needs to deal with before it can be a universal solution for both structured and unstructured grids is that it needs to cope with 2D coordinates—i.e. lat/lon are allowed to be functions of multiple dimensions but still serve as fundamental coordinates for other variables. This is something supported by the CDM (and output by THREDDS on a regular basis) and causes me no shortage of problems when applying xarray in a generic fashion to our datasets.

I agree with you 100%. That's why I started a pull request on xarray to make this exact fix. I have not been able to finish the PR however. Since many people agree that this is a high priority, perhaps we can coordinate and find some developer time to complete that work.

ChrisBarker-NOAA commented 5 years ago

Ideally, what I would love to see is all of us building on top of a low-level framework

👍

From my (admittedly biased) point of view, this low-level framework is xarray. Datashader and xESMF already consume and produce xarray data structures

Sure, and if I were starting now, I’d do that. Currently the low level data structures gridded is built on is numpy-like arrays. Tested with numpy arrays and netCDF4 variables. You could probably plug xarray arrays in and have it “just work” — though there may be some fancy indexing differences.

Xarray is all about data the conforms to the common data model https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/,

Which is to say rectangular arrays.

Unstructured conforms to this data model and can be stored in netCDF files. Ipso facto, xarray can work with such data.

Sure — just like numpy can.

From xarray's point of view, datasets from unstructured model output netCDF files would have dimensions time, depth, cell_index (3D),

Sure - but you don’t “cell_index” is a worthless “dimension” if you don’t understand the structure of the grid.

It’s not that Xarray can’t be used to provide storage (and file I/O etc), it’s that the higher level analysis study, where you need to understand the data more isn’t there.

I would love someone to plug xarray into gridded and see how it works — it may buy you some useful functionality out of the box.

visualizing the data in geographic coordinates
aggregating over geographic coordinates (e.g. zonal mean)
regridding
performing operations involving neighboring gridpoints (e.g. finite differences)

Exactly - xarray can only help you do these things for logically rectangular data — and it doesn’t (yet) have the hooks to do it for other structures.

It is crucial to recognize that challenges 1 and 2 also arise with curvilinear orthogonal grids, which many of us are using every day with xarray

Right — which is what I mean be getting some functionality out of the box. But logically rectangular grids are a lot easier to deal with with the array data model.

A gridded Variable holds:

The data itself ( as a numpy array like object)

Metadata (attributes)

A Grid object Optionally a Time object Optionally a Depth object

Xarray holds the data and attributes, and IIUC, some info about shared coordinates.

The idea is that the Variable abstracts the horizontal, vertical, and time coordinates for you. The “cell” information is there if you want it, but you don’t need to know about it to do a lot of analysis.

So I think gridded (or a gridded-like system) could be pretty easily built off xarray arrays — but I’m still not that clear on the details—would a gridded Variable be a subclass of xarray? Would it use an xarray as the data container only?

Anyway, the goal is an abstraction of the entire grid concept — I think that would be useful for a lot of applications.

Maybe we should look at the xgcm and gridded APIs side by side and see what commonality is there. xgcm didn’t exist when we started gridded, and we started with unstructured grids, but when I looked at what was available at the time, everything had logical rectangularness built in to the API.

unstructured data shares many of the same problems with regular structured curvilinear grids

Yup

xarray is a reasonable foundation for any data that can be stored in netCDF files

Sure

to optimize the sustainability of all our work, we should try to specialize as late as possible,

Also sure.

pushing as much common functionality as we can into the lower layers of the stack

Absolutely.

-CHB

ChrisBarker-NOAA commented 5 years ago

On Wed, Feb 27, 2019 at 10:43 AM Ryan Abernathey notifications@github.com wrote:

I agree that xarray is a great framework to start with. But one fundamental challenge xarray needs to deal with before it can be a universal solution for both structured and unstructured grids is that it needs to cope with 2D coordinates—i.e. lat/lon are allowed to be functions of multiple dimensions but still serve as fundamental coordinates for other variables.

I agree with you 100%. That's why I started a pull request on xarray https://github.com/pydata/xarray/pull/2405 to make this exact fix. I have not been able to finish the PR however. Since many people agree that this is a high priority, perhaps we can coordinate and find some developer time to complete that work.

I wonder if this really is an xarray issue to deal with -- or if it belongs at a higher layer -- it is still building in assumptions of logical rectangularness.

This is what Stephan and I were talking about at SciPy last year -- more of an abstraction of coordinates, as they don't necessarily map directly to indices -- in either 1 or 2D.

Even if it does belong in xarray, we'll still need a more abstract concept for unstructured grids.

-CHB

--

Christopher Barker, Ph.D. Oceanographer

Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

rabernat commented 5 years ago

I have tentatively picked a final option from the doodle poll:

Mar 4 (MON): 11:00 AM - 12:00 PM EST

Sincere apologies to those who can't make it. I will circulate an agenda soon, and we will post notes from our discussion.

Given the high number of attendees (17), we will use Zoom instead of appear.in.

Ryan Abernathey is inviting you to a scheduled Zoom meeting.

Topic: Python Grid Discussion Time: Mar 4, 2019 11:00 AM Eastern Time (US and Canada)

Join Zoom Meeting https://columbiauniversity.zoom.us/j/640510854

One tap mobile +16468769923,,640510854# US (New York) +16699006833,,640510854# US (San Jose)

Dial by your location +1 646 876 9923 US (New York) +1 669 900 6833 US (San Jose) Meeting ID: 640 510 854 Find your local number: https://zoom.us/u/abcuAxbTMF

djhoese commented 5 years ago

Really interesting design started here. I'm excited to see how this discussion could evolve in to a full geoxarray package or similar. I've been wanting to add an xarray accessor to pyresample for a while now too and this may be the push I need to do it (pyresample needs new interfaces in general).

ChrisBarker-NOAA commented 5 years ago

@djhoese: thanks for the link. Seeing that, this draws my attention:

"Interface directly with xarray, utilizing coordinates defined on DataArray objects."

So we are dead in the water with unstructured grids.

Which is why we need to figure out how to either make xarray aware of a more abstract concept of coordinates, or keep xarray lower level building block.

This looks a bit like the top-down approach I was referring too earlier -- "how to make a regridding lib" . What I'd like to see is using these use cases to define what we need in lower-level tools. For example reading that doc, I see:

(I'm not entirely sure what "remap" means, but guessing):

"Provide a remap-to interface (da.remap.remap_to([('lat', 0.5), ('lon', 0.75)], how='nearest'))"

So that tells me that you need a object underneath this that "understands" 'lat' and 'lon' directly -- rather than indexes -- and with an abstraction that provides the same interface regardless of grid type.

Also -- we are going to need interpolation routines specific to grid type that provide the same interface.

"or just wrap existing remapping lib"

that may be gridded -- plug xarray arrays in there instead of numpy arrays or netCDF4 variables, and you've got a good start :-)

Let's talk more on Monday -- I may be missing something here.

By the way, I've been pushing gridded here -- it's my baby -- but I want to be clear -- the API and code of gridded is less than ideal, and may not suite anyone else's use case -- but what I'm really trying to push is the level of abstraction -- not any particular API or code.

Oh, and another lib to put on the list:

https://github.com/NOAA-ORR-ERD/cell_tree2d

cell_tree2d is a high performance cell location code for 2-d unstructured and curvilinear grids.

gridded uses it for cell location, and we kept it a separate lib, so others could use it by itself. I suppose it should go into scipy.spatial, but it's maybe a bit specialized for that.

We're pretty sure it's faster and/or has a lower memory footprint than any other cell location code out there.

pangeo-data / pangeo

let's enumerate all the ways to represent a "grid" in python #356