Open JiaweiZhuang opened 7 years ago
See also #1079 and #1079 (comment)
Thanks! The idea of NDIntervalIndex
mentioned at pandas-dev/pandas#7640 comment seems powerful but too complicated to implement? Could there be a simpler way to hook the boundary attribute to DataArray?
I don't think we need a full NDIntervalIndex
unless we also want indexing, which is nice but not essential for just storing data. We do need a way to represent interval data in 1D arrays, though.
Probably the simplest option is to use structured dtypes, which should already work with the existing version of xarray, e.g.,
import numpy as np
import xarray
interval_dtype = np.dtype([('start', float), ('stop', float)])
coords = {'x': 0.5 + np.arange(3), 'x_bounds': ('x', np.array([(0, 1), (1, 2), (2, 3)], dtype=interval_dtype))}
da = xarray.DataArray(range(3), coords=coords, dims='x')
>>> da
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
* x (x) float64 0.5 1.5 2.5
x_bounds (x) [('start', '<f8'), ('stop', '<f8')] (0.0, 1.0) (1.0, 2.0) ...
>>> da.x_bounds
<xarray.DataArray 'x_bounds' (x: 3)>
array([(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)],
dtype=[('start', '<f8'), ('stop', '<f8')])
Coordinates:
* x (x) float64 0.5 1.5 2.5
x_bounds (x) [('start', '<f8'), ('stop', '<f8')] (0.0, 1.0) (1.0, 2.0) ...
>>> da.x_bounds.data['start'], da.x_bounds.data['stop']
(array([ 0., 1., 2.]), array([ 1., 2., 3.]))
We could probably do a few things to make these easier to use:
da.x_bounds['start']
to return da.x_bounds.data['start']
wrapped in an xarray.DataArray
.Conceptually, this is pretty similar to a MultiIndex (see https://github.com/pydata/xarray/pull/1426 for discussion).
Probably the simplest option is to use structured dtypes, which should already work with the existing version of xarray, e.g.,
Thanks, that's a nice trick! Supporting da.x_bounds['start'] will definitely be helpful!
However, I am still concerned about 2D boundaries. Using the structured data type, 2D bounds will be an array of size (Nx,Ny,4) instead of (Nx+1,Ny+1). Although this matches the CF convention, it takes 4x memory and needs to be converted back to (Nx+1,Ny+1) for pcolormesh()
. Not a big problem though. I will be happy to go this way if (Nx+1,Ny+1)-sized bounds cannot be implemented.
These are precisely the sort of issues we are trying to solve with xgcm. I am about to make a big new release. Using the xgcm concept of an Axis
object (not yet in the online docs until the new release), it should be pretty easy to add this sort of plotting support in an arbitrary number of dimensions.
cc @adcroft, who expressed interest in this topic.
I'm just pinging this issue again to keep it fresh.
I am becoming more and more convinced that we need to allow for cell bounds in xarray's data model. Contrary to my comments above, I no longer think this is a problem to be solved with xgcm or some outside package.
CF conventions, which we partially support in other parts of xarray, have a clearly defined concept of cell geometry. When present, such coordinates could decoded and used for indexing and plotting.
Currently we distinguish between "dimension coordinates," which are converted to indexes, and "non-dimension coordinates." What if we added a new type of coordinate called "cell coordinates"? We could accomodate either (N+1) sized coordinates for quad-mesh geometries of (N,M) sized coordinates for unstructured meshes.
What is a concrete first step we could take towards this goal? Try to work out a design document?
Currently we distinguish between "dimension coordinates," which are converted to indexes, and "non-dimension coordinates."
The long term plan in https://github.com/pydata/xarray/issues/1603 ("Explicit indexes") is to eliminate this distinction -- we'll simply have variables, which can be in the form of data variables or coordinates, and indexes, for look-up along any coordinate.
What if we added a new type of coordinate called "cell coordinates"? We could accomodate either (N+1) sized coordinates for quad-mesh geometries or (N,M) sized coordinates for unstructured meshes.
I understand (N+1) sized coordinates for quad-mesh geometries, where N is the number of physical dimensions.
I'm not sure I understand (N,M) sized coordinates for unstructured meshes -- what is M here? The total number of cells? Some arbitrary constant indicating the maximum number of sides for a single cell?
I do.
Logically I see two approaches here:
(1) feels like the safe approach (from xarray's perpsective). Maybe structured dtypes too annoying to use on a routine basis, but there also are other use cases for them that would benefit from some attention. I worry that solutions in the style of (2) would bake domain specific logic deep into xarray's data model and make the whole library more complex, though I do appreciate that cell bounds are a pretty ubiquitous concept for modeling physical phenomena.
One way of solving (2) would be to allow something like "isolated" or "non-aligned" dimensions, which aren't shared across a Dataset/DataArray and are allowed to deviate on a per-variable basis. Dataset.dims
would be a dynamic (rather than computed) part of xarray's data model, and dimensions not found in dims
would not be required to be aligned/consistent between variables. This is intriguing but is also a much bigger change:
dims=None
), dims
would get filled in from all the variables in a Dataset. But the aligned dimensions in dims
could also be set explicitly.dims
, you can't index or align along it and it's allowed to vary between variables.I'm not sure I understand (N,M) sized coordinates for unstructured meshes -- what is M here? The total number of cells? Some arbitrary constant indicating the maximum number of sides for a single cell?
N is the number of cells. M is the number of points required to specify the cell vertices, e.g. 4 for 2D quadmesh, 3 for 2D trimesh, 8 for 3D quadmesh, etc.
Regarding your options 1 or 2, I guess I'm agnostic as to how it is implemented. I recognize 2 introduces lots of complications. What matters is how it will interact the indexes, i.e. can we easily select data based on cell bounds?
I will have to take some time to think about what you wrote, as it is hard for my brain... 🙃
What matters is how it will interact the indexes, i.e. can we easily select data based on cell bounds?
Either way, we will need to write our own index classes for this (but this is totally doable). This will either be something xarray specific or possibly based on pandas.Index
.
pandas.IntervalIndex
is similar, but is much more complex because it handles overlapping cells. We would prefer a CellIndex
that does not allow for overlap.
Not sure where this stands but another advantage might be the ability to call xr.open_dataarray
on netcdf files containing individual variables plus coordinate bounds (data from CMIP5/6 are commonly stored this way).
Has there been any progress on this?
Recently I experimented with an (incomplete) duck-array prototype, wrapping an array of length N+1 in a duck array of length N (such that you can use it as a coordinate for a DataArray
of length/shape N). It mostly worked (even though there may be some issues when you want to use it as an xarray index).
See https://github.com/scipp/scippx/blob/main/src/scippx/bin_edge_array.py (there is a bunch of unrelated stuff in the repo, you can mostly ignore that).
xref a possible solution explained here: https://github.com/pydata/xarray/issues/8005#issuecomment-1691619253
Basically, it is very similar than @shoyer's https://github.com/pydata/xarray/issues/1475#issuecomment-314844258. In addition, the bounds coordinate would be indexed (and would share its Xarray index with the point-value coordinate).
The bounds coordinate would wrap a pd.IntervalArray
(or pd.IntervalIndex
?), but we could also have our own, simpler implementation (no bounds overlap) as suggested in https://github.com/pydata/xarray/issues/1475#issuecomment-457951491.
Cell boundaries can be either N+1 sized arrays as suggested by xgcm/xmitgcm#15, or (N,2) sized arrays as suggested by the CF convention. However, a DataArray cannot hold both kinds of coordinate variables because they contain a new dimension.
If you try to assign a new coordinate to a DataArray by
dr.assign_coords()
, you will getValueError: cannot add coordinates with new dimensions to a DataArray
On the other hand, if your DataSet contains cell boundary variables (for example, #667), the bounds will be dropped when you extract a single variable into a DataArray.
Having cell bounds available in a DataArray is important for a couple of applications:
Pass cell bounds to DataArray's plotting methods (xgcm/xmitgcm#15). I am aware of the discussion about inferring boundaries (#781). However, for the Cube-Sphere grid or the Lat-Lon-Cap grid (reference) which have tiles covering the poles, I have to explicitly pass cell bounds to the original plt.pcolormesh() to get a good-looking plot. (see this comment for details)
For conservative (i.e. area-weighted) regridding (mentioned in #486). Cell centers are enough for bilinear interpolation or other simple resamping, but for any Finite-Volume meshes, knowing the boundaries is crucial if you want to conserve the total amount of mass or flux.
Plotting or regridding will work fine if you pass cell bounds as an additional argument to a wrapper function. However, having a single DataArray object containing boundary information seems like a more elegant solution. Is it possible to let DataArray accept N+1 sized coordinate variables, and be able to inherit them from the parent DataSet? If that's too drastic, is it possible to write an accessor to extend DataArray's capability? Say, a "bound" accessor for a new attribute
ds.bnd['lat_b']
, which can be kept when a DataArray gets extracted (ds['data_var'].bnd['lat_b']
)? Does this make sense?