pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.08k forks source link

Support Two-Dimensional Coordinate Variables #605

Closed jhamman closed 8 years ago

jhamman commented 9 years ago

The CF Conventions supports the notion of a 2d coordinate variable in the case of irregularly spaced data. An example of this sort of dataset is below. The CF Convention is to add a "coordinates" attribute with a string describing the 2d coordinates.

dimensions:
    xc = 128 ;
    yc = 64 ;
    lev = 18 ;
variables:
    float T(lev,yc,xc) ;
        T:long_name = "temperature" ;
        T:units = "K" ;
        T:coordinates = "lon lat" ;
    float xc(xc) ;
        xc:axis = "X" ;
        xc:long_name = "x-coordinate in Cartesian system" ;
        xc:units = "m" ;
    float yc(yc) ;
        yc:axis = "Y" ;
        yc:long_name = "y-coordinate in Cartesian system" ;
        yc:units = "m" ;
    float lev(lev) ;
        lev:long_name = "pressure level" ;
        lev:units = "hPa" ;
    float lon(yc,xc) ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
    float lat(yc,xc) ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;

I'd like to discuss how we could support this in xray. There motivating application for this is in plotting operations but it may also have application in other grouping and remapping operations (e.g. #324, #475, #486).

One option would just to honor the "coordinates" attr in plotting and use the specified coordinates as the x/y values.

ref: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#idp5559280

shoyer commented 9 years ago

We do support multi-dimensional coordinates already, and even read the coordinates attributes. However, we don't do anything with them, beyond keeping them around: http://xray.readthedocs.org/en/stable/data-structures.html#coordinates

It's definitely worth seeing if we can make ds.T.plot.contourf('lat', 'lon') work, though.

jhamman commented 9 years ago

Okay, somehow I missed that we're already doing this to some extent. I just ran across the code in conventions.py and, yes, we don't do anything with them.

@clarkfitzg - do you have any thoughts on how this could work?

shoyer commented 9 years ago

We could probably modify our contour/contourf wrapper to accept 2D arrays for X/Y. The matplotlib functions already handle all the necessary logic.

clarkfitzg commented 9 years ago

Are the coordinates not dimensions of those DataArrays? Would it be better to make them dimensions when they get read in?

shoyer commented 9 years ago

These latitude and longitude coordinates can't be dimensions because they need to be 2D. A good example would be simulation data on some sort of projected grid (like in @jhamman's example above).

jhamman commented 9 years ago

@shoyer - the idea of the coordinates convention is that we all ready know the x/y coordinate names. So we shouldn't need to supply them like this: ds.T.plot.contourf('lat', 'lon'). I'm thinking it would be better to store the coord names coming from https://github.com/xray/xray/blob/master/xray/conventions.py#L882 in each DataArray. Maybe a DataArray.coord_keys = ('xc', 'yc') attribute is in order? Currently, we aren't really using the coordinates attribute in its intended way.

In the plotting logic, we should just search to see if coord_keys is not None before passing the coordinates to the plotting method.

shoyer commented 9 years ago

My hesitation with adding coord_keys is two-fold:

  1. This would make the xray data model more complex. I would really prefer to avoid this.
  2. It's not clear to me that the order of names in the coordinates attribute is meaningful according to CF conventions: "The value of the coordinates attribute is a blank separated list of the names of auxiliary coordinate variables. There is no restriction on the order in which the auxiliary coordinate variables appear in the coordinates attribute string."

A better way to handle this might be to write a xray-cf packages (see #461 for another use case), which registers the .cf namespace on Dataset/DataArray objects. Then we could have ds.cf.plot, which automatically plots according to CF conventions, and could even do things like automatically plot with cartopy if relevant. This means less domain specific details in xray, and could be a good model for other similar extension projects (e.g., xgcm).

jhamman commented 9 years ago

@shoyer - For the most part, I'm in agreement here with you. A few comments (reference ncdump of a irregular grid netCDF below).

  1. Yes, simpler data model makes sense. However, I think we are missing a few pieces of the cf coordinates data model in xray. ni and nj are dimensions, not coordinates but xray turns them into coordinates. I know why we do this in practice but, for example, frac(nj, ni) only uses nj and ni, yc and xc are it's coordinates. So, you see, there is a bit of a disconnect there. Another way to think about xray indexing is that isel should work relative to the dimension index, and sel should work on the coordinates. I know I'm out on a limb here...
  2. After reading the conventions more carefully, I totally agree with your assessment and concern about the coordinates attribute. Not restricting the order of the coordinates is a big deficiency here.
  3. The main thing I don't like about how we handle the coordinate attribute right now is that we use it to determine the coordinate names, and throw the attribute out. I seem to remember bringing this up in the past with our discussions of the calendar attribute and still don't like that we remove attributes that we decode. Especially this once since we don't do much with it.

I have two related PRs from today just to show which way we could go with all this: #606, #608.

ncdump of domain file

(xray_dev)Joes-iMac$ ncdump -h domain.lnd.wr50a_ar9v4.100920.nc 
netcdf domain.lnd.wr50a_ar9v4.100920 {
dimensions:
    n = 56375 ;
    ni = 275 ;
    nj = 205 ;
    nv = 4 ;
variables:
    double xc(nj, ni) ;
        xc:long_name = "longitude of grid cell center" ;
        xc:units = "degrees_east" ;
        xc:bounds = "xv" ;
    double yc(nj, ni) ;
        yc:long_name = "latitude of grid cell center" ;
        yc:units = "degrees_north" ;
        yc:bounds = "yv" ;
    double xv(nj, ni, nv) ;
        xv:long_name = "longitude of grid cell verticies" ;
        xv:units = "degrees_east" ;
    double yv(nj, ni, nv) ;
        yv:long_name = "latitude of grid cell verticies" ;
        yv:units = "degrees_north" ;
    int mask(nj, ni) ;
        mask:long_name = "domain mask" ;
        mask:note = "unitless" ;
        mask:coordinates = "xc yc" ;
        mask:comment = "0 value indicates cell is not active" ;
    double area(nj, ni) ;
        area:long_name = "area of grid cell in radians squared" ;
        area:coordinates = "xc yc" ;
        area:units = "radian2" ;
    double frac(nj, ni) ;
        frac:long_name = "fraction of grid cell that is active" ;
        frac:coordinates = "xc yc" ;
        frac:note = "unitless" ;
        frac:filter1 = "error if frac> 1.0+eps or frac < 0.0-eps; eps = 0.1000000E-10" ;
        frac:filter2 = "limit frac to [fminval,fmaxval]; fminval= 0.1000000E-02 fmaxval=  1.000000" ;

// global attributes:
        :title = "CCSM domain data:" ;
        :Conventions = "CF-1.0" ;

xray representation of domain file

In [1]: import xray

In [2]: xray.open_dataset('domain.lnd.wr50a_ar9v4.100920.nc')
Out[2]: 
<xray.Dataset>
Dimensions:  (ni: 275, nj: 205, nv: 4)
Coordinates:
    xc       (nj, ni) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 ...
    yc       (nj, ni) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
  * ni       (ni) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * nj       (nj) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * nv       (nv) int64 0 1 2 3
Data variables:
    xv       (nj, ni, nv) float64 189.3 189.4 189.2 189.0 189.4 189.6 189.3 ...
    yv       (nj, ni, nv) float64 16.33 16.58 16.74 16.49 16.58 16.82 16.98 ...
    mask     (nj, ni) int32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
    area     (nj, ni) float64 2.579e-05 2.594e-05 2.611e-05 2.627e-05 ...
    frac     (nj, ni) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
Attributes:
    title: CCSM domain data:
    Conventions: CF-1.0
shoyer commented 9 years ago

The main thing I don't like about how we handle the coordinate attribute right now is that we use it to determine the coordinate names, and throw the attribute out. I seem to remember bringing this up in the past with our discussions of the calendar attribute and still don't like that we remove attributes that we decode. Especially this once since we don't do much with it.

In the case of calendars, the information ends up on the encoding attribute. I think you're right though that we currently throw away variable specific coordinates information -- that should probably also be preserved on encoding.

There are two reasons why we do this sort of thing:

  1. to make it unambiguously clear that we've "decoded" this information into xray's data model.
  2. to ensure that the metadata can't get out of sync by being changed in one place but not another. For example, what would it mean if we unset xc as a coordinate on the dataset, but it remained in the coordinates attribute on mask? Or if we deleted xc entirely?

These sort of ambiguous scenarios are the part that worry me. In particular, I worry that it leads us down a path of maintaining arbitrary metadata through operations.

jhamman commented 9 years ago

In the case of calendars, the information ends up on the encoding attribute.

You are right, I had forgotten about this. My original statement has been amended accordingly.

I think we're square on my points 2 and 3. Does my first point make sense?

I'll open an issue regarding the coordinates encoding attribute.

jhamman commented 9 years ago

closed via #608 and #610.