hypertidy / ncmeta

Tidy NetCDF metadata
https://hypertidy.github.io/ncmeta/
11 stars 5 forks source link

do Coordinates properly #9

Closed mdsumner closed 5 years ago

mdsumner commented 5 years ago

http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#coordinate-system

Also see discussion and examples here: https://github.com/r-spatial/stars/issues/62#issuecomment-426991912

mdsumner commented 5 years ago

We could have a new function nc_coords that identifies these relationships, similar to how nc_grids pulls out this implicit relation between variables on the shared spaces - it would identify links between pairs of grids.

dblodgett-usgs commented 5 years ago

What would you imagine the response from such a function would be? I need to build a function that will find the coordinate variables or auxiliary coordinate variables for a given data variable. -- not rocket science, but I'm not really sure what would be generally useful.

Just doodling,

nc_coords(nc, variable) { 

# logic logic logic

return(list(dim_name = coordvar_name, dim_name = coordvar_name, ... ))
}

Considerations: 1) dimension order -- this is really important and can be a big stumbling block. I've developed a little trick to work with variable dimension order that basically enforces a package-standard dimension order. In RNetCDF parlance it looks something like:

dimid_order <- match(nc_var_info$dimids,
                     c(x_var_info$dimids,
                       y_var_info$dimids,
                       t_var_info$dimids))

data <- var.get.nc(nc, variable_name,
                   start = c(min(x_inds),
                             min(y_inds), i)[dimid_order],
                   count = c(length(x_inds),
                             length(y_inds), 1)[dimid_order])

2) Some NetCDF files will have both 2d lat/lon (auxiliary) coordinate variables and 1d x/y coordinate variables (see daymet for example). My tendancy is to attempt to treat a grid as a raster and ignore the 2d lat/lon if possible, but there are some datasets that ONLY include the 2d lat/lon. (see stageiv precip for example)

Let me know what you think on these and I'll hack together a PR that gets this info pulled together for a few test datasets.

mdsumner commented 5 years ago

I don't generally know how to find the 2d coordinates of a 3d or 4d var, in practical terms I set a default and let the user override. This works really well in angstroms and I recently added ability to plot with "coords" to quadmesh::mesh_plot. Logic to find the coords is what I'm after. I have a transpose argument in angstroms too, not sure if it can be more automated.

mdsumner commented 5 years ago

Oh, I'd actually never noticed this little "coordinates" attribute before, it's all there (this from a ROMs model I have):

  float temp[xi_rho,eta_rho,s_rho,ocean_time]   
            long_name: potential temperature
            units: Celsius
            time: ocean_time
            coordinates: lon_rho lat_rho s_rho ocean_time
            field: temperature, scalar, series
            _FillValue: 9.99999993381581e+36
        float salt[xi_rho,eta_rho,s_rho,ocean_time]   
            long_name: salinity
            time: ocean_time
            coordinates: lon_rho lat_rho s_rho ocean_time
            field: salinity, scalar, series
            _FillValue: 9.99999993381581e+36

I'll have a bit more of a look and come back.

dblodgett-usgs commented 5 years ago

Yeah. Trick is, it's there when it is but many files don't include it.

Usually it's because the old COARDS specification said you could name a variable the same as a dimension to declare that it is a "coordinate variable". Looks like what you've got for nc_grids is right in line with what NetCDF Java has for their GRID data type -- I'm pretty sure xarray basically adopted this kind of model as well. Adding some smarts to look at the dimension name / variable name then the coordinates attributes of variables to figure out what all the nc_grids are, makes some sense.

Yet another side note here is that projections (when present) are always going to be associated with data variables and not the coordinate variables. So basically any code will need to start from data variables and spider out from there. Guessing you already know this, but it's worth mentioning.

mdsumner commented 5 years ago

All helpful! Fact is I have to think very hard about these topics, and always lean on using many examples to make sure I get it right. That's really interesting about projections.

Realizing that a grid is like a virtual table was a huge insight for me, the usual ncdump -h summary does not make that clear. GDAL has a lot this logic too, as must Panoply and Ferret - I don't think it exists in a clean module, but maybe xarray is something to look at. I don't have strong opinions about what should be done, yet.

dblodgett-usgs commented 5 years ago

I started playing around with this today. Opening a PR for review only -- I feel like I might be working around some of the stuff you've been doing otherwise. Still wrapping my head around the abstraction you've put in place here.

I realized I was over complicating things with the dimension order comment above... just did the X, Y, Z, T coordinate axes which are distinct from the actual dimensions of a netCDF file and should just be dealt with separately.

dblodgett-usgs commented 5 years ago

Re: #19 and #20, I come back to the last comment I made above. There is the basic relationship between a dimension and a coordinate-variable of the same name then there is canonical spatiotemporal axes that are required in order to map a given dataset onto some abstraction to be used for integration or whatever purpose (e.g. stars dimensions). CF "Coordinates" aim to support the latter using the former as one of a few tools. (noting that 2d coordinate variables don't work with dimension-name associations as an example)

My preference here is to have nc_coord_var address the canonical coordinates mapping to XYZT. Looking at nc_dims and nc_vars gets you to the fact that there is a 1d variable of the same name as a dimension so adding that information to the output of nc_coord_var seems superfluous?

In the interest of moving along I'll work that into a PR but feel free to decline if you want to take a different direction.

dblodgett-usgs commented 5 years ago

May want to consider bringing "../rasterwise/extdata/timeseries.nc" and "../rasterwise/extdata/high-dim/test-1.nc" in as just header info for tests in ncmeta. "../rasterwise/extdata/bad_examples_62/example3.nc" is also an edge case worth testing.

dblodgett-usgs commented 5 years ago

"../rasterwise/extdata/bad_examples_62/example3.nc"

is a situation where there are both auxiliary and normal coordinate variables. The way I support this now is by returning one row for the normal dimension-pair coordinate variables and another for the auxiliary coordinates. Seem reasonable?

# A tibble: 7 x 5
  variable X     Y     Z     T    
  <chr>    <chr> <chr> <chr> <chr>
1 time     NA    NA    NA    time 
2 lon      X     Y     NA    NA   
3 lat      X     Y     NA    NA   
4 X        X     NA    NA    NA   
5 Y        NA    Y     NA    NA   
6 pr       lon   lat   NA    NA   
7 pr       X     Y     NA    time