hypertidy / ncmeta

Tidy NetCDF metadata
https://hypertidy.github.io/ncmeta/
11 stars 5 forks source link

nc_axis row order #23

Open dblodgett-usgs opened 5 years ago

dblodgett-usgs commented 5 years ago

Based on this: https://github.com/hypertidy/ncmeta/blob/master/R/nc_axes.R#L46 the row order of the response from nc_axis is in the order of the RNetCDF::var.inq.nc()$dimid vector. Is this by design? I ask because that $dimid for a variable determines what order you put dimensions in start and count for RNetCDF::var.get.nc() and it's pretty important to get right!!

I wonder if adding a list column to the nc_var() response that would contain the $dimid vector would be safer? The fact that the row order corresponds dimension order is being used here: https://github.com/r-spatial/stars/blob/master/R/ncdf.R#L58 already so at the least there should be some testing to make sure that the row order doesn't get mucked up?

var <- "ppt"
h_nc <- RNetCDF::open.nc("https://cida.usgs.gov/thredds/dodsC/prism_v2")

dplyr::filter(nc_axes(h), variable == var)$dimension
# [1] 1 0 3

RNetCDF::var.inq.nc(h_nc, var)$dimids
# [1] 1 0 3
mdsumner commented 5 years ago

I think it's correct this way. I'm not sure about by design, but I definitely checked these things early on.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ncmeta)
h_nc <- RNetCDF::open.nc("https://cida.usgs.gov/thredds/dodsC/prism_v2")

nc_axes(h_nc, variables = "ppt") %>% 
 dplyr::left_join(nc_dims(h_nc), c("dimension" = "id"))
#> # A tibble: 3 x 6
#>    axis variable dimension name  length unlim
#>   <int> <chr>        <dbl> <chr>  <dbl> <lgl>
#> 1     1 ppt              1 lon     1405 FALSE
#> 2     2 ppt              0 lat      621 FALSE
#> 3     3 ppt              3 time    1476 FALSE

Created on 2019-03-20 by the reprex package (v0.2.1)

dblodgett-usgs commented 5 years ago

Is row order of the nc_axes tibble enough? I'm cool if we want to go with that convention. It should be advertised in docs and tested at the least. I can take a crack at that when I look at nc_coord_var() too.

mdsumner commented 5 years ago

This the logic of axes:

f <- "https://cida.usgs.gov/thredds/dodsC/prism_v2"
x <- tidync(f)

x

Data Source (1): prism_v2 ...

Grids (5) <dimension family> : <associated variables> 

[1]   D1,D0,D3 : ppt, tmx, tmn    **ACTIVE GRID** ( 1287817380  values per variable)
[2]   D2,D3    : time_bnds
[3]   D0       : lat
[4]   D1       : lon
[5]   D3       : time

Dimensions 4 (3 active): 

  dim   name  length     min      max start count    dmin     dmax unlim coord_dim 
  <chr> <chr>  <dbl>   <dbl>    <dbl> <int> <int>   <dbl>    <dbl> <lgl> <lgl>     
1 D0    lat      621    24.1     49.9     1   621    24.1     49.9 FALSE TRUE      
2 D1    lon     1405  -125.     -66.5     1  1405  -125.     -66.5 FALSE TRUE      
3 D3    time    1476  9131    54025       1  1476  9131    54025   FALSE TRUE      

Inactive dimensions:

  dim   name  length   min   max unlim coord_dim 
  <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>     
1 D2    tbnd       2     1     2 FALSE FALSE     

See how the variables are D1, D0, D3, but the dimensions are in order of their ID. (I think that table of active dimensions should be in D1, D0, D3 order too - something to fix, but I want it to not be all cat/print anyway).

From the perspective of ncmeta,

we get all dimensions in source order

ncmeta::nc_dims(f)
# A tibble: 4 x 4
     id name  length unlim
  <dbl> <chr>  <dbl> <lgl>
1     0 lat      621 FALSE
2     1 lon     1405 FALSE
3     2 tbnd       2 FALSE
4     3 time    1476 FALSE

We get axes (instances of dimensions) for variables in coordinate order, so axis here is the 1, 2, 3 order.

ncmeta::nc_axes(f, "ppt")
# A tibble: 3 x 3
   axis variable dimension
  <int> <chr>        <dbl>
1     1 ppt              1
2     2 ppt              0
3     3 ppt              3

Loosely, the dimension id matches id in the dimension table - that's something to formalize, and I'm not sure how consistent I am. In places I use "name" on the entity table, not "variable" or "dimension", but there "axis" is not a name but an ordering, so ...

Frankly, I have gone around in circles a bit with this but somehow got it right last time. It takes me a while to get my head into it again - definitely needs to be pinned down with tests, so we can clean up the code somewhat.

There was a stage when I was unpacking the dimension strings "D1,D0,D3" but that's bad news, I really think axis is the right concept, and in context of a variable they are ordered (and labelled with an order attribute axis).

It also holds for all axes, but here the axis order label is absolute (not sure if that's good or bad):

ncmeta::nc_axes(f)
# A tibble: 14 x 3
    axis variable  dimension
   <int> <chr>         <dbl>
 1     1 lon               1
 2     2 lat               0
 3     3 time              3
 4     4 time_bnds         2
 5     5 time_bnds         3
 6     6 ppt               1
 7     7 ppt               0
 8     8 ppt               3
 9     9 tmx               1
10    10 tmx               0
11    11 tmx               3
12    12 tmn               1
13    13 tmn               0
14    14 tmn               3
dblodgett-usgs commented 5 years ago

Yeah -- I see what's going on now. You get the dimension order right in rows because of this: https://github.com/hypertidy/ncmeta/blob/cea1b1464d25a88e80388d6503abc0ea16d9ff2c/R/nc_axes.R#L59 This:

Browse[2]> out
$id
[1] 4

$name
[1] "ppt"

$type
[1] "NC_INT"

$ndims
[1] 3

$dimids
[1] 1 0 3

$natts
[1] 6

becomes:

Browse[2]> out
$id
[1] 4 4 4

$name
[1] "ppt" "ppt" "ppt"

$type
[1] "NC_INT" "NC_INT" "NC_INT"

$ndims
[1] 3 3 3

$dimids
[1] 1 0 3

$natts
[1] 6 6 6

Since dimids is the only part of the var.inq.nc that varies, the row order of the resulting tibble has the right order. It always will too.

My feeling is that it would be as good or better to put the dimid vector into a list column for each variable and drop the whole axis thing -- but maybe I'm missing the utility of the axis construct?

Happy to start writing some code around this way of doing things and just be aware of things being sorted in "axis" order too.

> s <- "https://cida.usgs.gov/thredds/dodsC/stageiv_combined"
> nc_axes(s)
# A tibble: 10 x 3
    axis variable                                        dimension
   <int> <chr>                                               <dbl>
 1     1 Total_precipitation_surface_1_Hour_Accumulation         2
 2     2 Total_precipitation_surface_1_Hour_Accumulation         3
 3     3 Total_precipitation_surface_1_Hour_Accumulation         0
 4     4 time_bounds                                             1
 5     5 time_bounds                                             0
 6     6 lon                                                     3
 7     7 lon                                                     2
 8     8 lat                                                     3
 9     9 lat                                                     2
10    10 time                                                    0