Bounds are not dimensions in CF Metadata Conventions files #48

Open pvanlaake opened 10 months ago

pvanlaake commented 10 months ago

The CF Metadata Conventions define a sophisticated set of conventions for NetCDF files to ease interpretation and analysis of the data. One such convention is to attach a bounds attribute to each coordinate variable when the data represents cells (rather than point observations on a regular grid), to indicate the boundaries of each cell along that dimension. In this file the bounds are stored as 3-D arrays for lon and lat (including the time dimension, for reasons unknown to me) and as a 2-D array for time, each with an additional first dimension called bnds. As per the CF documentation, "a boundary variable is considered to be part of a coordinate variable’s metadata" and it is thus not a dimension. This is also made clear by the fact that coord_dim == FALSE for the "dimension" bnds. See the example below, using tidync:

> huss <- tidync(lf[1])
> huss

Data Source (1): huss_day_EC-Earth3-CC_historical_r1i1p1f1_gr_19910101-19960321_v20210113.nc ...

Grids (7) <dimension family> : <associated variables> 

[1]   D1,D2,D0 : lat_bnds    **ACTIVE GRID** ( 976384  values per variable)
[2]   D1,D3,D0 : lon_bnds
[3]   D3,D2,D0 : huss
[4]   D1,D0    : time_bnds
[5]   D0       : time
[6]   D2       : lat
[7]   D3       : lon

Dimensions 4 (3 active): 

  dim   name  length     min     max start count    dmin    dmax unlim coord_dim 
  <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> <lgl>     
1 D0    time    1907 51500.  53406.      1  1907 51500.  53406.  FALSE TRUE      
2 D1    bnds       2     1       2       1     2     1       2   FALSE FALSE     ## <<<<<<<
3 D2    lat      256   -89.5    89.5     1   256   -89.5    89.5 FALSE TRUE      

Inactive dimensions:

  dim   name  length   min   max unlim coord_dim 
  <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>     
1 D3    lon      512     0  359. FALSE TRUE  

> huss$attribute |> filter(name == "bounds") |> unnest(value)
# A tibble: 3 × 4
     id name   variable value    
  <int> <chr>  <chr>    <chr>    
1     0 bounds time     time_bnds
2     0 bounds lat      lat_bnds 
3     0 bounds lon      lon_bnds 

Would it be a good idea to drop the bnds "dimension" and thus the associated grids? There should be some other mechanism to retain them, however, so that their contents can still be accessed.
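
To make the idea concrete, here is a minimal sketch in base R of how bounds grids could be separated from data grids using nothing but the bounds attributes. It does not use tidync internals: the two data frames are copied by hand from the output above, and the column names coordinate, bounds and bounds_of are my own, hypothetical choices.

```r
# Grid table copied by hand from the tidync print-out above.
grids <- data.frame(
  grid     = c("D1,D2,D0", "D1,D3,D0", "D3,D2,D0", "D1,D0", "D0", "D2", "D3"),
  variable = c("lat_bnds", "lon_bnds", "huss", "time_bnds", "time", "lat", "lon"),
  stringsAsFactors = FALSE
)

# The "bounds" attributes, also copied from the output above.
# Column names are hypothetical, chosen for this sketch only.
bounds_att <- data.frame(
  coordinate = c("time", "lat", "lon"),
  bounds     = c("time_bnds", "lat_bnds", "lon_bnds"),
  stringsAsFactors = FALSE
)

# A variable is boundary metadata if some coordinate variable names it
# in its bounds attribute; record which coordinate it belongs to.
grids$bounds_of <- bounds_att$coordinate[match(grids$variable, bounds_att$bounds)]

data_grids   <- grids[is.na(grids$bounds_of), ]
bounds_grids <- grids[!is.na(grids$bounds_of), ]

# Dimensions that occur only on bounds grids are the "false" dimensions.
dims_of    <- function(g) unique(unlist(strsplit(g, ",")))
false_dims <- setdiff(dims_of(bounds_grids$grid), dims_of(data_grids$grid))
false_dims  # "D1", i.e. bnds
```

The bounds grids are kept in a separate table rather than discarded, so their contents stay reachable through the coordinate they describe. Because the lookup goes through the attribute, it does not depend on what the vertex dimension happens to be called.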

mdsumner commented 10 months ago

huh, well there you go, I never really understood that - I thought there were cases where the corner coordinates are stored explicitly

I'll have to look at a few cases and get reprexes, here's a couple:

# surface wind (sfcWind) aggregates
src <- "https://dapds00.nci.org.au/thredds/dodsC/ua6_4/CMIP5/derived/CMIP5/GCM/native/CSIRO-BOM/ACCESS1-3/rcp45/day/atmos/Amon/r1i1p1/latest/sfcWind/aggregates/sfcWind_Amon_ACCESS1-3_rcp45_r1i1p1_2015-2034-monMax-seasmax-clim_native.nc"
# ROMS ocean model output (run one or the other, this assignment overwrites the first)
src <- "https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/roms/eac/his_2023_03_09_84685.nc"

pvanlaake commented 10 months ago

Where do you get all this freakish data?!? Must be too much Vegemite down under!

The surface wind file conforms to what I observed before but the false dimension is called nb2. The false grids are still easily identified:

> sfc$attribute |> filter(name == "bounds") |> unnest(value)
# A tibble: 2 × 4
     id name   variable value   
  <int> <chr>  <chr>    <chr>   
1     4 bounds lon      lon_bnds
2     4 bounds lat      lat_bnds

Otherwise it's a funny file with all these monthly, seasonal and annual variables.

The ocean file is truly scary. This seems to be a file for intermediate use by people who are intimately familiar with this presentation of data. The false dimensions, like xi_u and eta_u, seem to have some meaning if you know the formulae that relate to the (u,v) components of wind fields: they yield different results for s_rho and ocean_time on the hyper_array of variable u. However, they have no values that are revealed by dimension variable data or by local or global attributes. Simply dropping the false dimensions is obviously a bad choice here: variables u and v would no longer show up.

So I guess my issue should for now be considered a nuisance that is a feature of various data sets rather than a loose end that needs a fix.

mdsumner commented 10 months ago

oh right yes sorry I just went for a hard core example because I didn't otherwise know how to find something quickly

it's ocean model output, and doesn't get much more complicated 😄

I'm confused about what a coord dim is, and how bounds can be expressed - I think I was just wrong about this

a coord dim is just one that has axis values in a var, right, I think we can fix this pretty easily but I need to warm up a bit 👌

pvanlaake commented 10 months ago

I always interpreted a coord_dim as a flag to indicate if a variable contains the values of some dimension (TRUE) or whether it is a true variable on a grid whose values represent some physical property (FALSE). Unless I am very mistaken, this is how package ncdf4 puts dimension values in the vals property of each dimension in its ncdf4 class.
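
A minimal sketch of what I mean, assuming ncdf4 is installed. The toy file, its values and the helper names (has_dimvar and so on) are made up for illustration; only lat gets a coordinate variable, while bnds is a bare dimension as in the CF files above.

```r
library(ncdf4)

# Hypothetical toy file: a 3-cell latitude axis with CF-style bounds.
lat <- ncdim_def("lat", "degrees_north", c(-60, 0, 60))
# bnds gets no coordinate variable; ncdf4 then requires empty units
# and values 1:len.
bnds <- ncdim_def("bnds", "", 1:2, create_dimvar = FALSE)

lat_bnds <- ncvar_def("lat_bnds", "degrees_north", list(bnds, lat),
                      prec = "double")

path <- tempfile(fileext = ".nc")
nc <- nc_create(path, list(lat_bnds))
ncatt_put(nc, "lat", "bounds", "lat_bnds")  # CF link from lat to its bounds
ncvar_put(nc, lat_bnds, rbind(c(-90, -30, 30), c(-30, 30, 90)))
nc_close(nc)

nc <- nc_open(path)
# create_dimvar is TRUE only where the file holds a coordinate variable.
has_dimvar <- vapply(nc$dim, function(d) d$create_dimvar, logical(1))
lat_vals   <- nc$dim$lat$vals    # values read from the lat variable
bnds_vals  <- nc$dim$bnds$vals   # no variable, so just the index 1:2
lat_bounds <- ncatt_get(nc, "lat", "bounds")$value
nc_close(nc)

has_dimvar
```

If this runs as I expect, has_dimvar is TRUE for lat and FALSE for bnds, which corresponds to the coord_dim column that tidync prints.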

mdsumner commented 10 months ago

what about this file? are these not bounds in that strict way? I see you've said as much about the wind file above now :)

if ncdump says it's a dimension then I'm unclear what one is supposed to do about it otherwise


ncdump -h dt_global_allsat_phy_l4_20200603_20201126.nc
netcdf dt_global_allsat_phy_l4_20200603_20201126 {
dimensions:
        time = 1 ;
        latitude = 720 ;
        longitude = 1440 ;
        nv = 2 ;

mdsumner commented 10 months ago

