hypertidy / ncmeta

Tidy NetCDF metadata
https://hypertidy.github.io/ncmeta/

Extended metadata for "time" dimension and possible others #49

Closed by pvanlaake 5 months ago

pvanlaake commented 10 months ago

NetCDF has a very well-defined structure for describing the data that goes into a file, and ncmeta already captures all of this information nicely. Several conventions build on the basic NetCDF structure to add further standards, with COARDS and the CF Metadata Conventions among the more popular ones. I would assume that >90% of publicly exchanged NetCDF data (meaning data whose consumer is not familiar with the producer's processing design and thus depends on the metadata and any conventions to interpret it) follows one of these conventions. To me it would therefore make sense to capture some of these conventions for describing domain-specific concepts in ncmeta and thus make them more accessible to NetCDF consumers in R.

I would propose to call these "extended metadata" and to capture them in an additional tibble called (let me think hard here) extended that is attached to the result of passing a NetCDF resource through this package. For the "time" dimension used by both COARDS and the CF Metadata Conventions, notoriously hard to interpret due to its in-built flexibility, that would look somewhat like this:

nc_meta.NetCDF <- function(x, ...) {
  ## (...) existing code that creates dims, vars and axis is elided here

  atts <- nc_atts(x) ## safe to assume there are always some attributes - it's the raison d'etre of NetCDF

  ## Add time information for any "time" dimension. Since not all files have a
  ## "calendar" attribute or an "axis" attribute equal to "T", just try to
  ## create a CFtime instance for any dimension variable with a "units"
  ## attribute, using its "calendar" attribute if present. Build a tibble for
  ## extended metadata from the identified time dimensions.
  idx <- which(atts$name == "units")
  var_names <- atts$variable[idx]
  units <- unlist(atts$value[idx])
  ext <- tibble::tibble()
  for (i in seq_along(idx)) {
    ## too short to hold a "<unit> since <date>" definition
    if (nchar(units[i]) < 8) next
    cal <- unlist(atts$value[which(atts$variable == var_names[i] & atts$name == "calendar")])
    try({
      cft <- CFtime::CFtime(units[i], cal)
      ## we have a CFtime instance so now read actual offsets and add
      offsets <- as.vector(RNetCDF::var.get.nc(x, var_names[i]))
      cft <- cft + offsets

      ## add to the tibble; the CFtime instance is wrapped in list() so that
      ## it is stored in a list-column
      id <- dims$id[which(dims$name == var_names[i])]
      if (nrow(ext) == 0) ext <- tibble::tibble_row(dim_id = id, time = list(cft))
      else ext <- tibble::add_row(ext, dim_id = id, time = list(cft))
    }, silent = TRUE)
  }

  structure(list(dimension = dims, 
       variable = vars, 
       attribute = atts,
       extended = ext,
       axis = axis,
       grid = nc_grids_dimvar(dims, vars, axis)),
       class = "ncmeta")
}

Alternatively, the extended attributes, like cft above, can be added as an additional column to the dims tibble.
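
As a rough sketch of that alternative, the extended values could be merged into the dimension table as a list-column. The example below uses base R data frames with made-up values as stand-ins for the ncmeta tibbles, and a plain Date vector as a placeholder for a CFtime instance; none of this is taken from the ncmeta code itself.

```r
## Toy stand-ins for the ncmeta tables; all values are made up for illustration.
dims <- data.frame(id = 0:2, name = c("y", "t", "x"))
ext  <- data.frame(dim_id = 1L)
ext$time <- list(as.Date("2020-05-04") + c(0, 5))  # placeholder for a CFtime instance

## Start with an all-NA list-column, then fill in only those rows for which
## extended metadata was found.
dims$time <- as.list(rep(NA, nrow(dims)))
dims$time[match(ext$dim_id, dims$id)] <- ext$time

str(dims$time)
```

Dimensions without extended metadata keep a single NA in the list-column, so they can be filtered out easily.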

The CFtime package used here can ingest the NetCDF attributes attached to a dimension that represents time and produce a vector of character strings that can be used as dimnames(), among several other things. It is currently integrated into the dev version of tidync but may actually be better placed here as it then becomes available to all packages that are using ncmeta.
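
To illustrate the idea without depending on the CFtime API, here is a minimal base R sketch of the simplest case: decoding "days since" offsets on a Gregorian-type calendar and using the result as dimnames(). The helper decode_days_since() is hypothetical and written only for this example; CFtime itself handles other units and the non-Gregorian CF calendars, which this sketch does not attempt.

```r
## Hypothetical helper for illustration only; not part of CFtime or ncmeta.
## Handles only "days since <date>" definitions on a Gregorian-type calendar.
decode_days_since <- function(units, offsets) {
  parts <- strsplit(units, " since ", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2, parts[1] == "days")
  format(as.Date(parts[2]) + offsets)
}

stamps <- decode_days_since("days since 2020-05-04", c(0, 5))
print(stamps)
## [1] "2020-05-04" "2020-05-09"

## Use the timestamps to label the time axis of a toy 2 x 2 array
arr <- array(1:4, dim = c(2, 2))
dimnames(arr) <- list(c("y1", "y2"), stamps)
```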

A similar approach may be used for coordinate reference systems, or other such conventions out there.

mdsumner commented 10 months ago

this sounds good, so if I understand correctly the existing tables wouldn't change, we'd create a new "extended" table that takes the raw information and adds convention-specific interpretation to the raw values?

pvanlaake commented 10 months ago

That's correct. I'll fork and start working on it.

pvanlaake commented 10 months ago

Hi Michael, have a look at my fork. I am not yet ready to go for a PR but the basic functionality is there and it works. Problem is I got a little too excited and started editing other files, primarily to add full ncdf4 support. You may find a glitch here and there but this is what it does on a sunny day:

> my <- nc_extended("./some.nc")
> my
# A tibble: 3 × 3
  dimension name  time     
      <int> <chr> <list>   
1         0 y     <lgl [1]>
2         1 t     <CFtime> 
3         2 x     <lgl [1]>
> my |> filter(!is.na(time))
# A tibble: 1 × 3
  dimension name  time    
      <int> <chr> <list>  
1         1 t     <CFtime>
> my |> filter(!is.na(time)) |> pull(time)
[[1]]
CF datum of origin:
  Origin  : 2020-05-04 00:00:00
  Units   : days
  Calendar: proleptic_gregorian
CF time series:
  Elements: [2020-05-04 .. 2020-05-09] (average of 5.000000 days between 2 elements)
> CFtimestamp((my |> filter(!is.na(time)) |> pull(time))[[1]])
[1] "2020-05-04" "2020-05-09"

Let me know what you think.

mdsumner commented 10 months ago

hey this looks good, I don't have much time to explore but if you PR the changes I will review. I think it's no harm to add information to the output like this, but I will try checks on a wide range of files.

pvanlaake commented 10 months ago

PR made. I checked on my collection of weird NetCDF files and it all works fine. ncdf4 files are also directly supported, either with dim vals loaded (the ncdf4 default) or not (when suppress_dimvals = T or readunlim = F in nc_open()).

mdsumner commented 9 months ago

I have not had a chance to look in detail, but I will at some point.

I could go for a CRAN submission before the holidays if you need?

pvanlaake commented 9 months ago

Hi Michael, I have been busy with Real Life over the past weeks, just like you apparently. The PR that you merged is good to go and seeing that the last release was over a year ago I'd say a new CRAN submission would be a nice gift under the Christmas tree for ncmeta users.

Let me know if you need any help prepping the release.

pvanlaake commented 5 months ago

Incorporated in release 0.4.0