hypertidy / ncmeta

Tidy NetCDF metadata
https://hypertidy.github.io/ncmeta/

Should nc_att() include the attribute name? #12

Closed dblodgett-usgs closed 5 years ago

dblodgett-usgs commented 5 years ago

I'm seeing the column names [1] "attribute" "variable" "value", where "attribute" is numeric rather than character, unless you request the attribute by name, in which case it comes back as character.

It would be pretty useful to get the attribute name rather than the index to work with standard cf attributes. Would that be useful to you too?

I've started puttering around in the nc_att code to add the name, and I see two ways this could be done. Since you've already got the value column as a list column, we could make it a named list, i.e. now I see something like:

> a$value
[[1]]
[1] "Chlorophyll Concentration, OCI Algorithm"

and it would be better as?

> t$value
$long_name
[1] "Chlorophyll Concentration, OCI Algorithm"
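
For illustration only, here is a minimal base R sketch of the named-list idea. The data frame is a made-up stand-in for the nc_att() result (it is not produced by ncmeta), and the object names a and named are assumptions chosen for the example.

```r
# Made-up stand-in for the current nc_att() result: one row, list column "value".
a <- data.frame(attribute = 0, variable = 0)
a$value <- list("Chlorophyll Concentration, OCI Algorithm")

# Variant under discussion: same table, but the list column carries the attribute name.
named <- a
value_list <- named$value
names(value_list) <- "long_name"
named$value <- value_list

a$value[[1]]           # unnamed list: positional access only
named$value$long_name  # named list: access by attribute name
```

With the named variant, the attribute name travels with the value but still does not appear as its own column.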

Or we could modify the returned tibble so it has an "id" and a "name" column. The big issue now is that attributes can be requested by id or by name -- which is fine from a request point of view, but results in different kinds of output in the "attribute" column of the current output.

e.g. try this:

f <- system.file("extdata", "S2008001.L3m_DAY_CHL_chlor_a_9km.nc", package = "ncmeta")
nc_att(f, 0, 0)
nc_att(f, "chlor_a", "long_name")

I started implementing a solution that does this:

> f <- system.file("extdata", "S2008001.L3m_DAY_CHL_chlor_a_9km.nc", package = "ncmeta")
> nc_att(f, 0, 0)
# A tibble: 1 x 4
     id name      variable value    
* <dbl> <chr>        <dbl> <list>   
1     0 long_name        0 <chr [1]>
> nc_att(f, "chlor_a", "long_name")
# A tibble: 1 x 4
     id name      variable value    
* <dbl> <chr>     <chr>    <list>   
1     0 long_name chlor_a  <chr [1]>

At the end of the day, I guess it comes down to how one would want to use the output. If the intention is for that "attribute" column to be used as a key to the semantics of the requester, I'd say leave it the way it is. If it's meant to be an identifier for the attribute, it's ambiguous and should be changed to "id" and "name"?
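
A rough sketch of the "id" plus "name" idea, in base R only: whichever way the attribute is requested, the result carries both columns with the same types. The helper resolve_att() and the att_names vector are hypothetical, written for this example; they are not part of ncmeta, and the names after "long_name" are illustrative rather than read from the file.

```r
# Hypothetical attribute names for one variable, in file order.
# NetCDF attribute ids are 0-based, so "long_name" has id 0.
att_names <- c("long_name", "units", "standard_name")

# Hypothetical helper: accept an id or a name, always return both.
resolve_att <- function(att, att_names) {
  if (is.character(att)) {
    id <- match(att, att_names) - 1   # name -> 0-based id
    name <- att
  } else {
    id <- att
    name <- att_names[att + 1]        # 0-based id -> name
  }
  data.frame(id = id, name = name, stringsAsFactors = FALSE)
}

by_id   <- resolve_att(0, att_names)
by_name <- resolve_att("long_name", att_names)
```

Here by_id and by_name hold the same id and name, so downstream code does not need to know how the request was made.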

I'd be happy to implement and PR a solution here depending what you think would be most useful and unintrusive. Thinking I can probably contribute a few things here that I'll use in some work I'm doing that also uses stars so getting on the same page re: your vision would be helpful.

Cheers!

mdsumner commented 5 years ago

This is a very good point. I've tended to shy away from attributes because they are complicated compared to the raw data. I definitely like the idea of returning tables that are in normal form, so if we ask for nc_att(varidentifier) the table should have nrow == n_atts of that variable. I'm definitely not consistent on that in ncmeta. I appreciate any PRs in this direction, and I'll be very positive about contributions!
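
As a sketch of what "nrow == n_atts" could look like, assuming the attributes of one variable are already available as a named list: att_table() is a hypothetical helper written for this example (not an ncmeta function), and the "units" value is made up for illustration.

```r
# Made-up attributes for a single variable, as a named list.
atts <- list(
  long_name = "Chlorophyll Concentration, OCI Algorithm",
  units     = "mg m^-3"   # illustrative value only
)

# Hypothetical helper: one row per attribute, values kept in a list column.
att_table <- function(variable, atts) {
  tab <- data.frame(
    id = seq_along(atts) - 1,   # 0-based, matching NetCDF attribute ids
    name = names(atts),
    variable = variable,
    stringsAsFactors = FALSE
  )
  tab$value <- unname(atts)     # list column, one element per row
  tab
}

tab <- att_table("chlor_a", atts)
nrow(tab) == length(atts)       # one row per attribute
```

Keeping the values in a list column means attributes of different types and lengths can sit in the same table.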

To contextualize, ncdf4 and RNetCDF are extremely different. The former returns all metadata in one connection object and you are expected to traverse the tree; it's not normal form, and there are many redundancies. RNetCDF has more verbs to extract each part, but it's not as efficient because it doesn't manage the file connection as well. I found RNetCDF easier to build upon, but ncdf4 is faster generally.

I don't have a handle on attributes yet, but tidync has a very strong idea of grids and variables and dimensions, and they are automatable in powerful ways. The idea of ncmeta is to protect tidync (and friends) from these details, but as you clearly identify - the attributes in NetCDF are not clearly modelled here yet.

dblodgett-usgs commented 5 years ago

OK. Yeah, I agree re: normal form. I'll get a PR together for that and see what you think.

In general, I think I can help with the NetCDF attributes. Been working in the CF community for a long time and generally know my way around vagaries of the spec.