oceandatainterop / nc-eTAG

NetCDF file, metadata & data standards for electronic tagging datasets (nc-eTAG)

Temperature variable as a coordinate? #3

Open lesserwhirls opened 4 years ago

lesserwhirls commented 4 years ago

In the summary template, is the value coordinate for the attribute coverage_content_type correct?

https://github.com/oceandatainterop/nc-eTAG/blob/277f41e081914576f10fbceef6dbc0efaab85b25/nc-eTAG_PSATsummarydata_Template.cdl#L96

lesserwhirls commented 4 years ago

In my mind, this all comes down to identifying what the coordinates for a frequency variable are. For sure, there is a time period, a latitude, a longitude, and a trajectory id associated with each set of summary statistics.

Latitude, longitude, and trajectory are pretty straightforward, and they are covered in the CDL.

To cover the time component, we need a reference time variable that associates a time with each distinct set of summary data. That reference time variable also needs a bounds attribute pointing to another variable, which clearly defines the full time interval associated with each distinct set of summary data.

In the CDL, we have time as the reference time variable, and time_bnds as the bounds variable. This captures the basics of the "pre-programmed time interval used to compute the frequencies".
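As a rough illustration of the reference-time-plus-bounds pattern, here is a minimal Python sketch. The 6-hour interval length, the start value, and the choice of the interval midpoint as the reference time are all assumptions made for the example; nothing in the CDL says the reference value has to be the midpoint.

```python
# Hypothetical summary periods: three consecutive 6-hour intervals,
# expressed as seconds since an arbitrary epoch (values are made up).
interval_seconds = 6 * 3600
start = 0
n_intervals = 3

# time_bnds(time, bnds): one [lower, upper] pair per summary period.
time_bnds = [
    [start + i * interval_seconds, start + (i + 1) * interval_seconds]
    for i in range(n_intervals)
]

# time(time): one representative value per period. Using the midpoint is an
# assumption of this sketch; the bounds are what carry the exact interval.
time = [(lower + upper) / 2 for lower, upper in time_bnds]

print(time_bnds)
print(time)
```

The point is only that time gives one value per set of summary data, while time_bnds is what defines the interval itself.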

However, interpreting the frequency of a data variable also depends on the bins used to summarize that variable (I would say this dependency is what makes the "bin" variable a coordinate). Let's say the data variable in question is temperature_frequency. The question then becomes "how do we describe the pre-programmed temperature intervals used to compute the frequencies?" I think we can do this in much the same way as we do for the time variable, which is what we have in the CDL currently:

double temperature(bins_freq);
  string temperature:bounds = "temperature_bnds";

double temperature_bnds(bins_freq, bnds);

which leads us to something like:

double temperature_frequency(time, bins_freq);
  string temperature_frequency:coordinates = "time latitude longitude temperature trajectory";

where time and temperature provide a representative coordinate value, and are linked to time_bnds and temperature_bnds, which precisely define the pre-programmed intervals over which the frequencies are computed.
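To make the role of the two bounds variables concrete, here is a minimal Python sketch of how a temperature_frequency(time, bins_freq) array could be filled from raw samples. All numbers are invented, the half-open [lower, upper) binning rule is an assumption (a tag's firmware may treat bin edges differently), and the result holds raw counts rather than normalized frequencies.

```python
# Hypothetical bounds, mirroring time_bnds(time, bnds) and
# temperature_bnds(bins_freq, bnds). All numbers are invented.
time_bnds = [[0, 6], [6, 12]]                      # hours
temperature_bnds = [[0, 10], [10, 20], [20, 30]]   # degrees C

# Hypothetical raw tag samples as (time, temperature) pairs.
samples = [(1, 4.0), (2, 12.5), (5, 13.0), (7, 25.0), (11, 9.9)]


def find_cell(value, bounds):
    """Return the index of the [lower, upper) cell containing value, or None."""
    for index, (lower, upper) in enumerate(bounds):
        if lower <= value < upper:
            return index
    return None


# temperature_frequency(time, bins_freq): counts per time interval and bin.
temperature_frequency = [[0] * len(temperature_bnds) for _ in time_bnds]
for sample_time, sample_temperature in samples:
    t = find_cell(sample_time, time_bnds)
    b = find_cell(sample_temperature, temperature_bnds)
    if t is not None and b is not None:
        temperature_frequency[t][b] += 1

print(temperature_frequency)
```

Each count can only be interpreted once both the time interval and the temperature bin it belongs to are known, which is why both sets of bounds matter.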

From the considerations above, I would say that temperature is a coordinate of the temperature_frequency variable. The 1D time variable matches the time dimension of temperature_frequency, and the 1D temperature variable matches its bins_freq dimension, at least in terms of dimensionality (to explicitly be a coordinate variable, the variable name and the dimension name would need to match).
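The naming rule in that parenthetical can be sketched in a few lines of Python. The helper name is hypothetical, and it checks only the "1D and named after its own dimension" rule; other requirements on coordinate variables, such as monotonic values, are not checked here.

```python
def is_coordinate_variable(name, dimensions):
    """True if a variable is 1D and shares its name with its only dimension."""
    return len(dimensions) == 1 and dimensions[0] == name


# Variable name -> dimensions, as declared in the CDL snippets above.
variables = {"time": ("time",), "temperature": ("bins_freq",)}

for name, dims in variables.items():
    print(name, is_coordinate_variable(name, dims))
```

Under this rule, time qualifies as a coordinate variable, while temperature(bins_freq) does not, so it has to be listed in the coordinates attribute instead.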

Taking things one step further, though, we should now ask "how do we make sure users know that the frequencies are computed over pre-programmed temperature intervals AND pre-programmed time intervals?" That's where the cell_methods attribute comes into play. Note that the use of cell_methods here would need a few extensions to CF.

  1. Right off the bat, the section of the CF standard dealing with cells opens with "[w]hen gridded data...", which strongly implies the cell concept is limited to gridded data (gridded data, and related concepts, are referenced throughout the cell chapter). The CF document defines a cell as:

    A region in one or more dimensions whose boundary can be described by a set of vertices. The term interval is sometimes used for one-dimensional cells.

    The way I read that, a "cell" captures the concept of a dimension that is described using a representative value with extents. It just so happens that a "grid cell" is easy to visualize mentally, but I think the basic cell concept can apply to the kinds of dimensions found within Discrete Sampling Geometry based datasets (though the language in the document would need to be extended or clarified to say so).

  2. CF does not currently have a cell_method value to describe a frequency summary statistic. Currently, the CDL says something like:

    string temperature_frequency:cell_methods = "time : temperature : count";

    and I think that's a good start, but it's a bit more complicated than that. (Side note: in the current CDL, the cell_methods attribute is listed with type double, but it should be string.) First, count does not currently exist as a cell method in CF, so we'd need to propose adding it. If we're doing that, perhaps we should aim for something more precise, say:

    string temperature_frequency:cell_methods = "time : temperature : frequency_distribution";
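For readers less familiar with the attribute, here is a minimal Python sketch of how such a string breaks down into the dimensions it applies to and the method name. The function is hypothetical and handles only a single "name: name: method" entry without parenthesised comments; frequency_distribution is the value proposed above, not an existing CF cell method.

```python
def split_cell_method(entry):
    """Split a single 'name: name: method' entry into (names, method).

    A simplification: real cell_methods strings can chain several entries
    and carry parenthesised comments, which this sketch does not handle.
    """
    parts = [part.strip() for part in entry.split(":")]
    *names, method = parts
    return names, method


names, method = split_cell_method("time : temperature : frequency_distribution")
print(names, method)
```

Read this way, the attribute says the method applies jointly over the time cells and the temperature cells, which is exactly the "both sets of pre-programmed intervals" message we want users to get.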