ihesp / IPART

Image-Process based Atmospheric River Tracking (IPART) algorithms
https://ipart.readthedocs.io/en/latest/
GNU General Public License v3.0

netCDF CF conventions #9

Open Xunius opened 4 years ago

Xunius commented 4 years ago

This is copied from a comment made by @sadielbartholomew. See the original post at https://github.com/openjournals/joss-reviews/issues/2407#issuecomment-667736221.


Continuing on the topic of improvements that are not compulsory towards acceptance in the paper given the open criteria, but would be good to think about going forwards, for good practice with metadata I suggest making more use of the CF Conventions (the recommended standard for netCDF), namely as described in the three points below.

  1. Increasing the compliance of the datasets included in the repository to the CF Conventions, especially those under the notebooks directory which users may interact with if they try out IPART with the provided Notebooks. Notably both uflux_s_6_1984_Jan.nc & vflux_s_6_1984_Jan.nc provided there are marked by global attribute as being CF-compliant to CF 1.6:

    
    :Conventions = "CF-1.6" ;

    which is okay (relative to the ideal, latest version, 1.8), but immediately I see improvements in compliance that could be made.

For example, the variable & dimensions are all described by a long_name attribute, where use of a standard_name attribute is preferable as each is unambiguous (see e.g. here). The time, lat & lon coordinates can take standard names of the same identifier as currently used for the long name, and from a quick search on the names table for "eastward" AND "vapor" I think the data itself with long_name=Vertical integral of eastward water vapour flux and units kg m**-1 s**-1 could probably be assigned a standard name of eastward_atmosphere_water_vapor_transport_across_unit_distance, or similar.

  2. Rephrasing aspects of the 'Data preparation' section of the documentation in terms of terminology from the CF Conventions. For instance, instead of stating there that:
    Source data are the u- and v- components of the vertically integrated vapor fluxes, in a rectangular grid.

you could explicitly state the standard names of data variables which would be applicable, e.g. something similar to northward_... and eastward_atmosphere_water_vapor_transport_across_unit_distance and maybe link to the definition of all grids which may be considered rectangular, for clarity. This would make it crystal clear whether a user's dataset(s) may be appropriate for processing by IPART.

  3. Making code changes to accommodate conventions. In particular, the required ordering of dimensions for IPART seemingly goes against conventions, given that the 'Data preparation' section states that: "the user is responsible for making sure that the data are saved in the following rank order: (time, level, latitude, longitude)" but as conveyed in this section "The CF convention places no rigid restrictions on the order of dimensions, however we encourage data producers to make the extra effort to stay within the COARDS standard order" where "COARDS restricts the axis (equivalently dimension) ordering to be longitude, latitude, vertical, and time".

So, you are advocating that users define data dimensions in the inverse order to that recommended. To make IPART more immediately accessible, you could amend your code so that it accepts the outlined conventional order, rather than the inverse.
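
One way to accept either ordering is to look dimensions up by name rather than by position. The following is only a minimal sketch of that idea, not IPART code: `axis_permutation`, `TARGET_ORDER` and the `ALIASES` table are hypothetical names introduced here, and the accepted aliases are an assumption.

```python
# Hypothetical helper, not part of IPART: work out how to reorder axes by
# dimension name, so that data stored in either order can be accepted.

# Rank order quoted from the IPART 'Data preparation' section.
TARGET_ORDER = ("time", "level", "latitude", "longitude")

# Assumed aliases for common dimension names; extend as needed.
ALIASES = {
    "time": "time",
    "level": "level", "lev": "level",
    "latitude": "latitude", "lat": "latitude",
    "longitude": "longitude", "lon": "longitude",
}


def axis_permutation(dims, target=TARGET_ORDER):
    """Return the axis indices that reorder `dims` into `target` order.

    Target dimensions absent from `dims` (e.g. no level axis) are skipped;
    unrecognised or duplicated names raise ValueError.
    """
    canonical = []
    for name in dims:
        key = name.lower()
        if key not in ALIASES:
            raise ValueError(f"unrecognised dimension name: {name!r}")
        canonical.append(ALIASES[key])
    if len(set(canonical)) != len(canonical):
        raise ValueError(f"duplicate dimensions in {dims!r}")
    return tuple(canonical.index(t) for t in target if t in canonical)


# Data stored in the (longitude, latitude, vertical, time) order quoted above.
coards_perm = axis_permutation(("lon", "lat", "level", "time"))

# Data already in the order IPART asks for, without a level axis.
era_perm = axis_permutation(("time", "lat", "lon"))
```

The resulting tuple could then be handed to something like `numpy.transpose(data, perm)`; with xarray, transposing by dimension name directly would serve the same purpose.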

Xunius commented 4 years ago

@sadielbartholomew I read about the standard_name in the links you gave, and now I better understand the differences between long_name and standard_name; I used to use the same string for both.

Regarding Point 2, the standard_names for the u- and v- flux components are indeed, as you said, eastward_atmosphere_water_transport_across_unit_distance and northward_atmosphere_water_transport_across_unit_distance.

Regarding Point 1, the uflux_s_6_1984_Jan.nc data in the notebooks folder were obtained directly from the ERA-I reanalysis data center: I selected the desired variable, time step, domain, etc., and downloaded them. So they came without a standard_name attribute; maybe it is the grib to nc conversion in their data server that omits the attribute. I never noticed this before.

If I understand correctly, there exists a pre-defined, permissible list of standard_names, so I can't just coin my new standard_name, like numerical_label_for_atmospheric_river or northward_atmosphere_water_transport_across_unit_distance_THR_anomaly_component, can I? In that case, do I just leave the attribute empty?

kbarnhart commented 4 years ago

@Xunius a couple of follow on points about this:

  1. I don't think that you need to do anything to files downloaded directly from the ERA-I reanalysis data center, as it's not your role to make that file CF compliant.

  2. I would strongly recommend that you relax the requirements associated with dimension ordering and replace them with an expectation that specific dimensions are present (and named correctly). netCDF and xarray both allow access to the name metadata. This will allow you to support both datasets that comply with the COARDS standard order and datasets such as the ERA-I data center downloads, which provide netCDF files in (time, lat, lon) order.

  3. Regarding standard names: You are correct that you can't create a new standard name (i.e., something listed in the CF standard name table) on your own. My experience is that coming up with these names is often challenging and a real art. My recommendation is that you use a descriptive name that follows the style of the standard names for the "standard_name" attribute field. Also take advantage of the "long_name", "units", and "_FillValue" fields in order to complete the description. As your example above shows, being descriptive often means making very, very long names.
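
As a rough illustration of point 3, the attribute set for a variable could be assembled along the following lines. This is a sketch only: plain dicts stand in for netCDF variable attributes, `describe_variable` and `NAME_STYLE` are hypothetical names, the style rule (lower-case letters, digits and underscores) is an assumption about the standard-name style, and the fill values are made up.

```python
import re

# Assumed style rule for standard-name-like identifiers: lower-case
# letters, digits and underscores, starting with a letter.
NAME_STYLE = re.compile(r"[a-z][a-z0-9_]*")


def describe_variable(name, long_name, units, fill_value):
    """Build an attribute dict for one variable (hypothetical helper)."""
    if not NAME_STYLE.fullmatch(name):
        raise ValueError(f"name does not follow the assumed style: {name!r}")
    return {
        "standard_name": name,
        "long_name": long_name,
        "units": units,
        "_FillValue": fill_value,
    }


# Flux variable, using a name and units quoted earlier in this thread.
uflux_attrs = describe_variable(
    "eastward_atmosphere_water_transport_across_unit_distance",
    "Vertical integral of eastward water vapour flux",
    "kg m**-1 s**-1",
    -9999.0,  # hypothetical fill value
)

# Derived variable with a descriptive name written in the same style.
label_attrs = describe_variable(
    "numerical_label_for_atmospheric_river",
    "Numerical label for atmospheric river",
    "1",  # dimensionless
    -1,   # hypothetical fill value
)
```

Under this assumed rule, a name containing upper-case letters, such as the `..._THR_anomaly_component` example above, would be rejected until it is lower-cased.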

Also 👍 👍 to @sadielbartholomew for such a thorough comment on CF-compliance and standard names. 🚀