R03: need for data format code indication

delphinedobler commented 1 year ago

In the former Excel spreadsheet, the format code (the "data type" column, with the following values: float, double, NC_Short (16-bit signed integer), NC_DOUBLE) was indicated, and it is used by the file checker. This information should also be reflected on the NVS side.

Is there a difference (a subtlety) between double and NC_double? It seems mainly to be a question of the associated fill_value:

double fill_value is 99999 while NC_double is 9.9692099683868690e+36.

As for float, which is not mentioned as NC_float: float fill_value is 99999.f while NC_float is 9.9692099683868690e+36.

We need to clarify why we use both semantics (both netCDF NC_* and simple float/double) and whether this is related to the fill_values used.
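To make the comparison concrete, here is a minimal Python sketch of the type codes and fill values discussed above. It assumes (this is not confirmed in this thread) that the legacy labels float and double are stored with the same 32-bit and 64-bit widths as NC_FLOAT and NC_DOUBLE. The names `TYPE_TABLE`, `describe` and `differ_only_in_fill` are made up for illustration. The NC_* fill values are the netCDF library defaults (NC_FILL_FLOAT, NC_FILL_DOUBLE, NC_FILL_SHORT).

```python
import struct

# netCDF library default fill values (NC_FILL_* in netcdf.h).
NC_FILL_SHORT = -32767
NC_FILL_FLOAT = 9.9692099683868690e+36
NC_FILL_DOUBLE = 9.9692099683868690e+36

# Legacy fill value quoted in this issue for float and double.
LEGACY_FILL = 99999.0

# Hypothetical lookup: type code -> (struct format, fill value).
# Assumption: legacy float/double share the storage width of NC_FLOAT/NC_DOUBLE.
TYPE_TABLE = {
    "float": ("<f", LEGACY_FILL),
    "double": ("<d", LEGACY_FILL),
    "NC_FLOAT": ("<f", NC_FILL_FLOAT),
    "NC_DOUBLE": ("<d", NC_FILL_DOUBLE),
    "NC_SHORT": ("<h", NC_FILL_SHORT),
}


def describe(type_code):
    """Return (size in bytes, fill value) for a type code."""
    fmt, fill = TYPE_TABLE[type_code]
    return struct.calcsize(fmt), fill


def differ_only_in_fill(code_a, code_b):
    """True when two codes have the same storage size but different fill values."""
    size_a, fill_a = describe(code_a)
    size_b, fill_b = describe(code_b)
    return size_a == size_b and fill_a != fill_b
```

Under that assumption, `differ_only_in_fill("float", "NC_FLOAT")` and `differ_only_in_fill("double", "NC_DOUBLE")` are both true, which is the subtlety being asked about.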

To tackle this, we could create an additional table (I don't see any other table that would fit, but I may have missed it) with the list of netCDF types, plus float and double if they prove relevant as questioned above: https://docs.unidata.ucar.edu/nug/current/md_types.html

Then the R03 entries would be mapped to the corresponding format code.

vpaba commented 1 year ago

Thanks for opening this ticket @delphinedobler. I've raised it with the wider BODC Vocab Team, as some CF vocabularies already exist on the NVS (though not for Data Types yet I believe): http://vocab.nerc.ac.uk/search_nvs/cvl/?searchstr=CF&options=identifier,preflabel,altlabel,governance

I will update on what I find on the BODC side!

In the meantime, as you say, it would be good to understand whether the FillValue difference you've spotted is important, or whether we (Argo) would be happy with NC_FLOAT, NC_SHORT and NC_DOUBLE (and let go of 'double' and 'float').

apswong commented 1 year ago

@delphinedobler @vpaba The assignment of data types in the Argo netcdf files, such as double or nc_double, float or nc_float, is not due to the difference in FillValue, but is simply a matter of legacy. In the early years of Argo, we used the primitive data types that were generally used in programming (float, double, etc.) and assigned them an arbitrary but certain to be out-of-range number (99999.) as FillValues. More recently we started using the NetCDF data types (nc_float, nc_double, etc), which have their own defined FillValues. The result is that there is now a mix of data types (and FillValues) in the parameters xlsx. These differences are not important in terms of the data that we make public. However, we cannot rewrite the data types that are already assigned, because that will mean rewriting all the Argo netcdf files.

I don't think it's necessary to create an additional table for data types. But the assigned data type related to each parameter should be added in R03, similar to the min/max issue, since they are used by the File Checker.
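As a rough illustration of why the data type needs to travel with each R03 entry, here is a hedged Python sketch of the kind of check a file checker could run. It is not the actual File Checker code. The `ParameterEntry` record, the `check_variable` function, the `"f4"`/`"f8"`/`"i2"` storage labels and the parameter names in the usage note are all hypothetical, and the sketch assumes legacy float/double map to 32-bit and 64-bit storage.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical mapping: assigned data type -> (expected storage, expected fill value).
# NC_* fill values are the netCDF defaults; 99999 is the legacy fill value.
EXPECTED = {
    "float": ("f4", 99999.0),
    "double": ("f8", 99999.0),
    "NC_FLOAT": ("f4", 9.9692099683868690e+36),
    "NC_DOUBLE": ("f8", 9.9692099683868690e+36),
    "NC_SHORT": ("i2", -32767),
}


@dataclass
class ParameterEntry:
    """Illustrative stand-in for an R03 entry carrying its assigned data type."""
    name: str
    data_type: str


def check_variable(entry: ParameterEntry, file_storage: str, file_fill: float) -> List[str]:
    """Compare what a file declares against what the entry's data type implies."""
    storage, fill = EXPECTED[entry.data_type]
    problems = []
    if file_storage != storage:
        problems.append(f"{entry.name}: storage {file_storage}, expected {storage}")
    # Relative tolerance absorbs the rounding of a fill value stored as 32-bit float.
    if not math.isclose(file_fill, fill, rel_tol=1e-6):
        problems.append(f"{entry.name}: fill value {file_fill}, expected {fill}")
    return problems
```

For example, `check_variable(ParameterEntry("PARAM_A", "float"), "f4", 99999.0)` returns an empty list, while passing the netCDF default fill value for the same entry reports one problem. Without the data type recorded per parameter, the checker has no way to know which of the two fill values to expect.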

vpaba commented 2 months ago

@apswong thanks for the explanation above. Is the data type something relevant to R03 alone, or to other collections as well?

apswong commented 2 months ago

@vpaba Please be careful with using the term "data type". In the context of R03, the column labelled "data type" refers to the parameter attributes. There is also an official variable called "DATA_TYPE", which I believe is in R01.

In terms of the parameter attributes, I'm not sure but I think they are only relevant in R03. Perhaps @tcarval can confirm?

nvs-vocabs / ArgoVocabs

R03: need for data format code indication #56