index_netcdf could warn for common fillvalue errors

index_netcdf uses var_range() from nchelpers to determine the range of a variable.

Sometimes processes output a netCDF file where the _FillValue attribute of a variable is not the Official Fill Attribute. This error is unfortunately common, but surprisingly hard to detect. For example, you can look at an affected file with ncdump:

(venv) [lzeman@lynx lzeman]$ ncdump -h /storage/data/projects/comp_support/climate_explorer_data_prep/climatological_means/return_periods/all-canada/pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc 
netcdf pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990 {
dimensions:
    lon = 1068 ;
    lat = 510 ;
    time = UNLIMITED ; // (1 currently)
    bnds = 2 ;
variables:
    double lon(lon) ;
    double lat(lat) ;
    double time(time) ;
    double time_bnds(time, bnds) ;
    float rp5pr(time, lat, lon) ;
        rp5pr:_FillValue = 1.e+20f ;
        rp5pr:long_name = "5-year annual maximum one day precipitation amount" ;
        rp5pr:standard_name = "rp5pr" ;
        rp5pr:cell_methods = "time: maximum" ;
        rp5pr:units = "mm day-1" ;
        rp5pr:missing_value = 1.e+20f ;

// global attributes:
}

Or use ncview to look at the file: [image: ncview]

You can even look at this file in Python:

>>> from netCDF4 import Dataset
>>> data = Dataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.variables["rp5pr"]
<class 'netCDF4._netCDF4.Variable'>
float32 rp5pr(time, lat, lon)
    _FillValue: 1e+20
    long_name: 5-year annual maximum one day precipitation amount
    standard_name: rp5pr
    cell_methods: time: maximum
    units: mm day-1
    missing_value: 1e+20
unlimited dimensions: time
current shape = (1, 510, 1068)
filling on
>>> data.variables["rp5pr"][:]
masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]]],
  mask=[[[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]]],
  fill_value=1e+20,
  dtype=float32)

And everything looks reasonable.

However, if you get the variable range using var_range, the value of the _FillValue attribute will be included in the range:

>>> from nchelpers import CFDataset
>>> data = CFDataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.var_range("rp5pr")
(7.149994, 1e+20)

So when this unfortunately-reasonable-looking file is indexed, the maximum variable value will be 1e+20, which was likely intended to be a fill value, judging from its presence in the _FillValue attribute.

This type of file error is quite hard to detect in advance, since it does not show up in any of the common netCDF-checking tools. It would be wonderful if index_netcdf would print a warning when both of the following are true:

a variable has a _FillValue attribute, and
the range of the variable, as returned by var_range, includes the value of the _FillValue attribute as either its minimum or its maximum.

That is a Bad Data Smell and whoever is indexing probably wants to know! Certainly would save me some headaches.
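
A minimal sketch of what such a check might look like, to make the request concrete. The function names fill_value_in_range and warn_if_fill_value_in_range are hypothetical, not part of index_netcdf or nchelpers. The comparison uses a relative tolerance rather than exact equality, on the assumption that a float32 fill value such as 1.e+20f may not compare exactly equal once converted to a Python float.

```python
import math
import warnings


def fill_value_in_range(fill_value, value_range, rel_tol=1e-6):
    """Return True if fill_value matches the minimum or maximum of value_range.

    Hypothetical helper. value_range is a (min, max) tuple like the one
    returned by var_range. fill_value is None when the variable has no
    _FillValue attribute.
    """
    if fill_value is None:
        return False
    return any(
        math.isclose(float(bound), float(fill_value), rel_tol=rel_tol)
        for bound in value_range
    )


def warn_if_fill_value_in_range(var_name, fill_value, value_range):
    """Emit a warning when the fill value shows up as a range endpoint."""
    if fill_value_in_range(fill_value, value_range):
        warnings.warn(
            f"Variable {var_name!r}: range {value_range} includes the "
            f"_FillValue ({fill_value}); the file may contain unmasked "
            f"fill values."
        )
        return True
    return False


# The range reported for rp5pr in the example above triggers the warning.
flagged = warn_if_fill_value_in_range("rp5pr", 1e20, (7.149994, 1e20))
```

In the indexer, the inputs could presumably come from something like getattr(variable, "_FillValue", None) and data.var_range(name), though where exactly the check belongs in index_netcdf is left open here.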

pacificclimate / modelmeta

index_netcdf could warn for common fillvalue errors #103