pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0
1.06k stars 292 forks source link

cf_writer adds a _FillValue = NaN to coordinate variables #2845

Open joleenf opened 3 months ago

joleenf commented 3 months ago

The cf_writer adds a _FillValue to the final netCDF output for the lat/lon coordinates. In this case, adapted from an abi_fixed_grid, the dims are ("y", "x") while the coords are ("latitude", "longitude") even so, the coordinates should not contain a _FillValue upon writing to the netCDF file. https://github.com/pydata/xarray/issues/1598

Reproduce:

import dask.array as da
import os
import xarray as xr

from pyresample import AreaDefinition
from satpy import Scene

area_id = "abi_geos"
description = "Strided ABI L2 File Area"
proj_id = "abi_geos"
projection = {'ellps': 'GRS80', 'h': '35786023', 'lon_0': '-75', 'no_defs': 'None', 'proj': 'geos', 'sweep': 'x',
              'type': 'crs', 'units': 'm', 'x_0': '0', 'y_0': '0'}
width = 100
height = 100
area_extent = (-3627271.341, 1583173.822, 1382771.9518, 4589199.7649)
area_def = AreaDefinition(area_id, description, proj_id, projection, width, height, area_extent)
lonlats = area_def.get_lonlats()

scn = Scene()

scn["test"] = xr.DataArray(data=da.zeros((100, 100)), dims=("y", "x"), attrs={"name": "test", "area": area_def})

cf_ds = scn.to_xarray()
print(f"Longitude Encoding: {cf_ds['longitude'].encoding}")
print(f"Longitude Attrs: {cf_ds['longitude'].attrs}")
print(f"Longitude Min: {lonlats[0].min()}, Latitude Min: {lonlats[1].min()}")
print(f"Longitude Max: {lonlats[0].max()}, Latitude Max: {lonlats[1].max()}")
outpath = os.path.join(os.path.expanduser("~"), "test.nc")
scn.save_datasets(filename=outpath)

Though the original attributes do not contain a _FillValue, the resulting netCDF does, the code above prints:

Longitude Encoding: {}
Longitude Attrs: {'name': 'longitude', 'standard_name': 'longitude', 'units': 'degrees_east'}
Longitude Min: -147.87682093113054, Latitude Min: 14.704946854219488
Longitude Max: inf, Latitude Max: inf

but an ncdump -h on test.nc shows the addition of the _FillValue:

    double longitude(y, x) ;
        longitude:_FillValue = NaN ;
        longitude:name = "longitude" ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
    double latitude(y, x) ;
        latitude:_FillValue = NaN ;
        latitude:name = "latitude" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
    double test(y, x) ;
        test:_FillValue = NaN ;
        test:grid_mapping = "abi_geos" ;
        test:long_name = "test" ;
        test:coordinates = "latitude longitude" ;

However, it is possible to add encoding to save_datasets to save the lat/lon without _FillValue: scn.save_datasets(filename=test.nc, encoding={"latitude": {"_FillValue": None}, "longitude": {"_FillValue": None}})

        longitude:name = "longitude" ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
    double latitude(y, x) ;
        latitude:name = "latitude" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
    double test(y, x) ;
        test:_FillValue = NaN ;
        test:grid_mapping = "abi_geos" ;
        test:long_name = "test" ;
        test:coordinates = "latitude longitude" 

It seems that the encoding for writing netCDF data should include this somewhere in save_datasets rather than being typed explicitly.

djhoese commented 3 months ago

I'm not sure I agree that there should be no _FillValue. The argument in this xarray issue (https://github.com/pydata/xarray/issues/1865) if I remember correctly is more about matching the CF standard. In CF a coordinate variable is a 1D variable that matches the name of a dimension. As we discussed on slack, it makes sense (as mentioned in the xarray issue about the CF standard) that you can't have fill values on a coordinate 1D variable. You can't have a pixel of data that has a "location" of (NaN, NaN). It just doesn't make sense. BUT our 2D lon/lats in CF-land I don't think are technically coordinate variables at least as far as the missing value concern...is concerned. This section of the CF docs makes me think that it is expected:

http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#reduced-horizontal-grid

Storing this type of gridded data in two-dimensional arrays wastes space, and results in the presence of missing values in the 2D coordinate variables.

So on this page:

http://cfconventions.org/cf-conventions/cf-conventions.html#terminology

auxiliary coordinate variable

Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

Which is used:

http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data

Missing data is allowed in data variables and auxiliary coordinate variables. Generic applications should treat the data as missing where any auxiliary coordinate variables have missing values; special-purpose applications might be able to make use of the data. Missing data is not allowed in coordinate variables.

So if our 2D lon/lats are considered "auxiliary" then they're fine to have a _FillValue in CF.

Anyway, my opinion is that the lon/lat 2D arrays should have a _FillValue or at the very least a valid_range.