nansencenter / nansat

Scientist friendly Python toolbox for processing 2D satellite Earth observation data.
http://nansat.readthedocs.io
GNU General Public License v3.0
182 stars 66 forks source link

Modify export function #525

Closed mortenwh closed 1 year ago

mortenwh commented 1 year ago

GDAL changes the attribute names by prepending "GDAL_" and adds band numbers to the variable names. This should not be done. Also, it is a bit unclear to me if np.nan and netCDF fill values are treated correctly. This seems to be related to #123 as well.

Regarding the attribute and variable names, I suggest adding some "cleaning" functions at the end of the export function.

A PR related to this will also contain some other minor changes that I've made in my own fork.

Regarding the fill values, I have played with the arctic.nc test file. It turns out the fill values in that file are -10000 and not -32767 as given by the metadata:

In [3]: ds = Dataset('data/arctic.nc')

In [4]: ds.variables['Bootstrap']
Out[4]: 
<class 'netCDF4._netCDF4.Variable'>
int16 Bootstrap(time, y, x)
    grid_mapping: polar_stereographic
    description: Sea ice algorithm name: Bootstrap
    short_name: ice_conc
    minmax: 0 100
    colormap: jet
    standard_name: sea_ice_area_fraction
    long_name: Sea Ice Concentration (Bootstrap)
    valid_min: 0
    units: %
    wkv: sea_ice_area_fraction
    valid_max: 100
    name: Bootstrap
unlimited dimensions: 
current shape = (1, 240, 240)
filling on, default _FillValue of -32767 used

In [5]: ds.variables['Bootstrap'][:].data
Out[5]: 
array([[[     0,      0,      0, ...,      0,      0,      0],
        [     0,      0,      0, ...,      0,      0,      0],
        [     0,      0,      0, ...,      0,      0,      0],
        ...,
        [     0,      0,      0, ..., -10000, -10000, -10000],
        [     0,      0,      0, ..., -10000, -10000, -10000],
        [     0,      0,      0, ..., -10000, -10000, -10000]]],
      dtype=int16)

But netCDF4 treats the data correctly:

In [4]: ds.variables['Bootstrap'][:]
Out[4]: 
masked_array(
  data=[[[0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         ...,
         [0, 0, 0, ..., --, --, --],
         [0, 0, 0, ..., --, --, --],
         [0, 0, 0, ..., --, --, --]]],
  mask=[[[False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         ...,
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True]]],
  fill_value=-32767,
  dtype=int16)

Opening the file with nansat using python3.7 and GDAL3.0.4:

In [11]: n = Nansat('nansat/tests/data/arctic.nc')
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140474394756928:
  #000: H5F.c line 509 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1400 in H5F__open(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #002: H5Fint.c line 1700 in H5F_open(): unable to read superblock
    major: File accessibilty
    minor: Read failed
  #003: H5Fsuper.c line 411 in H5F__super_read(): file signature not found
    major: File accessibilty
    minor: Not an HDF5 file

In [12]: n['Bootstrap']
Out[12]: 
array([[     0,      0,      0, ..., -10000, -10000, -10000],
       [     0,      0,      0, ..., -10000, -10000, -10000],
       [     0,      0,      0, ..., -10000, -10000, -10000],
       ...,
       [     0,      0,      0, ...,      0,      0,      0],
       [     0,      0,      0, ...,      0,      0,      0],
       [     0,      0,      0, ...,      0,      0,      0]], dtype=int16)

And we don't have information to mask the invalid data in Nansat

In [13]: n.get_metadata(band_id='Bootstrap', key='_FillValue')
Out[13]: '-32767.0'

With python3.9 and GDAL3.4.1:

In [13]: n = Nansat('data/arctic.nc')

In [14]: n['Bootstrap']
Out[14]: 
array([[     0,      0,      0, ..., -32767, -32767, -32767],
       [     0,      0,      0, ..., -32767, -32767, -32767],
       [     0,      0,      0, ..., -32767, -32767, -32767],
       ...,
       [     0,      0,      0, ...,      0,      0,      0],
       [     0,      0,      0, ...,      0,      0,      0],
       [     0,      0,      0, ...,      0,      0,      0]], dtype=int16)

And here we have the same fill value in nansat:

In [6]: n.get_metadata(band_id='Bootstrap', key='_FillValue')
Out[6]: '-32767.0'

The nansat code is the same but I'm wondering if gdal v3.4.1 masks values outside the interval valid_min:valid_max.

What to do, except being careful when using nansat?