pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

uint type data are read as wrong type (float64) #6091

Closed zxdawn closed 2 years ago

zxdawn commented 2 years ago

What happened:

Variables stored as uint are read as float64 instead of keeping their uint dtype.

Minimal Complete Verifiable Example:

import xarray as xr

print(xr.open_dataset('test_save.nc')['processing_quality_flags'].dtype)

Anything else we need to know?:

The sample data is attached here. The output of ncdump -h test_save.nc:

netcdf test_save {
dimensions:
    y = 3246 ;
    x = 450 ;
variables:
    float longitude(y, x) ;
        longitude:_FillValue = NaNf ;
        longitude:name = "longitude" ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
    float latitude(y, x) ;
        latitude:_FillValue = NaNf ;
        latitude:name = "latitude" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
    uint processing_quality_flags(y, x) ;
        processing_quality_flags:_FillValue = 4294967295U ;
        processing_quality_flags:comment = "Flags indicating conditions that affect quality of the retrieval." ;
        processing_quality_flags:end_time = "2019-07-02 05:00:24" ;
        processing_quality_flags:file_key = "PRODUCT/SUPPORT_DATA/DETAILED_RESULTS/processing_quality_flags" ;
        processing_quality_flags:file_type = "tropomi_l2" ;
        processing_quality_flags:flag_masks = 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 256U, 512U, 1024U, 2048U, 4096U, 8192U, 16384U, 32768U, 65536U, 131072U, 262144U, 524288U, 1048576U, 2097152U, 4194304U, 8388608U, 16777216U, 33554432U, 67108864U, 134217728U, 268435456U, 536870912U ;
        processing_quality_flags:flag_meanings = "success radiance_missing irradiance_missing input_spectrum_missing reflectance_range_error ler_range_error snr_range_error sza_range_error vza_range_error lut_range_error ozone_range_error wavelength_offset_error initialization_error memory_error assertion_error io_error numerical_error lut_error ISRF_error convergence_error cloud_filter_convergence_error max_iteration_convergence_error aot_lower_boundary_convergence_error other_boundary_convergence_error geolocation_error ch4_noscat_zero_error h2o_noscat_zero_error max_optical_thickness_error aerosol_boundary_error boundary_hit_error chi2_error svd_error dfs_error radiative_transfer_error optimal_estimation_error profile_error cloud_error model_error number_of_input_data_points_too_low_error cloud_pressure_spread_too_low_error cloud_too_low_level_error generic_range_error generic_exception input_spectrum_alignment_error abort_error wrong_input_type_error wavelength_calibration_error coregistration_error slant_column_density_error airmass_factor_error vertical_column_density_error signal_to_noise_ratio_error configuration_error key_error saturation_error max_num_outlier_exceeded_error solar_eclipse_filter cloud_filter altitude_consistency_filter altitude_roughness_filter sun_glint_filter mixed_surface_type_filter snow_ice_filter aai_filter cloud_fraction_fresco_filter aai_scene_albedo_filter small_pixel_radiance_std_filter cloud_fraction_viirs_filter cirrus_reflectance_viirs_filter cf_viirs_swir_ifov_filter cf_viirs_swir_ofova_filter cf_viirs_swir_ofovb_filter cf_viirs_swir_ofovc_filter cf_viirs_nir_ifov_filter cf_viirs_nir_ofova_filter cf_viirs_nir_ofovb_filter cf_viirs_nir_ofovc_filter refl_cirrus_viirs_swir_filter refl_cirrus_viirs_nir_filter diff_refl_cirrus_viirs_filter ch4_noscat_ratio_filter ch4_noscat_ratio_std_filter h2o_noscat_ratio_filter h2o_noscat_ratio_std_filter diff_psurf_fresco_ecmwf_filter psurf_fresco_stdv_filter ocean_filter time_range_filter 
pixel_or_scanline_index_filter geographic_region_filter input_spectrum_warning wavelength_calibration_warning extrapolation_warning sun_glint_warning south_atlantic_anomaly_warning sun_glint_correction snow_ice_warning cloud_warning AAI_warning pixel_level_input_data_missing data_range_warning low_cloud_fraction_warning altitude_consistency_warning signal_to_noise_ratio_warning deconvolution_warning so2_volcanic_origin_likely_warning so2_volcanic_origin_certain_warning interpolation_warning saturation_warning high_sza_warning cloud_retrieval_warning cloud_inhomogeneity_warning" ;
        processing_quality_flags:flag_values = 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U, 8U, 9U, 10U, 11U, 12U, 13U, 14U, 15U, 16U, 17U, 18U, 19U, 20U, 21U, 22U, 23U, 24U, 25U, 26U, 27U, 28U, 29U, 30U, 31U, 32U, 33U, 34U, 35U, 36U, 37U, 38U, 39U, 40U, 41U, 42U, 43U, 44U, 45U, 46U, 47U, 48U, 49U, 50U, 51U, 52U, 53U, 54U, 55U, 64U, 65U, 66U, 67U, 68U, 69U, 70U, 71U, 72U, 73U, 74U, 75U, 76U, 77U, 78U, 79U, 80U, 81U, 82U, 83U, 84U, 85U, 86U, 87U, 88U, 89U, 90U, 91U, 92U, 93U, 94U, 95U, 96U, 97U, 256U, 512U, 1024U, 2048U, 4096U, 8192U, 16384U, 32768U, 65536U, 131072U, 262144U, 524288U, 1048576U, 2097152U, 4194304U, 8388608U, 16777216U, 33554432U, 67108864U, 134217728U, 268435456U, 536870912U ;
        processing_quality_flags:long_name = "Processing quality flags" ;
        processing_quality_flags:modifiers = "" ;
        processing_quality_flags:platform_shortname = "S5P" ;
        processing_quality_flags:reader = "tropomi_l2" ;
        processing_quality_flags:sensor = "tropomi" ;
        processing_quality_flags:start_time = "2019-07-02 03:18:54" ;
        processing_quality_flags:coordinates = "latitude longitude" ;
}

Note that I can't reproduce it using this example:

import numpy as np
import xarray as xr

da = xr.DataArray(np.array([1,2,3], dtype='uint')).rename('test_array')
da.to_netcdf("test.nc", engine='netcdf4')
with xr.open_dataset('test.nc') as ds:
    print(ds['test_array'].dtype)

>>> uint64

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: 0.20.1
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: 4.11.0
pytest: None
IPython: 7.30.0
sphinx: None

andersy005 commented 2 years ago

> Note that I can't reproduce it using this example:

I could be wrong, but it appears that when you introduce a _FillValue in your DataArray, you end up with the same outcome:

In [53]: import numpy as np
    ...: import xarray as xr
    ...: 
    ...: da = xr.DataArray(np.array([1,2,4294967295], dtype='uint')).rename('test_array')

In [56]: da.encoding['_FillValue'] = 4294967295
In [62]: da.to_netcdf("test.nc", engine='netcdf4')

In [63]: !ncdump -h test.nc
netcdf test {
dimensions:
        dim_0 = 3 ;
variables:
        uint64 test_array(dim_0) ;
                test_array:_FillValue = 4294967295ULL ;
data:

 test_array = 1, 2, _ ;
}
In [64]: from netCDF4 import Dataset
    ...: d = Dataset("test.nc")

In [65]: d
Out[65]: 
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): dim_0(3)
    variables(dimensions): uint64 test_array(dim_0)
    groups: 

In [66]: xr.open_dataset('test.nc')
Out[66]: 
<xarray.Dataset>
Dimensions:     (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    test_array  (dim_0) float64 ...
In [67]: xr.open_dataset('test.nc').test_array
Out[67]: 
<xarray.DataArray 'test_array' (dim_0: 3)>
array([ 1.,  2., nan])
Dimensions without coordinates: dim_0

Notice that xarray uses np.nan as the sentinel for missing values, i.e. those equal to _FillValue. Because np.nan is a float, masking forces the entire integer array to be promoted to floating point.
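
The promotion described above can be reproduced with NumPy alone. This is only a sketch of the mechanism, not xarray's actual decoding code; the array and the names raw, masked and restored are made up for illustration, with the fill value taken from the ncdump output in this thread.

```python
import numpy as np

# Hypothetical stand-in for the variable in this thread: uint32 data whose
# last element equals the _FillValue reported by ncdump.
fill_value = np.uint32(4294967295)
raw = np.array([1, 2, 4294967295], dtype=np.uint32)

# Replacing the fill value with NaN requires a dtype that can represent NaN,
# so NumPy promotes the whole result to float64.
masked = np.where(raw == fill_value, np.nan, raw)
print(raw.dtype, "->", masked.dtype)

# Every uint32 value is exactly representable in float64, so putting the fill
# value back and casting recovers the original integers.
restored = np.where(np.isnan(masked), fill_value, masked).astype(np.uint32)
print(np.array_equal(restored, raw))
```

The round trip is exact here only because the data are 32-bit; uint64 values above 2**53 would not survive a detour through float64.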

zxdawn commented 2 years ago

Ha, thanks. It makes sense now. Shall we close this?

andersy005 commented 2 years ago

> Ha, thanks. It makes sense now. Shall we close this?

Great! I'm closing this for the time being...
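
For anyone who needs the raw flag integers rather than NaN-masked floats, one option that may help is to turn off CF masking when decoding. The sketch below builds a small in-memory dataset instead of reading test_save.nc; the variable name flags and the dimension name x are made up for illustration. It relies on the mask_and_scale keyword of xr.decode_cf.

```python
import numpy as np
import xarray as xr

# Hypothetical in-memory stand-in for an undecoded file variable: uint32 data
# carrying a CF _FillValue attribute, as in the ncdump output in this thread.
raw = xr.Dataset(
    {
        "flags": (
            "x",
            np.array([1, 2, 4294967295], dtype=np.uint32),
            {"_FillValue": np.uint32(4294967295)},
        )
    }
)

# Default decoding masks the fill value with NaN, which promotes the dtype.
decoded = xr.decode_cf(raw)

# With mask_and_scale=False the data are left as stored, and _FillValue
# remains available as a plain attribute.
undecoded = xr.decode_cf(raw, mask_and_scale=False)

print(decoded["flags"].dtype)
print(undecoded["flags"].dtype)
```

xr.open_dataset accepts the same keyword, e.g. xr.open_dataset('test_save.nc', mask_and_scale=False). It applies to every variable in the file, so the float variables would also be returned without masking or scaling.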