spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

Bump xarray to 0.14.1 and python to 3.6 #334

Closed spencerahill closed 4 years ago

spencerahill commented 4 years ago

In #235, we switched from scipy to netcdf4 as the engine for our to_netcdf calls, and we included the format='NETCDF3_64BIT' kwarg. I can't recall why we did that, and neither the PR nor the code says anything about it.

I bring this up because our tests fail with xarray 0.14.1 when that option is included, but with it omitted everything passes.

@spencerkclark do you happen to remember why we did this? And any other comments on the PR would be welcome. Note that the failures are from the now obsolete python 3.5 CI jobs; all the others pass.

spencerkclark commented 4 years ago

I bring this up because our tests fail with xarray 0.14.1 when that option is included, but with it omitted everything passes.

Huh. That's surprising to me. Out of curiosity, do you have an example test failure or traceback?

@spencerkclark do you happen to remember why we did this?

If I recall correctly, our reason for using format='NETCDF3_64BIT' was that ncview was outdated on the GFDL PP/AN cluster and could not read netCDF files with more modern formats (#69). The SciPy netCDF I/O backend only supports NETCDF3 files, so it uses that format automatically; with the netcdf4 backend, we need to specify it via a kwarg.
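As an aside, the backend difference described above can be sketched with a minimal example (assuming xarray plus either the scipy or netcdf4 backend is installed). Classic netCDF files are identifiable by their leading magic bytes, so we can confirm the on-disk format directly:

```python
import os
import tempfile

import numpy as np
import xarray as xr

# A tiny dataset with int32 data, which every netCDF-3 format can store safely.
ds = xr.Dataset({"x": ("a", np.arange(3, dtype="int32"))})

path = os.path.join(tempfile.mkdtemp(), "test.nc")
# With the scipy backend this format falls out automatically (as "version 2");
# with the netcdf4 backend the explicit format kwarg is required.
ds.to_netcdf(path, format="NETCDF3_64BIT")

# Classic netCDF magic bytes: b"CDF\x01" for NETCDF3_CLASSIC,
# b"CDF\x02" for the 64-bit-offset NETCDF3_64BIT format.
with open(path, "rb") as f:
    magic = f.read(4)
print(magic)  # b'CDF\x02'
```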

spencerahill commented 4 years ago

If I recall correctly, our reason for using format='NETCDF3_64BIT' was that ncview was outdated on the GFDL PP/AN cluster and could not read netCDF files with more modern formats (#69).

That's right. Thanks! Not a compelling reason to keep using this old format (right?).

do you have an example test failure or traceback?

Sure:

(aospy-dev) [xarray-0.14.1 !]shill@shillbook:~/Dropbox/py/aospy/aospy/test$ pytest test_calc_basic.py
==================================================================== test session starts =====================================================================
platform darwin -- Python 3.7.2, pytest-5.0.1, py-1.8.0, pluggy-0.13.0
rootdir: /Users/shill/Dropbox/py/aospy, inifile: setup.cfg
plugins: cov-2.6.1
collected 170 items

test_calc_basic.py FFFFF^CFFFFF^C

========================================================================== FAILURES ==========================================================================
_________________________________________________________ test_annual_mean[datetime-basic-None-None] _________________________________________________________

test_params = {'date_range': (datetime.datetime(4, 1, 1, 0, 0), datetime.datetime(6, 12, 31, 0, 0)), 'dtype_in_time': 'ts', 'dtype_in_vert': None, 'dtype_out_vert': None, ...}

    def test_annual_mean(test_params):
        calc = Calc(intvl_out='ann', dtype_out_time='av', **test_params)
>       calc.compute()

test_calc_basic.py:110:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../calc.py:449: in compute
    save_files=True, write_to_tar=write_to_tar)
../calc.py:528: in save
    self._save_files(data, dtype_out_time)
../calc.py:468: in _save_files
    data_out.to_netcdf(path, engine='netcdf4', format='NETCDF3_64BIT')
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/core/dataset.py:1547: in to_netcdf
    invalid_netcdf=invalid_netcdf,
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/api.py:1073: in to_netcdf
    dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/api.py:1119: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/common.py:293: in store
    variables, attributes = self.encode(variables, attributes)
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/common.py:383: in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/common.py:383: in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netCDF4_.py:436: in encode_variable
    variable = encode_nc3_variable(variable)
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netcdf3.py:82: in encode_nc3_variable
    attrs = encode_nc3_attrs(var.attrs)
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netcdf3.py:72: in encode_nc3_attrs
    return {k: encode_nc3_attr_value(v) for k, v in attrs.items()}
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netcdf3.py:72: in <dictcomp>
    return {k: encode_nc3_attr_value(v) for k, v in attrs.items()}
/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netcdf3.py:65: in encode_nc3_attr_value
    value = coerce_nc3_dtype(np.atleast_1d(value))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([-9223372036854775808])

    def coerce_nc3_dtype(arr):
        """Coerce an array to a data type that can be stored in a netCDF-3 file

        This function performs the following dtype conversions:
            int64 -> int32
            bool -> int8

        Data is checked for equality, or equivalence (non-NaN values) with
        `np.allclose` with the default keyword arguments.
        """
        dtype = str(arr.dtype)
        if dtype in _nc3_dtype_coercions:
            new_dtype = _nc3_dtype_coercions[dtype]
            # TODO: raise a warning whenever casting the data-type instead?
            cast_arr = arr.astype(new_dtype)
            if not (cast_arr == arr).all():
                raise ValueError(
>                   f"could not safely cast array from dtype {dtype} to {new_dtype}"
                )
E               ValueError: could not safely cast array from dtype int64 to int32

/Users/shill/miniconda3/envs/aospy-dev/lib/python3.7/site-packages/xarray/backends/netcdf3.py:53: ValueError
-------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
INFO:root:Getting input data: Var instance "condensation_rain" (Tue Dec 10 16:48:01 2019)
WARNING:root:Datapoints were stored using the np.float32 datatype.For accurate reduction operations using bottleneck, datapoints are being cast to the np.float64 datatype. For more information see: https://github.com/pydata/xarray/issues/1346
INFO:root:Computing timeseries for 0004-01-01 00:00:00 -- 0006-12-31 00:00:00.
INFO:root:Applying desired time-reduction methods. (Tue Dec 10 16:48:02 2019)
INFO:root:Writing desired gridded outputs to disk.
--------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------
INFO     root:calc.py:299 Getting input data: Var instance "condensation_rain" (Tue Dec 10 16:48:01 2019)
WARNING  root:data_loader.py:135 Datapoints were stored using the np.float32 datatype.For accurate reduction operations using bottleneck, datapoints are being cast to the np.float64 datatype. For more information see: https://github.com/pydata/xarray/issues/1346
INFO     root:calc.py:439 Computing timeseries for 0004-01-01 00:00:00 -- 0006-12-31 00:00:00.
INFO     root:calc.py:423 Applying desired time-reduction methods. (Tue Dec 10 16:48:02 2019)
INFO     root:calc.py:443 Writing desired gridded outputs to disk.
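The ValueError at the bottom of this traceback can be reproduced with plain NumPy. This sketch mirrors the check in coerce_nc3_dtype shown above: the offending array value is the minimum int64 value (which is how NaT is represented internally), and it has no int32 representation, so the post-cast equality check fails:

```python
import numpy as np

# int64 min, i.e. the integer encoding of np.datetime64("NaT")
arr = np.array([-9223372036854775808])
cast_arr = arr.astype(np.int32)

# coerce_nc3_dtype raises when the round-trip changes any value
safe = bool((cast_arr == arr).all())
print(safe)  # False
```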
spencerkclark commented 4 years ago

Thanks! I agree this is not something to get too hung up on in aospy. I think it's totally reasonable to start writing files out in a modern format. I do wonder though if this would be considered a bug in xarray.

A git bisect session points to the commit below in xarray as the culprit. Let me see if I can come up with a minimal working example.

$ git bisect good
eece07932d5498a8abef6a8fbd30d00066931b18 is the first bad commit
commit eece07932d5498a8abef6a8fbd30d00066931b18
Author: Anderson Banihirwe <axbanihirwe@ualr.edu>
Date:   Wed Nov 13 18:22:50 2019 -0700

    Harmonize `FillValue` and `missing_value` during encoding and decoding steps (#3502)

    * Replace `equivalent()` with `allclose_or_equiv()`

    * Ensure _FillValue & missing_value are cast to same dtype as data's

    * Use Numpy scalar during type casting

    * Update ValueError message

    * Formatting only

    * Update whats-new.rst

 doc/whats-new.rst           |  2 ++
 xarray/coding/variables.py  | 14 ++++++++++----
 xarray/tests/test_coding.py | 17 +++++++++++++++++
 3 files changed, 29 insertions(+), 4 deletions(-)
spencerkclark commented 4 years ago

I think I found a minimal example; this gives a different error, but I think it's a symptom of the same problem (in fact you don't even need to specify the old file format in this case):

In [1]: import numpy as np; import pandas as pd; import xarray as xr

In [2]: times = pd.date_range('2000', periods=3)

In [3]: da = xr.DataArray(times, dims=['a'], coords=[[1, 2, 3]], name='foo')

In [4]: da.encoding['_FillValue'] = 1.0e20

In [5]: da.encoding['dtype'] = np.dtype('float64')

In [6]: da.to_dataset().to_netcdf('test.nc', format='NETCDF3_64BIT')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-6-736fea18f4cd> in <module>
----> 1 da.to_dataset().to_netcdf('test.nc', format='NETCDF3_64BIT')

~/Software/xarray/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1548             unlimited_dims=unlimited_dims,
   1549             compute=compute,
-> 1550             invalid_netcdf=invalid_netcdf,
   1551         )
   1552

~/Software/xarray/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1071         # to be parallelized with dask
   1072         dump_to_store(
-> 1073             dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1074         )
   1075         if autoclose:

~/Software/xarray/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1117         variables, attrs = encoder(variables, attrs)
   1118
-> 1119     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1120
   1121

~/Software/xarray/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    291             writer = ArrayWriter()
    292
--> 293         variables, attributes = self.encode(variables, attributes)
    294
    295         self.set_attributes(attributes)

~/Software/xarray/xarray/backends/common.py in encode(self, variables, attributes)
    380         # All NetCDF files get CF encoded by default, without this attempting
    381         # to write times, for example, would fail.
--> 382         variables, attributes = cf_encoder(variables, attributes)
    383         variables = {k: self.encode_variable(v) for k, v in variables.items()}
    384         attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}

~/Software/xarray/xarray/conventions.py in cf_encoder(variables, attributes)
    758     _update_bounds_encoding(variables)
    759
--> 760     new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    761
    762     # Remove attrs from bounds variables (issue #2921)

~/Software/xarray/xarray/conventions.py in <dictcomp>(.0)
    758     _update_bounds_encoding(variables)
    759
--> 760     new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    761
    762     # Remove attrs from bounds variables (issue #2921)

~/Software/xarray/xarray/conventions.py in encode_cf_variable(var, needs_copy, name)
    248         variables.UnsignedIntegerCoder(),
    249     ]:
--> 250         var = coder.encode(var, name=name)
    251
    252     # TODO(shoyer): convert all of these to use coders, too:

~/Software/xarray/xarray/coding/variables.py in encode(self, variable, name)
    163         if fv is not None:
    164             # Ensure _FillValue is cast to same dtype as data's
--> 165             encoding["_FillValue"] = data.dtype.type(fv)
    166             fill_value = pop_to(encoding, attrs, "_FillValue", name=name)
    167             if not pd.isnull(fill_value):

OverflowError: Python int too large to convert to C long

I'll put together an xarray issue. I think I have a decent idea of what the solution needs to be.
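The overflow itself is easy to verify in isolation: datetimes are encoded as int64, and a fill value of 1.0e20 is far outside that range, so the cast on the data.dtype.type(fv) line in the traceback cannot succeed. A minimal NumPy sketch:

```python
import numpy as np

fill_value = 1.0e20  # the _FillValue set in the example above
int64_max = np.iinfo(np.int64).max  # 9223372036854775807, about 9.2e18

# 1.0e20 exceeds the int64 range by roughly an order of magnitude.
print(int(fill_value) > int64_max)  # True

try:
    np.int64(fill_value)  # mirrors data.dtype.type(fv) in the traceback
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```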

spencerahill commented 4 years ago

Cool, thanks @spencerkclark for digging into that!

OK, going ahead with the merge then. The remaining failure is from the now-obsolete Python 3.5 environment; all others are green.