pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.64k stars 1.09k forks

Bottleneck and dask objects ignore `min_periods` on `rolling` #4922

Open bradyrx opened 3 years ago

bradyrx commented 3 years ago

What happened:

When bottleneck is installed in an environment, xarray appears to ignore the min_periods kwarg on ds.rolling(...).

What you expected to happen:

When using ds.rolling(..., min_periods=1), it should be able to handle an array of length 1. Without bottleneck installed, it returns the original value for a length-1 array. With bottleneck installed, the error is:

ValueError: Moving window (=2) must between 1 and 1, inclusive

Minimal Complete Verifiable Example:

With bottleneck installed to environment:

import xarray as xr
ds = xr.DataArray([1], dims='time')
ds.rolling(time=2, center=True, min_periods=1).mean()
ValueError: Moving window (=2) must between 1 and 1, inclusive

Without bottleneck installed to environment:

import xarray as xr
ds = xr.DataArray([1], dims='time')
ds.rolling(time=2, center=True, min_periods=1).mean()

>>> <xarray.DataArray (time: 1)>
>>> array([1.])
>>> Dimensions without coordinates: time

Anything else we need to know?:

In an applied case, this came up while working on .groupby('time.dayofyear').map(_rolling), where we map a rolling-mean function over a defined N days with min_periods=1. Some day-of-year groups (e.g. day 366, which only occurs in leap years) will not meet the N-day requirement, so the min_periods catch handles that, but with bottleneck installed it breaks due to the above issue.
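Until this is resolved, one workaround in that groupby pattern is to clamp the requested window to the group length, so bottleneck's bounds check (the window must lie between 1 and the axis length, per the error above) always passes. A minimal sketch — clamp_window is a hypothetical helper, not part of xarray:

```python
def clamp_window(window, length):
    """Shrink a requested rolling window to the available axis length.

    Bottleneck raises when window > length, so clamping keeps short
    day-of-year groups (e.g. leap days) from erroring; min_periods=1
    then still averages whatever points exist.
    """
    return max(1, min(window, length))

# usage sketch inside the mapped function (assumes an xarray object `ds`):
# ds.rolling(time=clamp_window(6, ds.sizes['time']), min_periods=1).mean()
```

For a trailing window (center=False) this should give the same values as the numpy path when the group is shorter than the window, since the trailing window already covers every available point.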

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12) [Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.2.1
numpy: 1.19.5
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.6.1
cftime: 1.3.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.3.3
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 21.0
conda: None
pytest: 6.2.2
IPython: 7.18.1
sphinx: 3.4.3
dcherian commented 3 years ago

Maybe the padding is breaking down for length-1 arrays?

I would look at padded in this function https://github.com/pydata/xarray/blob/c9b9eec73a033e275b6012e7d391dd42591ccf52/xarray/core/rolling.py#L461-L463

bradyrx commented 3 years ago

@dcherian, to add to the complexity here, it's even weirder than originally reported. See my test cases below. This might alter how this bug is approached.

import xarray as xr

def _rolling(ds):
    return ds.rolling(time=6, center=False, min_periods=1).mean()

# Length 3 array to test that min_periods is called in, despite asking
# for 6 time-steps of smoothing
ds = xr.DataArray([1, 2, 3], dims='time')
ds['time'] = xr.cftime_range(start='2021-01-01', freq='D', periods=3)

1. With bottleneck installed, min_periods is ignored as a kwarg with in-memory arrays.

(bottleneck installed)

# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

>>> ValueError: Moving window (=6) must between 1 and 3, inclusive

# Group into single day climatology groups and apply
ds.groupby('time.dayofyear').map(_rolling)

>>> ValueError: Moving window (=6) must between 1 and 1, inclusive

2. With bottleneck uninstalled, min_periods works with in-memory arrays.

(bottleneck uninstalled)

# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

>>> <xarray.DataArray (time: 3)>
>>> array([1. , 1.5, 2. ])
>>> Coordinates:
>>> * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00

# Group into single day climatology groups and apply
ds.groupby('time.dayofyear').map(_rolling)

>>> <xarray.DataArray (time: 3)>
>>> array([1., 2., 3.])
>>> Coordinates:
>>>  * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00
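The numpy-path results above follow standard trailing-window semantics: each point averages however many observations fall in its window, and a point becomes NaN only when fewer than min_periods valid observations are available. A pure-numpy sketch of those semantics (rolling_mean_min_periods is a hypothetical helper, not xarray's actual implementation):

```python
import numpy as np

def rolling_mean_min_periods(a, window, min_periods):
    # trailing window ending at i: mean of the available points,
    # NaN when fewer than min_periods points fall in the window
    out = np.full(a.shape, np.nan, dtype=float)
    for i in range(a.size):
        chunk = a[max(0, i - window + 1): i + 1]
        if chunk.size >= min_periods:
            out[i] = chunk.mean()
    return out

rolling_mean_min_periods(np.array([1.0, 2.0, 3.0]), window=6, min_periods=1)
# array([1. , 1.5, 2. ])
```

With min_periods=1 no window is ever empty, so a short array never errors — which is why the bottleneck-path ValueError above is surprising.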

3. Regardless of bottleneck, dask objects ignore min_periods when used through a groupby object.

This specifically seems like an issue with .map()

(independent of bottleneck installation)

# Just apply rolling to the base array.
ds.chunk().rolling(time=6, center=False, min_periods=1).mean().compute()

>>> <xarray.DataArray (time: 3)>
>>> array([1. , 1.5, 2. ])
>>> Coordinates:
>>>   * time     (time) object 2021-01-01 00:00:00 ... 2021-01-03 00:00:00

# Group into single day climatology groups and apply
ds.chunk().groupby('time.dayofyear').map(_rolling)

>>> ValueError: For window size 6, every chunk should be larger than 3, but the smallest chunk size is 1.
>>> Rechunk your array
>>> with a larger chunk size or a chunk size that
>>> more evenly divides the shape of your array.
dcherian commented 3 years ago
# Just apply rolling to the base array.
ds.rolling(time=6, center=False, min_periods=1).mean()

I feel like this should not work i.e. rolling window length (6) > size along axis (3). So the bottleneck error seems right.

The chunk size error in the last example should go away with #4977

bradyrx commented 3 years ago

I feel like this should not work i.e. rolling window length (6) > size along axis (3). So the bottleneck error seems right.

This is normally the case, but with min_periods=1 it should just return the given value so long as there's at least one observation (as in case #2, where the boundaries return as normal and the middle number is smoothed).

Thanks for the pointer on #4977!

schild commented 2 years ago

Encountered the same problem with bottleneck's move_rank(); I have to check the length of the dataframe in advance:

from bottleneck import move_rank

vol_list = [3, 5, 10, 20, 30, 60, 90]
for i in vol_list:
    if len(last_df) >= i:
        last_df['vol_big_than_today_count_' + str(i)] = move_rank(last_df['vol'], i)
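The length check above can be folded into a small wrapper so it isn't repeated at every call site. A sketch — safe_move_rank is a hypothetical helper, and the rank function (e.g. bottleneck.move_rank) is passed in as an argument so the sketch has no hard bottleneck dependency:

```python
import numpy as np

def safe_move_rank(values, window, rank_func):
    """Apply a moving-rank function only when the data is long enough.

    Returns all-NaN when len(values) < window, mirroring the manual
    `if len(last_df) >= i` guard above.
    """
    values = np.asarray(values, dtype=float)
    if values.size < window:
        return np.full(values.size, np.nan)
    return rank_func(values, window)

# usage sketch: safe_move_rank(last_df['vol'].to_numpy(), 10, move_rank)
```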