pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.56k stars 1.07k forks source link

rolling: allow control over padding #2007

Open mathause opened 6 years ago

mathause commented 6 years ago

Code Sample, a copy-pastable example if possible

import numpy as np
import xarray as xr

x = np.arange(1, 366)
y = np.random.randn(365)
ds = xr.DataArray(y, dims=dict(dayofyear=x))

ds.rolling(center=True, dayofyear=31).mean()

Problem description

rolling cannot directly handle periodic boundary conditions (lon, dayofyear, ...), but could be very helpful to e.g. calculate climate indices. Also I cannot really think of an easy way to append the first elements to the end of the dataset and then calculate rolling.

Is there a way to do this? Should xarray support this feature?

This might also belong to SO...

rabernat commented 6 years ago

Very useful suggestion.

Also I cannot really think of an easy way to append the first elements to the end of the dataset and then calculate rolling.

We already support a different type of "rolling" with periodicity http://xarray.pydata.org/en/latest/generated/xarray.DataArray.roll.html?highlight=roll and it is straightforward to apply roll operations at the variable level: https://github.com/pydata/xarray/blob/master/xarray/core/variable.py#L1007-L1026

I suspect this would not be too hard to implement.

max-sixty commented 6 years ago

ds.rolling(center=True, dayofyear=31).mean()

What do you mean by rolling here? As a reduce operation over a rolling window, like a moving average (rolling in xarray)? Or rolling around the end of a dataarray when shifting (roll in xarray)? Or a mix of both?

Specifically, what's the dayofyear=31 doing?

mathause commented 6 years ago

Probably a mix of both - I want to compute a moving average, but with periodic boundaries. rolling sets window_length - 1 elements to nan. However I want to calculate these like so:

running_mean_0 = xr.concat([ds[-15:], ds[:16]]).mean()
running_mean_1 = xr.concat([ds[-14:], ds[:17]]).mean()

and so on...

mathause commented 6 years ago

I think what I want is like the filter function in R with circular=True.

I found two possibilities but they are quite "hand made" and certainly not very efficient

Solution with slicing:

# take the last and first elements and append/ prepend them
first = ds[:15]
last = ds[-15:]
extended = xr.concat([last, ds, first], 'dayofyear')
# do the rolling on the extended ds and get rid of NaNs
sol1 = extended.rolling(dayofyear=31, center=True).mean().dropna('dayofyear')

Solution with roll:

roll1 = ds.roll(dayofyear=150).rolling(dayofyear=31, center=True).mean()
roll2 = ds.rolling(dayofyear=31, center=True).mean()
sol2 = xr.concat([roll1, roll2], dim='r').mean('r')

Call rolling on original and rolled dataset, and put them together again.

fujiisoup commented 6 years ago

I think the implementation would be not so difficult by supporting more flexible fill_value option in xr.DataArrayRolling.construct method.

Maybe fill_value='periodic' would be a possible API,

da.rolling(dayofyear=31).construct('window', fill_value='periodic').mean('window')

@mathause, any interest in contributing?

max-sixty commented 6 years ago

Though I'm not sure you need the construct machinery.

IIUC you need to copy a window-sized amount of data from the front of the array onto the back.

You could do that with construct-like machinery, which would save a copy - though not a large copy

fujiisoup commented 6 years ago

@maxim-lian , do you agree to add this feature? Although the same behavior can be realized by adding head/tail values to the original array and truncate them after the computation, the periodic option would significantly simplify this.

max-sixty commented 6 years ago

@fujiisoup Yes for sure - I think it would be good. I think there are two salient questions:

fujiisoup commented 6 years ago

I don't think the kwarg should be called fill_value - that traditionally has a specific meaning of "the value to replace NaN with".

Agreed. dask.ghost has boundaries keyword, for which we can choose between periodic, reflect, and any constant. I think this would be a good reference. Maybe we can deprecate fill_value keyword and replace it by boundaries? (I slightly regret that I choose fill_value keyword in construct).

How it's implemented - do you have a view here?

Only a slight modification of construct machinery realizes this (see #2011). I think this option should be available only in construct method (not in the traditional rolling constructor) for the sake of simplicity (according to this comment).

max-sixty commented 6 years ago

2011 looks good - I didn't realize numpy already had pad

Agree with your other comments. Thanks as ever @fujiisoup

raybellwaves commented 6 years ago

I was going to suggest this feature so glad others are interested.

In my use case I would like to smooth a daily climatology. My colleague uses matlab and uses https://www.mathworks.com/matlabcentral/fileexchange/52688-nan-tolerant-fast-smooth

Using the slice solution as @mathause showed above, it would look something like (using code from http://xarray.pydata.org/en/stable/examples/weather-data.html#toy-weather-data)

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', '2010-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 366 - 0.28))
noise = 15 * np.random.rand(annual_cycle.size)
data = 10 + (15 * annual_cycle) + noise
da = xr.DataArray(data, coords=[times], dims='time')
#da.plot()
#Check variability at one day
#da.groupby('time.dayofyear').std('time')[0]
da_clim = da.groupby('time.dayofyear').mean('time')
_da_clim = xr.concat([da_clim[-15:], da_clim, da_clim[:15]], 'dayofyear')
da_clim_smooth = _da_clim.rolling(dayofyear=31, center=True).mean().dropna('dayofyear')
#da_clim_smooth.plot()
mathause commented 5 years ago

I am coming back to @shoyer suggestion in #2011 - your idea would be to do first a pad and then a rolling operation as e.g.:


import numpy as np
import xarray as xr

x = np.arange(1, 366)
y = np.random.randn(365)
ds = xr.DataArray(y, dims=dict(dayofyear=x))

ds.pad(dayofyear=15, mode='wrap').rolling(center=True, dayofyear=31).mean()
shoyer commented 5 years ago

I think you may need to do cropping afterwards, too, before taking the mean.

mathause commented 3 years ago

I just need to find the three warmest consecutive months from a temperature dataset for my work, so I thought I add a complete example. First, create an example dataset with monthly temperature:

import xarray as xr
import numpy as np
import pandas as pd

time = pd.date_range("2000", periods=12 * 30, freq="M")
temp = np.sin((time.month - 5) / 6 * np.pi) + np.random.randn(*time.shape) * 0.3
da = xr.DataArray(temp, dims=["time"], coords=dict(time=time))
print(da)
<xarray.DataArray (time: 360)>
array([-0.676731, -0.812742, -1.367547, ...,  0.186731,  0.237676, -0.343879])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 ... 2029-12-31

Currently we can achieve this like:

n_months = 3

monthly = da.groupby("time.month").mean()
padded = monthly.pad(month=n_months, mode="wrap")
rolled = padded.rolling(center=True, month=n_months).mean(skipna=False)
sliced = rolled.isel(month=slice(3, -3))

central_month = sliced.idxmax()

Implementing pad_mode in rolling would allow to do:

monthly = da.groupby("time.month").mean()
rolled = monthly.rolling(center=True, month=n_months, pad_mode="wrap").mean(skipna=False)

central_month = rolled.idxmax()
dcherian commented 3 years ago

I think we should do pad_kwargs so that monthly.rolling(center-True, month=n_months, pad_kwargs=dict(mode="constant", constant_values=5)} is possible.

mdgoldberg commented 3 years ago

Hello! First of all, thanks so much for those of you who contribute to xarray, I've found it super useful as an n-dimensional extension of pandas!

I was just wondering what the current state of this issue is? I'm running into exactly the issue described in https://github.com/pydata/xarray/issues/4743 which seems like a bug; that issue was closed as a dupe of this. Are we just waiting for someone to implement something here, or are there other blockers?

dcherian commented 3 years ago

This should be easy now so we just need someone to try it out.

This is where the padding happens so the kwarg needs to be passed all the way down to Variable.rolling_window https://github.com/pydata/xarray/blob/67d1955cc1d6941edd12a325090da7e4d029b84c/xarray/core/variable.py#L2132

What should the API be? .rolling(..., pad_kwargs={...}) or just have .pad(...).rolling(..., pad=False) i.e. have the user pad beforehand? The only advantage to the first one AFAICT is that the user doesn't have to specify the window length (or padding length) twice.

kmsquire commented 3 years ago

.pad(...).rolling(..., pad=False)

For this API, it seems that the only thing that would need to be implemented would be adding a pad keyword argument to rolling, defaulting to True. Is that correct?

dcherian commented 3 years ago

Yes. I think we might want this anyway; for rolling without any padding at all.

kmsquire commented 3 years ago

I added support for .rolling(..., pad=False) in #5603. The basic implementation was simple, but getting it working for bottleneck/dask took a little more work.

That fixes, e.g., #4743, but I don't think it's a complete fix for this issue.