pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Feature request: time-based rolling window functionality #3216

Open snbentley opened 5 years ago

snbentley commented 5 years ago

Hi,

I was hoping you would consider extending the rolling window functionality to time windows. As far as I can tell, the existing rolling window functions work across a fixed number of neighbouring points, not across all points within (say) an hour or a minute. This means that I can't even compute a reliable rolling mean without writing the code myself (and as I am relatively new to Python, this inevitably ends up uselessly slow).

This would extend all rolling functionality to unevenly sampled data, and to buggy data with quality gaps. It would also allow me and others to fix such data gaps by averaging and downsampling where appropriate.

(Context: basically all space physics data and probably other fields too. Really, this would need to be a centred window - I think pandas has a non-centred time window but that doesn't help much.)

Thanks for reading this! And sorry if this is already available - I couldn't find any settings for it.

(PS the multidimensionality of xarray is so useful for me, I have so many vector observations in so many different co-ordinate systems!)

dcherian commented 5 years ago

Does resample fit your needs? https://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations

snbentley commented 5 years ago

Hi, I did actually just see this - it would solve the unevenly sampled data part but really I need to identify the unphysical values that are not tagged by the quality flags first. Once that has been done then resampling and interpolation would be great - but otherwise I will be spreading the effect of bad data.

For the particular data set I am looking at, I often get individual points that sit close to the rest of the time series but are clearly outliers, so examining a rolling mean would help find these. That is the example I was hoping to solve with this query, but I have already realised that this extends to other problems I will encounter: for example, sudden jumps in the time series (for which I have been recommended to calculate rolling correlation coefficients between two time series) and multiple points jumping all over the place (for which I will probably compare the variance of groups of points and a rolling gradient).

(I really don't know why these aren't cleaned better first, but unfortunately that is the way things are)

Because I need to clean the data before any analysis, the resampling method would probably allow me to get rid of most but not all the bad data. Then I would have to be extra-cautious and throw out lots of possibly good observations just in case. I will definitely use resampling for the analysis but there are so many ways that this would be helpful at the processing stage.

mattrossman commented 4 years ago

I'm surprised this feature still hasn't made its way from pandas to xarray; it's incredibly helpful for datasets that are not evenly sampled. Resampling and then calculating an integer window size feels like an unnecessary detour for the end goal.

max-sixty commented 4 years ago

We would definitely take a PR for this; and it might not be that difficult given it's already implemented in pandas.

snbentley commented 4 years ago

This would still be very useful to me in future - for the piece of work I was referring to here I came up with a workaround. I filled in the gaps roughly with NaNs, so that I could identify and remove outliers and other bad data. Only then could I use the resample functionality without smearing these artefacts across good data.

However, my solution was quite clunky and slow and was based on the still-mostly-regular resolution of my dataset, rather than any neater general solution in pandas. As I was (and am) also relatively new to Python I did not think this was appropriate to add to xarray myself, but I would like to say that I would definitely use this functionality in future - as would the other colleagues in space physics/meteorology I mentioned this to.
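For readers who want to try something similar, here is a minimal sketch of a NaN-filling workaround of the kind described above. It is not the code used in the comment: the sample values, the 1-minute grid, the 5-sample window and the threshold of 1.0 are all assumptions chosen for illustration, and the approach only makes sense when the data are close to regularly sampled.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical, mostly regular 1-minute series: 00:03 and 00:04 are
# missing, and the sample at 00:06 is an obvious spike.
times = pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:02",
    "2020-01-01 00:05", "2020-01-01 00:06", "2020-01-01 00:07",
    "2020-01-01 00:08",
])
values = np.array([1.0, 1.1, 0.9, 1.0, 9.0, 1.1, 1.0])
da = xr.DataArray(values, coords={"time": times}, dims="time")

# Put the data on a regular grid; reindex leaves NaN in the missing slots.
full_time = pd.date_range(times[0], times[-1], freq="1min")
regular = da.reindex(time=full_time)

# On the regular grid, a 5-sample window spans 5 minutes. A centred rolling
# median is used as the baseline because it is less affected by the spike
# than a mean would be. min_periods=3 tolerates windows that contain NaNs.
baseline = regular.rolling(time=5, center=True, min_periods=3).median()

# Flag points far from the baseline (assumed threshold), then mask them.
is_outlier = np.abs(regular - baseline) > 1.0
cleaned = regular.where(~is_outlier)

flagged_times = regular.time.values[is_outlier.values]
print(flagged_times)
```

Only after a cleaning step like this would resampling be applied, so that the spike is not averaged into its neighbours.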

bhemmer commented 3 years ago

Is there a chance this might be added? I would also highly appreciate this feature.

hCraker commented 3 years ago

Hi all. This functionality can be done in xarray, but it's not a simple one-line call. Currently this sort of functionality is being added to the geocat-comp repository in PR 158 (https://github.com/NCAR/geocat-comp/pull/158), which should be merged and included in the July release in the next few weeks. @dcherian perhaps we could chat about whether this should remain in geocat-comp as is, or whether it could be done more efficiently in xarray's backend.

dcherian commented 3 years ago

@hCraker that's not right.

rolling works with fixed-length windows (a fixed number of samples), so what you are doing only works with evenly spaced data. What's being discussed here is rolling over windows that contain a varying number of samples. I don't know how to do that efficiently.
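To make the distinction concrete: with a time-based window, the number of samples in each window changes from point to point. The following is a rough NumPy sketch of a centred time-window mean for sorted 1-D data. The function name `centred_time_rolling_mean` is hypothetical and is not part of xarray or pandas; the sketch finds the window edges with a binary search and uses a cumulative sum, and it does not handle NaNs or multi-dimensional arrays.

```python
import numpy as np

def centred_time_rolling_mean(times, values, window):
    """For each sample i, average all samples with |t_j - t_i| <= window / 2.

    `times` must be sorted; `window` is a numpy timedelta64.
    Returns the means and the number of samples in each window.
    """
    times = np.asarray(times, dtype="datetime64[ns]")
    values = np.asarray(values, dtype=float)
    half = window / 2

    # Index range [lo, hi) of the samples that fall inside each window.
    lo = np.searchsorted(times, times - half, side="left")
    hi = np.searchsorted(times, times + half, side="right")

    # Window sums from a cumulative sum with a leading zero.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    counts = hi - lo  # always >= 1: each sample lies in its own window
    return (csum[hi] - csum[lo]) / counts, counts

# Unevenly spaced example: three samples, an 8-minute gap, two samples.
times = np.array(
    ["2020-01-01T00:00", "2020-01-01T00:01", "2020-01-01T00:02",
     "2020-01-01T00:10", "2020-01-01T00:11"],
    dtype="datetime64[m]",
)
values = [1.0, 2.0, 3.0, 10.0, 20.0]

means, counts = centred_time_rolling_mean(times, values, np.timedelta64(4, "m"))
print(counts)  # window sizes differ: [3 3 3 2 2]
print(means)   # [ 2.  2.  2. 15. 15.]
```

A fixed integer window of, say, three samples would instead mix the points on either side of the gap.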

hCraker commented 3 years ago

I understand now. I was referring to the part of the opening comment about averages over hour or minute windows. That can be done with a couple of lines of code, but you're right that the data has to be evenly spaced. I'm not sure how to make the varied windows work at all (let alone efficiently), so I will leave this to you all.

max-sixty commented 3 years ago

Pandas has this, so it's not intractable.

If you'd like the feature, add a 👍 to the issue or help it along by looking at what would be required / starting an implementation.
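For reference, the pandas behaviour mentioned above looks roughly like this. It is a small sketch with made-up sample data, using the offset-string form of `Series.rolling` on a `DatetimeIndex`; by default the window is trailing, i.e. it covers the interval (t - 4 min, t] for each timestamp t.

```python
import pandas as pd

# Made-up, unevenly spaced samples: three close together, then a gap.
times = pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:02",
    "2020-01-01 00:10", "2020-01-01 00:11",
])
s = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=times)

# Passing an offset string makes the window a time span rather than a
# fixed number of samples.
roller = s.rolling("4min")
trailing_mean = roller.mean()
samples_per_window = roller.count()

print(samples_per_window.tolist())  # [1.0, 2.0, 3.0, 1.0, 2.0]
print(trailing_mean)
```

Newer pandas releases (1.3 and later, as far as I know) also accept `center=True` together with an offset window; that is not exercised here. For 1-D data held in xarray, one possible stopgap is to convert with `DataArray.to_series()`, roll in pandas, and convert back with `Series.to_xarray()`.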

chiaral commented 1 year ago

Hello! Just adding a 👍 to this thread and, since it is an old issue, wondering whether this is on the xarray roadmap somewhere. Something like .rolling(time='5M') would be really valuable for many applications. Thanks so much for all your work! Chiara