tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Block bootstrapping (eg time series) #32

Closed joranE closed 4 years ago

joranE commented 6 years ago

I can't quite make out how one might use this package to do block bootstraps (eg resampling disjoint groups, or a moving/overlapping block bootstrap) that might be used for time-series data.

Am I just missing how to do that, or if not, are there any plans to add that functionality?

topepo commented 6 years ago

I've never used that (or heard of it) since I don't do a lot of time series analysis.

You can get the moving blocks via rolling_origin. Does the bootstrapping occur in one (or both) of the blocks?
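A minimal sketch of the moving blocks that rolling_origin can produce, assuming a made-up 10-row series and a fixed window of 4 rows (the data and window sizes are illustrative only):

```r
library(rsample)

# Toy series of 10 ordered observations (made-up data)
series <- data.frame(t = 1:10, y = c(1, 0, 2, 2, 3, 1, 2, 3, 3, 2))

# cumulative = FALSE keeps the analysis window at a fixed width,
# so each split is a block of 4 rows that slides forward by one row
windows <- rolling_origin(series, initial = 4, assess = 2, cumulative = FALSE)

# Row positions covered by the first two analysis blocks
first_block  <- analysis(windows$splits[[1]])$t
second_block <- analysis(windows$splits[[2]])$t
```

Consecutive blocks overlap here; these are sliding windows in time order, not a random resample of blocks.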

sheffe commented 6 years ago

(Apologies, this turned into a possibly-not-helpful novel.)

Not sure if what follows captures everything joranE wants, but a few thoughts. I would use the following designs frequently — and I think lots of social science, environmental science, etc would use these patterns pretty often. The rolling-origin approach for block bootstrapping is probably the core concept; most of what follows is just a derivation of that idea.

I can think of three cases. (1) Mixtures of time series with spatial elements (aka time-series cross-sectional data, aka panel data, aka spatiotemporal data), so you have multiple rows for each time period that capture unit observations at T. (2) You have some grouping/clustering features you might encounter (students within schools, subway lines within cities, etc) and you can plausibly assume independence between groups but not within groups. (3) A mixture of the first two cases, where you have time-series cross sectional data that also contain distinct cluster/group attributes.

In all three cases, you have a dataset where each row has index columns, for the timestamp and/or grouping columns, plus X and Y. An example for everything below, from a recent project: we’re doing climate predictions. We have monthly rainfall data at 1000 sites distributed across the US for 60 years. Our dataset is 60 years × 12 months × 1000 sites = 720,000 rows.** Y is rain in cm/month. X vars include trailing temp changes, precipitation, a dummy for El Niño, etc. We want to predict the average monthly rainfall for the next 5 years from the timestamp in each row.

(**Frequently we see cases where new sites come online after the start of the time series, because observations began later, but will mention this case further down.)

You could go two directions here.

(1) Single-loop blocked sampling. This is pretty close to createTimeSlices in caret with some wrinkles.

(2) Double-loop blocked resampling. This is the case Max asked about, where you do a regular bootstrap within the blocks. The outer loop is the single-loop design mentioned above. The inner loop is resampling for N bootstrap replicates, which looks pretty standard.
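One way the double-loop design could be sketched with existing rsample functions, assuming a made-up 36-month series, a 24-month outer window, and 5 inner replicates (all of these numbers are illustrative, not taken from the project described above):

```r
library(rsample)

set.seed(32)
# Made-up monthly series: 36 rows
rain <- data.frame(month = 1:36, y = rnorm(36))

# Outer loop: sliding 24-month analysis windows with a 6-month assessment
# period; skip = 5 moves the window forward 6 months at a time
outer <- rolling_origin(rain, initial = 24, assess = 6,
                        cumulative = FALSE, skip = 5)

# Inner loop: ordinary bootstrap of the rows inside each analysis window
inner <- lapply(outer$splits, function(split) {
  bootstraps(analysis(split), times = 5)
})
```

Note that the inner bootstrap resamples rows independently, so it does not preserve time order inside a window.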

Some cautions that apply to both cases:

(There are bundles of other useful diagnostics I can think of, but I don't think they would fall within this package's scope.)

TLDR — these cases can be complicated, but lots of important datasets fit the pattern.

If you think any of this is helpful or want it somewhere in this package or others, I’d enjoy working through it, but it would be my first OSS contribution. Do you have a guide for contributors or practices you like?

joranE commented 6 years ago

I think rolling_origin probably covers what I would think of as a moving block bootstrap for time series, thanks. Additionally, I think I figured out how one would do a simple (disjoint) block resample, just by using nested data frames, i.e.

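library(rsample); library(dplyr); library(tidyr)
x <- as_tibble(mtcars) %>% nest(data = -cyl)  # one row per cyl group
bootstraps(x, times = 10)                     # resamples groups, not cases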

so you are bootstrapping the cyl groups, rather than the individual cases. And yeah, as @sheffe mentions things could get fairly complicated with resampling within the blocks as well, and combining simple and moving blocks, but maybe that's out of scope.
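To get back to case-level data from a group-level resample, one option is to unnest the analysis set of a split. A small sketch, assuming the same mtcars example (the seed and object names are arbitrary):

```r
library(rsample)
library(dplyr)
library(tidyr)

set.seed(32)
x <- as_tibble(mtcars) %>% nest(data = -cyl)   # 3 rows, one per cyl value
boots <- bootstraps(x, times = 10)

# The analysis set of a split holds 3 sampled groups (with replacement)
first_groups <- analysis(boots$splits[[1]])

# Expand the sampled groups back to individual cars
first_rep <- first_groups %>% unnest(data)
```

Because groups differ in size, the number of rows in first_rep varies from one replicate to the next.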

Another case to consider where this sort of complication comes up is (2d) spatial data, where again you might want to resample (potentially overlapping) spatial "blocks" defined by actual squares. This is a little more involved probably than the moving origin concept, and I don't actually have much experience with it. But basically these things arise when you want your resampling to preserve some element of correlation in your data, usually in time or space.
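A rough base R sketch of the simpler, disjoint version of spatial blocking, assuming made-up points on the unit square and an arbitrary 0.25 grid cell size (overlapping blocks would need a different construction):

```r
set.seed(32)
# Made-up point data on the unit square
pts <- data.frame(x = runif(200), y = runif(200), z = rnorm(200))

# Assign each point to a 0.25 x 0.25 grid cell (disjoint square blocks)
cell <- 0.25
pts$block <- paste(floor(pts$x / cell), floor(pts$y / cell), sep = "_")

# Resample whole blocks with replacement and stack their rows
by_block <- split(pts, pts$block)
picked <- sample(names(by_block), size = length(by_block), replace = TRUE)
spatial_rep <- do.call(rbind, unname(by_block[picked]))
```

Points that share a cell always travel together, which is what preserves the local spatial correlation.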

sheffe commented 6 years ago

@joranE that nest syntax is beautiful, thanks. Occurs to me, seeing your code, that it should be possible to do this over time/space like this too; something like

x <- as_tibble(df) %>% mutate(index = some_decision_rules(date, space)) %>% nest(data = -index) %>% bootstraps(times = 10)

and you can define some_decision_rules to be pretty custom to the problem as long as it spits out a unique ID for a group. I will try this over the weekend and write up a real-data example if it works.

(Hm. Could also exercise more control over the outer loop setup with a tidyr::crossing(rule1 = foo, rule2 = foo) in lieu of the call to mutate, to get all possible combinations -- that's probably a better solution for the moving blocks with specified time structures, and the other way would work for randomly-sampled spatial or cluster blocking. I'll play with it.)

topepo commented 6 years ago

Some of this could be accomplished using nested_cv and/or group_vfold_cv.
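For the disjoint-group case, a short sketch of group_vfold_cv, assuming mtcars with cyl as the grouping column (chosen only to match the example earlier in the thread):

```r
library(rsample)

set.seed(32)
# With v left unset, one fold is created per distinct group
folds <- group_vfold_cv(mtcars, group = "cyl")

# Each assessment set holds exactly one cyl group
held_out <- lapply(folds$splits, function(s) unique(assessment(s)$cyl))
```

This holds whole groups out rather than resampling them with replacement, so it is cross-validation over blocks, not a block bootstrap.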

I haven't come up with a duration-based rolling origin forecasting method yet but that's on the horizon =]

> I will try this over the weekend and write up a real-data example if it works.

That would be great. Even better if we could include it as a pkgdown article or blog post.

sheffe commented 6 years ago

@topepo Just coming up for air after a project. I'd be happy to write that up for a pkgdown or blog piece if it works. I'm optimistic. I am looking for a good public dataset that isn't too large, but it is likely easier/more self-contained to simulate a small handful of datasets reflecting each design problem. Does that strategy work for you?

topepo commented 6 years ago

@sheffe That sounds great.

Keep in mind that the pkgdown article does not have to be super computationally efficient (to a degree). Not all of the articles get translated into vignettes, so we can work with realistically sized data sets. The data would need to be somewhere public (e.g. a package or GitHub repo).

asbates commented 5 years ago

Maybe I'm misunderstanding but in the discussions here I don't see what I would think of as a time series block bootstrap. Once the blocks are defined, which could be done with rolling_origin, a block bootstrap would then do a bootstrap on the blocks themselves, not a bootstrap within the blocks.

For example, if we had data X_1, X_2, ..., X_10 then with a block length of 2, a single bootstrap version of the time series might look like (X_3, X_4), (X_9, X_10), (X_5, X_6), (X_4, X_5), (X_3, X_4). That is, block 1 would contain X_1, X_2; block 2 would contain X_2, X_3; and so on. Then we would randomly sample the blocks with replacement, maintaining the order the blocks are sampled as well as the order within the blocks. This would be a moving block bootstrap, sort of the fundamental time series bootstrap method. There are a lot of other methods that have different properties but they are all based on this idea.
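The procedure described above can be sketched in base R. This is not an rsample function; moving_block_bootstrap is a hypothetical helper written only for illustration:

```r
# Moving block bootstrap of a vector (hypothetical helper, base R only)
moving_block_bootstrap <- function(x, block_length) {
  n <- length(x)
  # Overlapping blocks start at positions 1, 2, ..., n - block_length + 1
  n_starts <- n - block_length + 1
  # Enough blocks to cover the original length
  n_blocks <- ceiling(n / block_length)
  # Sample block start positions with replacement
  picked <- sample.int(n_starts, size = n_blocks, replace = TRUE)
  # Expand each start into its run of consecutive indices, in sampled order
  idx <- unlist(lapply(picked, function(s) s:(s + block_length - 1)))
  # Trim to the original length in case the last block overshoots
  x[idx[seq_len(n)]]
}

set.seed(32)
x <- paste0("X_", 1:10)
res <- moving_block_bootstrap(x, block_length = 2)
```

Order is preserved inside each block and the blocks are kept in the order they were drawn, so res is a series of the same length made of 5 adjacent pairs.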

This might be possible using a combination of the existing functions in the package, but I haven't been able to figure it out. Regardless, I think there would be a real benefit to having a separate function. Please let me know if I can help out in any way.

topepo commented 4 years ago

I don't think that I'll be able to get to this any time soon; please feel welcome to put in a PR if you want.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.