Block bootstrapping (eg time series)

joranE commented 6 years ago

I can't quite make out how one might use this package to do block bootstraps (e.g. resampling disjoint groups, or a moving/overlapping block bootstrap) of the kind used for time-series data.

Am I just missing how to do that, or if not, are there any plans to add that functionality?

topepo commented 6 years ago

I've never used that (or heard of it) since I don't do a lot of time series analysis.

You can get the moving blocks via rolling_origin. Does the bootstrapping occur in one (or both) of the blocks?
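For readers unfamiliar with it, here is a minimal sketch of what rolling_origin produces. The toy series and the window sizes are made up for illustration; only rolling_origin, analysis and assessment come from rsample.

```r
library(rsample)

# Toy series of 24 ordered observations (hypothetical data).
series <- data.frame(t = 1:24, y = cumsum(rnorm(24)))

# Fixed-width moving windows: 12 rows for analysis, the next 3 for
# assessment; cumulative = FALSE slides the window instead of growing it.
ro <- rolling_origin(series, initial = 12, assess = 3, cumulative = FALSE)

first <- ro$splits[[1]]
range(analysis(first)$t)    # time span of the first analysis block
range(assessment(first)$t)  # time span of the first assessment block
```

Each row of ro holds one pair of adjacent, ordered blocks, which is why the question of where the bootstrapping happens (across blocks or inside one of them) matters.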

sheffe commented 6 years ago

(Apologies, this turned into a possibly-not-helpful novel.)

Not sure if what follows captures everything joranE wants, but a few thoughts. I would use the following designs frequently — and I think lots of social science, environmental science, etc would use these patterns pretty often. The rolling-origin approach for block bootstrapping is probably the core concept; most of what follows is just a derivation of that idea.

I can think of three cases. (1) Mixtures of time series with spatial elements (aka time-series cross-sectional data, aka panel data, aka spatiotemporal data), so you have multiple rows for each time period that capture unit observations at T. (2) You have some grouping/clustering features you might encounter (students within schools, subway lines within cities, etc) and you can plausibly assume independence between groups but not within groups. (3) A mixture of the first two cases, where you have time-series cross sectional data that also contain distinct cluster/group attributes.

In all three cases, you have a dataset where each row has index columns (the timestamp and/or grouping columns) plus X and Y. Here is a running example for everything below, from a recent project. We’re doing climate predictions. We have monthly rainfall data at 1000 sites distributed across the US for 60 years. Our dataset is 60 years × 12 months × 1000 sites = 720,000 rows.** Y is rain in cm/month. X vars include trailing temp changes, precipitation, a dummy for El Niño, etc. We want to predict the average monthly rainfall for the next 5 years from the timestamp in each row.

(**Frequently we see cases where new sites come online after the start of the time series because observations began later; I mention this case further down.)

You could go two directions here.

(1) Single-loop blocked sampling. This is pretty close to createTimeSlices in caret with some wrinkles.

You split test/train into N non-overlapping, ordered time blocks that are defined by the unique timestamps and grab all observed groups/units matching the timestamps for test/train respectively. This could be rolling origin, OR always beginning from t0, OR a uniformly-sampled time length for the train and test blocks (usually sampled with a defined min/max for the uniform distribution). Each has pluses and minuses; I can do a quick literature search if useful.
For the time data, you may need to impose some interval to separate test and train (rather than just making them non-overlapping). In the example above (“predict next 5 year average”) your X vars could end in 2000, but you’re making a prediction using label information out to 2005. So you would want (minimally) a 5-year separation between the test and train blocks. That could be a defined separation, or an interval uniformly sampled within some min/max constraints (in this case, min 5/max 10 for example).
You could split test/train into time blocks, but also split by group blocks: e.g. east of the Mississippi river in train, west of the Mississippi river in test; or a rule like “no two groups separated by <N km can show up in both test and train”; or (like the test scores/schools/teacher example) we can’t have teachers at the same school in both test and train. This can be pretty complicated, especially with spatial rules; it is less complex with known discrete groups/clusters.
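To make the time-block idea with a separation interval concrete, here is a rough base-R sketch. The helper time_block_splits and the toy panel are hypothetical, written only for illustration; it is one possible reading of the design above (rolling origin on unique timestamps, fixed gap), not an rsample function.

```r
# Hypothetical helper: rolling-origin splits defined on the unique
# timestamps, with a fixed gap between the train and test blocks.
time_block_splits <- function(time, train_len, gap, test_len) {
  periods <- sort(unique(time))
  last_start <- length(periods) - (train_len + gap + test_len) + 1
  if (last_start < 1) stop("not enough time periods for one split")
  lapply(seq_len(last_start), function(s) {
    train_periods <- periods[s:(s + train_len - 1)]
    test_from <- s + train_len + gap
    test_periods <- periods[test_from:(test_from + test_len - 1)]
    # All rows (i.e. all sites/units) observed in a period go together.
    list(train = which(time %in% train_periods),
         test  = which(time %in% test_periods))
  })
}

# Toy panel: 3 sites observed yearly from 1990 to 2009 (made-up data).
panel <- expand.grid(site = c("A", "B", "C"), year = 1990:2009)
splits <- time_block_splits(panel$year, train_len = 8, gap = 5, test_len = 2)

range(panel$year[splits[[1]]$train])  # years in the first train block
range(panel$year[splits[[1]]$test])   # years in the first test block
```

A group rule (e.g. keeping certain sites out of one side) could be layered on by filtering the returned row indices; sampling the gap or block lengths from a uniform distribution would replace the fixed arguments.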

(2) Double-loop blocked resampling. This is the case Max asked about, where you do a regular bootstrap within the blocks. The outer loop is the single-loop design mentioned above. The inner loop is resampling for N bootstrap replicates, which looks pretty standard.
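A minimal sketch of the double-loop idea, assuming rolling_origin for the outer loop and an ordinary bootstraps call on each analysis block for the inner loop. The data and all window sizes are invented. Note that the inner bootstrap resamples rows independently, so it ignores any serial dependence inside a block.

```r
library(rsample)

set.seed(123)
series <- data.frame(t = 1:30, y = rnorm(30))  # toy data

# Outer loop: ordered time blocks (skip = 4 thins the number of splits).
outer <- rolling_origin(series, initial = 20, assess = 5,
                        skip = 4, cumulative = FALSE)

# Inner loop: a standard bootstrap of the rows in each analysis block.
inner <- lapply(outer$splits, function(s) {
  bootstraps(analysis(s), times = 3)
})

length(inner)     # one set of bootstrap replicates per outer split
inner[[1]]$splits # the replicates drawn from the first analysis block
```

rsample's nested_cv wraps a similar outer/inner pattern and may be a tidier way to set this up.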

Some cautions that apply to both cases:

Vars can get pretty sparse unless you have a lot of data, so the checks for NZV and linear combinations can get hard. Upsampling/downsampling for class balance, same problem.
A special case of that issue: If you have different start/end dates for observations, like “my weather station A was built in 1983, B in 1992, C in 1997” etc, then a test/train split might want to make sure that both sets contain some observations of every cluster ID if you want to include cluster as a predictor to capture hierarchical-ness of the data.
Pretty often, you see cases where the distributions of X and Y features in test and train look nothing alike due to short or idiosyncratic block selections, and you often want diagnostics beyond the 5-number summary of performance metrics from caret. I find it helpful to compare the distributions of both your X and label vars between the test and train splits with e.g. a Kolmogorov-Smirnov test for similarity of distribution. Think about e.g. climate change, where we know that the underlying system is changing and that test set X/labels may be substantially unlike the training set X/labels. In most cases that distribution (dis)similarity can be a helpful visualization / diagnostic on its own, and I often check model performance scores to see if certain fits are relatively better/worse when the test/train distributions are substantially more/less similar.
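The distribution check in the last point can be sketched with base R's ks.test. The simulated vectors below are stand-ins for a label (or feature) column in one train/test split; the shift of 1.5 is arbitrary.

```r
set.seed(42)

# Stand-ins for the same variable in a train block and two test blocks.
train_y        <- rnorm(200, mean = 0)
test_y_similar <- rnorm(200, mean = 0)    # same underlying distribution
test_y_shifted <- rnorm(200, mean = 1.5)  # the system has drifted

ks_similar <- ks.test(train_y, test_y_similar)
ks_shifted <- ks.test(train_y, test_y_shifted)

# The KS statistic D is the largest gap between the two empirical CDFs;
# a larger D means the train and test distributions are less alike.
c(similar = unname(ks_similar$statistic),
  shifted = unname(ks_shifted$statistic))
```

Recording D per split alongside the performance metric makes it possible to see whether poor scores line up with splits whose test data look unlike the training data.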

(There are bundles of other useful diagnostics I can think of, but I don't think they would fall within this package's scope.)

TLDR — these cases can be complicated, but lots of important datasets fit the pattern.

If you think any of this is helpful or want it somewhere in this package or others, I’d enjoy working through it, but it would be my first OSS contribution. Do you have a guide for contributors or practices you like?

joranE commented 6 years ago

I think rolling_origin probably covers what I would think of as a moving block bootstrap for time series, thanks. Additionally, I think I figured out how one would do a simple (disjoint) block resample, just by using nested data frames, i.e.

library(tidyr)
library(rsample)
x <- as_tibble(mtcars) %>% nest(data = -cyl)
bootstraps(x, times = 10)

so you are bootstrapping the cyl groups, rather than the individual cases. And yeah, as @sheffe mentions things could get fairly complicated with resampling within the blocks as well, and combining simple and moving blocks, but maybe that's out of scope.
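To see that whole groups travel together, one replicate can be unpacked again. This is only a sketch; it uses the nest(data = -cyl) spelling of current tidyr, and the list column is therefore named data.

```r
library(tidyr)
library(rsample)

set.seed(1)
x <- nest(tibble::as_tibble(mtcars), data = -cyl)  # one row per cyl group
boots <- bootstraps(x, times = 10)

# The analysis set of a replicate has one row per sampled group;
# a group drawn twice appears twice.
groups <- analysis(boots$splits[[1]])

# Unnesting restores the individual cars of every sampled group.
one <- unnest(groups, cols = data)
table(one$cyl)
```

Every count in the table is a multiple of that group's size in mtcars, since a group is either taken whole (possibly several times) or not at all.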

Another case to consider where this sort of complication comes up is (2d) spatial data, where again you might want to resample (potentially overlapping) spatial "blocks" defined by actual squares. This is a little more involved probably than the moving origin concept, and I don't actually have much experience with it. But basically these things arise when you want your resampling to preserve some element of correlation in your data, usually in time or space.

sheffe commented 6 years ago

@joranE that nest syntax is beautiful, thanks. Occurs to me, seeing your code, that it should be possible to do this over time/space like this too; something like

x <- as_tibble(df) %>% mutate(index = some_decision_rules(date, space)) %>% nest(data = -index)
bootstraps(x, times = 10)

and you can define some_decision_rules to be pretty custom to the problem as long as it spits out a unique ID for a group. I will try this over the weekend and write up a real-data example if it works.

(Hm. Could also exercise more control over the outer loop setup with a tidyr::crossing(rule1 = foo, rule2 = foo) in lieu of the call to mutate, to get all possible combinations -- that's probably a better solution for the moving blocks with specified time structures, and the other way would work for randomly-sampled spatial or cluster blocking. I'll play with it.)

topepo commented 6 years ago

Some of this could be accomplished using nested_cv and/or group_vfold_cv.
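As a small illustration of the group_vfold_cv route, here is a sketch on invented data (the schools data frame is hypothetical). It covers the "no group on both sides of a split" rule discussed earlier, not the time-ordering part.

```r
library(rsample)

set.seed(11)
# Made-up data: 6 schools with 4 students each.
schools <- data.frame(
  school = rep(paste0("s", 1:6), each = 4),
  score  = rnorm(24)
)

# Whole schools are assigned to folds, so no school is split
# between the analysis and assessment sets.
folds <- group_vfold_cv(schools, group = "school", v = 3)

s <- folds$splits[[1]]
unique(assessment(s)$school)  # schools held out in the first fold
intersect(analysis(s)$school, assessment(s)$school)  # schools on both sides
```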

I haven't come up with a duration-based rolling origin forecasting method yet but that's on the horizon =]

> I will try this over the weekend and write up a real-data example if it works.

That would be great. Even better if we could include it as a pkgdown article or blog post.

sheffe commented 6 years ago

@topepo Just coming up for air after a project. I'd be happy to write that up for a pkgdown or blog piece if it works. I'm optimistic. I am looking for a good public dataset that isn't too large, but it is likely easier/more self-contained to simulate a small handful of datasets reflecting each design problem. Does that strategy work for you?

topepo commented 6 years ago

@sheffe That sounds great.

Keep in mind that the pkgdown article does not have to be super computationally efficient (to a degree). Not all of the articles get translated into vignettes so we can work with real sized data sets. The data would need to be somewhere public (e.g. a package or github repo).

asbates commented 5 years ago

Maybe I'm misunderstanding but in the discussions here I don't see what I would think of as a time series block bootstrap. Once the blocks are defined, which could be done with rolling_origin, a block bootstrap would then do a bootstrap on the blocks themselves, not a bootstrap within the blocks.

For example, if we had data X_1, X_2, ..., X_10 then with a block length of 2, a single bootstrap version of the time series might look like (X_3, X_4), (X_9, X_10), (X_5, X_6), (X_4, X_5), (X_3, X_4). That is, block 1 would contain X_1, X_2; block 2 would contain X_2, X_3; and so on. Then we would randomly sample the blocks with replacement, maintaining the order the blocks are sampled as well as the order within the blocks. This would be a moving block bootstrap, sort of the fundamental time series bootstrap method. There are a lot of other methods that have different properties but they are all based on this idea.
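A minimal base-R sketch of the moving block bootstrap described above. The function name and arguments are hypothetical (this is not part of rsample): overlapping blocks of length block_len are drawn with replacement by their start position and concatenated in the order drawn.

```r
# Hypothetical helper: one moving-block-bootstrap replicate of a series.
moving_block_bootstrap <- function(x, block_len) {
  n <- length(x)
  stopifnot(block_len >= 1, block_len <= n)
  n_blocks <- ceiling(n / block_len)
  # Block i covers positions i, ..., i + block_len - 1, so there are
  # n - block_len + 1 overlapping candidate blocks.
  starts <- sample.int(n - block_len + 1, n_blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  x[idx[seq_len(n)]]  # trim so the replicate has the original length
}

set.seed(2024)
x <- 1:10  # stands in for X_1, ..., X_10
replicate_1 <- moving_block_bootstrap(x, block_len = 2)
matrix(replicate_1, nrow = 2)  # one sampled block per column
```

Within every column the two values are consecutive, which is the within-block ordering the method preserves; only the order of the blocks is random.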

This might be able to be done using a combination of the existing functions in the package but I haven't been able to figure it out. Regardless, I think there would be a real benefit to having a separate function. Please let me know if I can help out in any way.

topepo commented 4 years ago

I don't think that I'll be able to get to this any time soon; please feel welcome to put in a PR if you want.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

tidymodels / rsample

Block bootstrapping (eg time series) #32