Account for hierarchical structure

lminer commented 6 years ago

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

lminer commented 6 years ago

@jtibshirani I'd be happy to give this a try. Do you know of any paper/other repo that might give me a sense about how I might need to alter the sampling strategy after passing in the requisite information?

swager commented 6 years ago

Thanks, @lminer! First, a few thoughts as preliminaries:

The point estimates given by the current GRF implementation should be fine even with hierarchical structure; however, the confidence intervals clearly need to be adjusted for failure of the IID assumption, like you said.
As described in Section 5.1 here (https://arxiv.org/pdf/1610.01271v3.pdf), the core idea of our confidence intervals is built around a half-sampling variance estimator. In fact, the estimator (51) is just pure half-sampling (but is computationally intractable); then (52) develops a computationally tractable "bootstrap of little bags" approximation to (51) following Sexton and Laake (2009).
To account for non-IID data, the half-sampling behind the computationally infeasible optimal estimator (51) needs to be changed in the usual way (e.g., half-sample towns rather than individuals); however, the "little bags" approximation and resulting Monte Carlo correction can be left unchanged.
Aside: the simplest example of this Monte Carlo correction in the code is in RegressionPredictionStrategy::compute_variance in this file.

So anyways, noting these preliminaries, you can think of the forest as just doing a half-sampling bootstrap (along with some tricks that make it computationally tractable, but that don't matter from the perspective of the IID/non-IID question). Thus, to make the CIs robust to non-IID data, we just need to make the "usual" modification to half-sampling bootstrap, i.e., cluster similar items during half-sampling.

The way we implement the bootstrap of little bags in the code is that we train trees in groups of size ci_group_size, and each of these tree groups only uses data sampled from the same half sample. You can see these half-samples being generated here. At the very least, you'll need to modify this line (and pipe down the required group information needed to do so). I'll let @jtibshirani chime in in case there's anything else we'd need to worry about?

lminer commented 6 years ago

@swager thanks for such a detailed exposition. Seems fairly straightforward if we're only clustering at one level. Do you have a sense of what the procedure would be if we have multiple levels that are nested? I'm assuming non-nested multi-way clustering is complicated and I shouldn't bother with that.

nredell commented 6 years ago

@lminer, I have not looked at the code for either of these R packages so I can't speak to bias or CI coverage, but the articles were good reads. The REEMtree package handles a variety of nested or clustered relationships both cross-sectional and longitudinal: https://cran.r-project.org/web/packages/REEMtree/index.html.

And glmertree handles nested data if isolating a treatment effect across subgroups is what you're after: https://cran.r-project.org/web/packages/glmertree/index.html

lminer commented 6 years ago

Still trying to find an authoritative article on this. The only thing that I can find about boostrap standard errors when there are multiple levels is this. It suggests that if you had two levels like city and school, you would first randomly sample a city and then from within that city you would randomly sample a school and use that as your unit of sampling. Does this seem right?

lminer commented 6 years ago

@swager, @jtibshirani I propose implementing the following.

We'll add two extra arguments

clusters: (optional argument) a vector of numbers or factors, specifying the clusters
samples_per_cluster: (optional argument) the number of observations to be sampled from each sampled cluster

We're only going to implement clustering at a single level. In order to alleviate issues associated with clusters of wildly different sizes, the clustering will work as follows. If samples_per_cluster is not specified, we will set it equal to the size of the smallest cluster. When sampling for the bootstrap, we will half-sample by clusters. Rather than taking all observations in a sampled cluster, we will take a sample from selected cluster of size samples_per_cluster.

One question. Does it make sense to provide this option for all the forests in the package: instrumental, causal, quantile, regression? Yes, right?

swager commented 6 years ago

Sounds great, thanks! One minor thing: How about calling the second argument samples_per_cluster instead? And yes, we should provide the option to all the forest types.

lminer commented 6 years ago

@swager Digging into this more, I see that there are several different sampling options.

bootstrap
bootstrap without replacement
bootstrap weighted
bootstrap weighted without replacement

Do I need to implement each of these options for the hierarchical use case or is plain bootstrap enough?

Also, after I've chosen the clusters, how should I sample observations from the clusters? With replacement or without replacement?

lminer commented 6 years ago

@swager I think I've got the basic implementation down for training. Do I need to make any adjustments for prediction?

swager commented 6 years ago

Great! Unless I'm missing something, I think prediction should be OK as already implemented. As to your previous point (sorry for missing it earlier), everything in GRF so far runs on bootstrap sampling without replacement, so that's the most important case.

lminer commented 6 years ago

@swager that makes it easier. Last two questions. Should I sample from within the clusters with or without replacement? For the oob sample, do I include observations from sampled clusters that haven't themselves been sampled?

swager commented 6 years ago

It's most consistent with the rest if all sampling is without replacement (including within clusters). For the OOB sample, we should only include observations from clusters that haven't been used at all (because an OOB sample is supposed to be independent from the tree that was grown; and, in case of cluster-wide correlations, a sample may be correlated with a tree prediction whenever the tree used a training example from the same cluster as the sample).

Finally, for the same reason: In the case of subsample splitting for honesty, we should make sure that we split the subsample along cluster boundaries (i.e., all samples from the same cluster end up in the same half).

lminer commented 6 years ago

Got it. For the OOB sample, do we include all observations in a cluster or just samples_per_cluster?

lminer commented 6 years ago

@swager now that we need to subsample along cluster boundaries, I have another few questions. Basically, where in the code do I need to make this change?

Do I only subsample along clusters after checking for honesty here in treetrainer.cpp where we explicitly check for honest?
Or do we also subsample along clusters before then. So here as well.
Finally, do we only do this when calculating confidence intervals (tree.train called from train_ci_group), or should we also subsample along clusters for the point estimates?

swager commented 6 years ago

We should subsample along clusters in all cases during training, including those in the first two bullets.

Then, given that during training we did all subsampling along clusters, the prediction code should be able to run verbatim (including confidence intervals). The reason for this is that the uncertainty quantification at prediction time is driven by the sampling in the training phase, so if the sampling is already cluster-robust, the uncertainty quantification will also be cluster-robust.

grf-labs / grf

Account for hierarchical structure #178