Clustered standard error

K-Maehashi commented 2 years ago

Hello @mwelz, thanks for this great package! I have a feature request: it would be wonderful if this package had a clustered SE option. (if we do a block randomization by clusters when we make splits, does it become a clustered SE?)

mwelz commented 2 years ago

Dear @K-Maehashi, thanks for your interest in our package. The package already supports clustered standard errors in the BLP and GATES regressions. You can specify this option via the setup_vcov() function; an example follows below.

Concerning your next question:

(if we do a block randomization by clusters when we make splits, does it become a clustered SE?)

It's hard to make a generally valid statement about whether or not to adjust standard errors for clustering. I recommend this working paper by Abadie et al. (2017) for a detailed discussion; perhaps it will be of help to you.

# please make sure to have the latest version of GenericML installed
library("GenericML")

## generate data
set.seed(1)
n  <- 200                                  # number of observations
p  <- 5                                    # number of covariates
D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
Z  <- matrix(runif(n*p), n, p)             # design matrix
Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
Y1 <- 2 + Y0                               # potential outcome under treatment
Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

# randomly sample cluster membership
cluster <-  sample(1:5, n, replace = TRUE)

# specify cluster-robust standard errors via vcovCL() in 'sandwich'
# and pass the clusters as arguments
vcov_BLP <- setup_vcov(estimator = "vcovCL",
                       arguments = list(cluster = cluster))
vcov_GATES <- vcov_BLP # same for GATES

# run GenericML (few splits to keep computation time low)
x <- GenericML(Z, D, Y, learners_GenericML = "lasso", num_splits = 10, 
               vcov_BLP = vcov_BLP, vcov_GATES = vcov_GATES, parallel = FALSE)
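
To get a feel for what the vcovCL() estimator does, it can help to try it on an ordinary linear model first. The following is only a rough sketch under stated assumptions: the simulated data, the within-cluster shock u, and the model Y ~ D are made up for illustration and are not the BLP or GATES regressions that GenericML fits internally. It assumes the 'sandwich' package is installed and skips the comparison otherwise.

```r
# Illustrative only: heteroskedasticity-robust vs. cluster-robust standard
# errors on a plain linear model (hypothetical data, not GenericML output).
set.seed(1)
n       <- 200                                # number of observations
cluster <- sample(1:5, n, replace = TRUE)     # cluster membership
D       <- rbinom(n, 1, 0.5)                  # random treatment assignment
u       <- rnorm(5)[cluster]                  # assumed shock shared within a cluster
Y       <- 1 + 2 * D + u + rnorm(n)           # outcome with within-cluster correlation
fit     <- lm(Y ~ D)

if (requireNamespace("sandwich", quietly = TRUE)) {
  # standard errors that ignore the cluster structure
  se_hc <- sqrt(diag(sandwich::vcovHC(fit, type = "HC1")))
  # standard errors that allow for correlation within clusters
  se_cl <- sqrt(diag(sandwich::vcovCL(fit, cluster = cluster)))
  print(rbind(HC1 = se_hc, clustered = se_cl))
}
```

Whether the clustered version is the appropriate one for a given design is the question discussed in Abadie et al. (2017); the sketch only shows how the two estimators are called.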

K-Maehashi commented 2 years ago

@mwelz Fantastic! I should have read the README file more carefully. Thank you so much for the detailed explanation. (This package works pretty well with ~20,000 obs. and I like the visualization plots! Somehow it eats a lot of memory but this package has everything I need!)

mwelz commented 2 years ago

@K-Maehashi Thanks for the kind feedback! Memory consumption can indeed be an issue, in particular for large datasets. Here are a few tips to save memory in GenericML():

- If you use parallelism (parallel = TRUE), choose a lower number of cores via num_cores; this may save memory (at the expense of longer computation time).
- Set the argument store_learners = FALSE.
- Set the argument prop_aux to a value smaller than the default of 0.5. This assigns fewer observations to the auxiliary set. Since the memory-intensive estimation of the proxy learners takes place on this set, a smaller auxiliary set may save memory.
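
As a rough sketch, the three tips above could be combined in a single call as follows. The concrete values (two cores, prop_aux = 0.3) and the simulated data are assumptions chosen purely for illustration, not recommendations; the call is skipped if GenericML is not installed.

```r
# Illustrative values only; tune them to your data and machine.
set.seed(1)
n <- 200                                           # number of observations
p <- 5                                             # number of covariates
D <- rbinom(n, 1, 0.5)                             # random treatment assignment
Z <- matrix(runif(n * p), n, p)                    # design matrix
Y <- as.numeric(Z %*% rexp(p) + rnorm(n)) + 2 * D  # observed outcome

# memory-related arguments mentioned in the tips above
memory_args <- list(
  parallel       = TRUE,   # parallelism switched on ...
  num_cores      = 2L,     # ... but restricted to few cores (assumed value)
  store_learners = FALSE,  # as in the second tip
  prop_aux       = 0.3     # assumed value below the default of 0.5
)

if (requireNamespace("GenericML", quietly = TRUE)) {
  x <- do.call(
    GenericML::GenericML,
    c(list(Z = Z, D = D, Y = Y,
           learners_GenericML = "lasso", num_splits = 10),
      memory_args)
  )
}
```

How much memory this actually saves will depend on the learners and the size of the data, so it is worth monitoring usage on a small number of splits first.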

We might look into making GenericML() more memory-efficient in a future release; we have not yet optimized it for that.

mwelz commented 2 years ago

This issue hasn't seen activity in the last eight days, so I'll close it now. Please feel free to re-open if you think this issue hasn't been solved properly.

mwelz / GenericML

Clustered standard error #7