lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

HC2, HC3, and Heteroskedasticity-Unbiased Standard Errors #414

Closed kylebutts closed 1 year ago

kylebutts commented 1 year ago

Hi Laurent,

I'm writing to see if there is interest in a pull request to add HC2, HC3, and heteroskedasticity-unbiased (HU) standard errors to fixest. HC2 and HC3 are useful for small samples (e.g. see http://datacolada.org/99); HU is useful for high-dimensional models. For example, to get unbiased estimates of the variance of teacher fixed effects you need an HU estimator.

A little summary (for completeness)

HC1 estimates $\hat{\Omega}$ as a diagonal matrix with $t^{th}$ element $\frac{n}{n - k} \hat{u}_{t}^2$.

HC2 replaces this with $\hat{u}_{t}^2 / (1-h_t)$, where $h_t$ is the $t^{th}$ diagonal element of the projection matrix.

HC3 with $\hat{u}_{t}^2 / (1-h_t)^2$.

HU uses $y_t \hat{u}_t / (1 - h_t)$.
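For completeness, the four corrections above can be written out in a few lines of base R on a toy lm() fit (a sketch only; the variable names are mine, and this is not the proposed fixest implementation):

```r
# Toy illustration of the HC1/HC2/HC3/HU weights on a small lm() fit
# (base R only; a sketch, not the proposed fixest implementation)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

u <- resid(fit)       # residuals u_t
h <- hatvalues(fit)   # leverages h_t: diagonal of X (X'X)^{-1} X'
k <- length(coef(fit))

w_hc1 <- n / (n - k) * u^2   # HC1 weight
w_hc2 <- u^2 / (1 - h)       # HC2 weight
w_hc3 <- u^2 / (1 - h)^2     # HC3 weight
w_hu  <- y * u / (1 - h)     # HU weight

# Sandwich: V = (X'X)^{-1} X' diag(w) X (X'X)^{-1}
X <- model.matrix(fit)
bread <- solve(crossprod(X))
vcov_hc2 <- bread %*% crossprod(X * w_hc2, X) %*% bread
```

Each variant only swaps the diagonal "meat" weights; the sandwich around them is identical.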

Therefore the main thing needed is the ability to quickly calculate the diagonal of the projection matrix. We already have the code to make the sparse model matrix (from your help in github.com/kylebutts/did2s). In low dimensions, or with just fixed effects, it's super fast to get the $h_t$ vector exactly (by leveraging sparsity). With large $n$, covariates, and high-dimensional fixed effects, $h_t$ can be approximated by a Johnson-Lindenstrauss approximation (from Kline, Saggio, and Sølvsten). I have code for both the exact and the approximate calculation of $h_t$, but I could probably go to a lower level and parallelize it for more speedup.
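The Johnson-Lindenstrauss idea can be sketched in a few lines of base R (illustrative only; variable names are mine, and fixest's eventual implementation need not look like this). Since $P = X(X'X)^{-1}X'$ is symmetric and idempotent, $E[(Pr)_t^2] = P_{tt}$ for a Rademacher vector $r$, so averaging $(Pr)_t^2$ over $p$ draws approximates the leverages:

```r
# Approximating leverages h_t by random projection (base R sketch)
set.seed(42)
n <- 200; k <- 5
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))
XtX_inv <- solve(crossprod(X))

# Exact leverages for comparison: diag of X (X'X)^{-1} X'
h_exact <- rowSums((X %*% XtX_inv) * X)

p <- 2000
h_approx <- numeric(n)
for (j in seq_len(p)) {
  r <- sample(c(-1, 1), n, replace = TRUE)  # Rademacher weights
  z <- XtX_inv %*% crossprod(X, r)          # (X'X)^{-1} X' r
  h_approx <- h_approx + as.numeric(X %*% z)^2 / p
}

max(abs(h_approx - h_exact))  # shrinks as p grows
```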

Let me know if this would be a good fit within fixest; otherwise I can make a small package that just calculates $V$, which can then be "inserted" back into the feols object via summary.

kylebutts commented 1 year ago

Initial implementation is very fast: https://gist.github.com/kylebutts/ea05599a99b86e29ad2f39956e339ab4

I think I know how to speed it up a bunch. The main step is solving a linear model with a high-dimensional X, so I think fixest could be useful to speed that up (relative to a sparse matrix solver). However, doing it this way is slower:

est = feols(y ~ x | fe1 + fe2, data = df)

X = did2s:::sparse_model_matrix(df, est)
X = cbind(X$mat_RHS, X$mat_FE)
p = 1000
n = nrow(df)

# Using fixest to solve for Z
P_ii = rep(0, nrow(X))
for (j in 1:p) {
  # Draw the jth n-vector of Rademacher weights
  df$Rp_j <- rbinom(n, 1, 0.5) * 2 - 1

  # Z_j = (X'X)^{-1} X' Rp_j
  est_Rp_j = feols(Rp_j ~ x | fe1 + fe2, df)
  Zj = c(coef(est_Rp_j), unlist(fixef(est_Rp_j)))

  # || X Z_j ||^2 / p
  P_ii = P_ii + as.numeric(X %*% Zj)^2 / p
}

I think I can use fixest's built-in multiple estimation to speed this up by generating many Rp_j at a time and using multiple estimation. For memory reasons, though, I can't generate all $p$ vectors at once with large $n$. I'm wondering if you think there's a way to do multiple estimation without needing to have every outcome variable pre-generated? I think that could speed this up to fixest speeds :-)
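One way to bound memory (a base-R sketch of the chunking idea; variable names are mine, and the inner solve could be swapped for a fixest-based step) is to draw the Rademacher vectors in blocks of m columns, solve all m regressions of a block at once, and accumulate:

```r
# Chunked Rademacher draws: only an n x m block is ever in memory
# (base R sketch; fixest could replace the solve step)
set.seed(7)
n <- 200; k <- 5
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))
XtX_inv <- solve(crossprod(X))

p <- 2000   # total number of draws
m <- 200    # chunk size
P_ii <- numeric(n)
for (start in seq(1, p, by = m)) {
  R <- matrix(sample(c(-1, 1), n * m, replace = TRUE), n, m)
  Z <- XtX_inv %*% crossprod(X, R)      # solve the m regressions jointly
  P_ii <- P_ii + rowSums((X %*% Z)^2) / p
}

# Exact leverages for comparison
h_exact <- rowSums((X %*% XtX_inv) * X)
```

The per-chunk cost is one matrix solve instead of m vector solves, with memory bounded by the n x m block.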

s3alfisc commented 1 year ago

Hi @kylebutts - not sure if I misunderstand your question, but I think you could simply use fixest::demean(), which runs the basic MAP algorithm for a matrix X and a list of fixed effects f and (as the function name says) returns X demeaned by f. This way, you would call demean() only once on the rhs of your model and get the demeaned covariates. In a second step, you could either loop over all p vectors of Rademacher weights, or alternatively loop over matrices of Rademacher weights. As the demeaning of each dependent variable is independent of all the others, it does not matter whether you demean them individually or in chunks. (This is where fixest's multiple-estimation speedups for OLS stem from - if a variable has been demeaned in the first OLS estimation, those results can be cached and the demeaning step skipped for the second model. Note that I am saying this without understanding the fixest C++ source code, but this must be how it's done 😅.)

library(fixest)
N <- 1000
k <- 10
X <- matrix(rnorm(N*k, 0, 1), N, k)
f <- sample(1:10, N, TRUE)

res_loop <- X
for(i in 1:k){
  res_loop[,i] <- fixest::demean(X[,i], f)
}

res_all <- fixest::demean(X, f)

all.equal(res_loop, res_all)
#[1] TRUE

kylebutts commented 1 year ago

Closed in favor of #418