lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

CR3 Standard Errors #481

Open TomBearpark opened 3 months ago

TomBearpark commented 3 months ago

Hi, and thanks for a fantastic package!

@s3alfisc recently helped me with the following function, which calculates CR3 standard errors for a fixest object.

The CR3 SEs can be estimated with a jackknife; see Equation 18 of MacKinnon et al. (2023).
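For reference, the jackknife formula the function below implements is (paraphrasing the paper's notation)

$$\widehat{V}_{\mathrm{CR3}} = \frac{G-1}{G} \sum_{g=1}^{G} \left(\hat{\beta}^{(g)} - \hat{\beta}\right)\left(\hat{\beta}^{(g)} - \hat{\beta}\right)^{\top},$$

where $\hat{\beta}^{(g)}$ is the coefficient vector estimated with cluster $g$ left out and $G$ is the number of clusters.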

The summclust package provides support for these SEs, but it doesn't work on fixest objects that are estimated with unit-specific trends.

@s3alfisc suggested posting here, in case it's useful to others or could be incorporated into fixest in the future.

Thanks so much!

vcov3 = function(object, cluster){

  # retrieve the data used in the estimation and the cluster variable
  data = fixest:::fetch_data(object)
  data = as.data.frame(data)
  cluster = data[, cluster]
  unique_cluster = as.vector(unique(cluster))
  G = length(unique_cluster)

  # reuse the original call, but re-estimate on the sample that leaves out cluster g
  obj_call = object$call
  obj_call$data = quote(data[cluster != g, ])

  # jackknife: drop one cluster at a time and center at the full-sample coefficients
  beta_centered = lapply(unique_cluster, function(g) coef(eval(obj_call)) - coef(object))

  # CR3 variance: (G - 1) / G times the sum of outer products of the centered coefficients
  vcov = lapply(seq_along(unique_cluster), function(g) tcrossprod(as.matrix(beta_centered[[g]])))
  vcov = Reduce("+", vcov) * (G - 1) / G
  vcov
}
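
A quick usage sketch (assuming the trade data shipped with fixest and clustering on Origin; the returned matrix should be usable through summary()'s vcov argument):

library(fixest)

fit <- feols(Euros ~ log(dist_km) | Origin + Destination, data = trade)
V_cr3 <- vcov3(fit, "Origin")   # CR3 covariance matrix, clustered by Origin
summary(fit, vcov = V_cr3)      # report the CR3 standard errors
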
s3alfisc commented 3 months ago

@lrberge, I am happy to prepare a pull request for this in the future, or to help @TomBearpark prepare it; I recall you mentioned that you'd eventually provide guidance on how to incorporate new vcov estimators. I believe the implementation of the CR3 estimator is so simple (and so valuable) that it merits being part of fixest. One other reason to include it is that for Tom's use case, sandwich::vcovBS() - which also implements the jackknife - fails as well, for similar reasons as summclust, which borrows its data / cluster variable preprocessing from sandwich. I think both are unlikely to change, because they rely on a more general data-preprocessing approach that is not tailored to fixest.

kylebutts commented 3 months ago

Very slick code @TomBearpark. I have a hunch that using est_env and setting the weights of the left-out cluster to 0 might avoid some overhead in this case.
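
Roughly, the idea would be something like this (an untested sketch; it assumes a fit on the trade data where no observations were dropped, so the data rows line up with the estimation sample):

# sketch of the est_env idea (untested): re-use the estimation environment
# and zero out the weights of the left-out cluster
fit <- feols(Euros ~ log(dist_km) | Origin + Destination, data = trade)
env <- update(fit, only.coef = TRUE, only.env = TRUE)  # estimation environment, coefficients only
w   <- rep(1, fit$nobs)
w[trade$Origin == "FR"] <- 0                           # hypothetical: leave out cluster "FR"
beta_minus_FR <- est_env(env, weights = w)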

I wonder if an argument to feols that operates like fsplit but does leave-one-out estimation (say, leave_out) might be generally useful, e.g. for JIVE estimation.

kylebutts commented 3 months ago

Nerd-sniped into writing this. It's almost as fast as summclust::vcov_CR3J now. It's a bit slower since summclust uses smart algebra and demeans by the cluster fixed effect to reduce the amount of computation (though I could imagine settings with high-dimensional fixed effects where this version is faster). The advantage of this function is that it relies entirely on fixest for the estimation, so it should be more robust?

library(fixest)
library(summclust)

vcov3_orig <- function(x, cluster) {
  data <- fixest:::fetch_data(x)
  # data = as.data.frame(data) # This copies data needlessly
  cluster <- data[, cluster]
  unique_cluster <- as.vector(unique(cluster))
  G <- length(unique(cluster))
  fml_all <- x$fml_all
  obj_call <- x$call
  obj_call$data <- quote(data[cluster != g, ])

  beta_centered <- lapply(c(unique_cluster), function(g) coef(eval(obj_call)) - coef(x))
  vcov <- lapply(seq_along(unique_cluster), function(g) tcrossprod(as.matrix(beta_centered[[g]])))
  vcov <- Reduce("+", vcov) * (G - 1) / G
  vcov
}

vcov3 <- function(x, cluster, type = c("CRV3", "CRV3J")) {
  type <- match.arg(type)

  # retrieve the estimation data and the cluster variable (formula or column name)
  data <- fixest:::fetch_data(x)
  if (inherits(cluster, "formula")) cluster <- all.vars(cluster)
  if (inherits(cluster, "character")) {
    cluster <- data[[cluster]]
  }
  # keep only the observations actually used in the estimation
  if (length(x$obs_selection) > 0) {
    cluster <- cluster[unlist(x$obs_selection), drop = FALSE]
  }

  unique_cluster <- unique(cluster)
  G <- length(unique_cluster)

  # re-use the estimation environment; each jackknife replication sets the
  # weights of the left-out cluster to 0 instead of re-building the model
  core_env <- update(x, only.coef = TRUE, only.env = TRUE)
  w <- if (is.null(x$weights)) rep(1, x$nobs) else x$weights
  beta_jack <- lapply(unique_cluster, function(c) {
    w_update <- w
    w_update[cluster == c] <- 0
    est_env(core_env, weights = w_update)
  })

  # CRV3 centers at the full-sample coefficients, CRV3J at the mean of the
  # jackknife coefficients
  if (type == "CRV3J") {
    beta_center <- Reduce("+", beta_jack) / G
  } else if (type == "CRV3") {
    beta_center <- coef(x)
  }

  # accumulate the outer products of the centered jackknife coefficients
  vcov <- matrix(0, nrow = length(beta_center), ncol = length(beta_center))
  vcov <- Reduce(
    function(V_sum, b) {
      V_sum + tcrossprod(b - beta_center)
    },
    beta_jack,
    init = vcov
  )
  vcov <- vcov * (G - 1) / G
  colnames(vcov) <- rownames(vcov) <- names(beta_center)

  return(vcov)
}

fit <- feols(Euros ~ log(dist_km) | Origin + Destination, data = trade)
bench::mark(
  vcov3_orig(fit, "Origin"),
  vcov3(fit, ~Origin),
  vcov_CR3J(fit, cluster = ~Origin, type = "CRV3", absorb_cluster_fixef = TRUE)[1, 1, drop = FALSE],
  vcov_CR3J(fit, cluster = ~Origin, type = "CRV3", absorb_cluster_fixef = FALSE)[1, 1, drop = FALSE]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 "vcov3_orig(fit, \"Origin\")"     273.2ms 304.2ms      3.29   131.2MB     21.4
#> 2 "vcov3(fit, ~Origin)"             103.3ms 103.8ms      9.60    44.7MB     15.4
#> 3 "vcov_CR3J(fit, cluster = ~Origi…  62.9ms  65.6ms     13.2     97.7MB     58.7
#> 4 "vcov_CR3J(fit, cluster = ~Origi…  90.2ms  92.2ms      9.54   119.5MB     53.4

Created on 2024-03-26 with reprex v2.1.0

s3alfisc commented 3 months ago

Wow, super cool @kylebutts! This is the type of problem where R really shines with functional programming and non-standard evaluation! Yes, there are two advantages: this implementation is fixest-native and lets users do everything fixest allows: it works with varying slopes, IVs, and non-linear models. None of this would be available through sandwich::vcovBS, and sandwich fails in the same spot as summclust, as both use more general methods to retrieve the cluster variable (which fail for some edge cases with fixest). I did some benchmarks a while back, and with very high-dimensional fixed effects the above should indeed be faster than summclust. Though I am no longer sure whether this still holds after our update to use sparse matrices in summclust 😀
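
For instance, the kind of specification Tom mentions should go through directly with the function above (a rough, untested sketch using origin-specific trends via varying slopes on the trade data):

# unit-specific trends via fixest varying slopes: the case where the
# summclust / sandwich preprocessing fails, but the fixest-native
# jackknife above should still work (sketch, untested)
fit_trend <- feols(Euros ~ log(dist_km) | Origin[Year] + Destination, data = trade)
vcov3(fit_trend, ~Origin)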

kylebutts commented 3 months ago

@s3alfisc I don't know if you saw this, but in the newest version of fixest, you can now access the original dataset with an exported function: https://github.com/lrberge/fixest/issues/465 and https://lrberge.github.io/fixest/reference/fixest_data.html
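
Presumably the internal fixest:::fetch_data() call in the snippets above could then be swapped for the exported accessor, e.g.:

# presumably equivalent (untested): use the exported accessor instead of the internal one
data <- fixest_data(fit)   # the original data used in the estimation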