Open TomBearpark opened 7 months ago
@lrberge, I am happy to prepare a pull request for this in the future or help @TomBearpark prepare it; I recall that you mentioned that eventually you'd provide guidance on how to incorporate new vcov estimators. I believe the implementation of the CR3 estimator is so simply (and so valuable) that it merits to be a part of fixest
. One other reason to include it is that for Tom's use case, sandwich::vcovBS()
- which also implements the jackknife - fails as well, for similar reasons as summclust
, which borrows it's data / cluster variable preprocessing implementation from sandwich
. I think that both are unlikely to change because they require a more general data preprocessing solution not necessarily tailored to fixest
.
Very slick code @TomBearpark. I have a hunch using est_env
+ putting 0 as weights for a cluster might avoid some overhead in this case.
I wonder if an argument to feols
that operates like fsplit
but instead does leave-out estimation leave_out
might be generally useful, e.g. JIVE estimation.
Nerd-snipped into writing this. It's almost as fast as summclust::vcov_CR3J
now. It's a bit slower since summclust
uses smart algebra and demeaning by cluster FE to simplify the amount of calculations needed to be done (though I could imagine settings where it's faster with HDFE). The advantage of this function is that it relies completely on fixest
for estimation, so should be more robust?
library(fixest)
library(summclust)
vcov3_orig <- function(x, cluster) {
data <- fixest:::fetch_data(x)
# data = as.data.frame(data) # This copies data needlessly
cluster <- data[, cluster]
unique_cluster <- as.vector(unique(cluster))
G <- length(unique(cluster))
fml_all <- x$fml_all
obj_call <- x$call
obj_call$data <- quote(data[cluster != g, ])
beta_centered <- lapply(c(unique_cluster), function(g) coef(eval(obj_call)) - coef(x))
vcov <- lapply(seq_along(unique_cluster), function(g) tcrossprod(as.matrix(beta_centered[[g]])))
vcov <- Reduce("+", vcov) * (G - 1) / G
vcov
}
vcov3 <- function(x, cluster, type = c("CRV3", "CRV3J")) {
type <- match.arg(type)
data <- fixest:::fetch_data(x)
if (inherits(cluster, "formula")) cluster <- all.vars(cluster)
if (inherits(cluster, "character")) {
cluster <- data[[cluster]]
}
if (length(x$obs_selection) > 0) {
cluster <- cluster[unlist(x$obs_selection), drop = FALSE]
}
unique_cluster <- unique(cluster)
G <- length(unique_cluster)
core_env <- update(x, only.coef = TRUE, only.env = TRUE)
w <- if (is.null(x$weights)) rep(1, x$nobs) else x$weights
beta_jack <- lapply(unique_cluster, function(c) {
w_update <- w
w_update[cluster == c] <- 0
est_env(core_env, weights = w_update)
})
if (type == "CRV3J") {
beta_center <- Reduce("+", beta_jack) / G
} else if (type == "CRV3") {
beta_center <- coef(x)
}
vcov <- matrix(0, nrow = length(beta_center), ncol = length(beta_center))
vcov <- Reduce(
function(V_sum, b) {
V_sum + tcrossprod(b - beta_center)
},
beta_jack,
init = vcov
)
vcov <- vcov * (G - 1) / G
colnames(vcov) <- rownames(vcov) <- names(beta_center)
return(vcov)
}
fit <- feols(Euros ~ log(dist_km) | Origin + Destination, data = trade)
bench::mark(
vcov3_orig(fit, "Origin"),
vcov3(fit, ~Origin),
vcov_CR3J(fit, cluster = ~Origin, type = "CRV3", absorb_cluster_fixef = TRUE)[1, 1, drop = FALSE],
vcov_CR3J(fit, cluster = ~Origin, type = "CRV3", absorb_cluster_fixef = FALSE)[1, 1, drop = FALSE]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 "vcov3_orig(fit, \"Origin\")" 273.2ms 304.2ms 3.29 131.2MB 21.4
#> 2 "vcov3(fit, ~Origin)" 103.3ms 103.8ms 9.60 44.7MB 15.4
#> 3 "vcov_CR3J(fit, cluster = ~Origi… 62.9ms 65.6ms 13.2 97.7MB 58.7
#> 4 "vcov_CR3J(fit, cluster = ~Origi… 90.2ms 92.2ms 9.54 119.5MB 53.4
Created on 2024-03-26 with reprex v2.1.0
Wow super cool @kylebutts! This is the type of problem where R really shines with functional programming and non standard evaluation! Yes, there are two advantages: this implementation is fixest native and allows users to do everything that fixest allows them to do: it works with varying slopes, IVs and non-linear models. None of this would not be available through sandwich::vcovBS, but sandwich fails in the same spot as summclust as both use more general methods to retrieve the cluster variable (which fails for some edge cases with fixest). I did some benchmarks back in the day and with very high dimensional fixed effects, the above should indeed be faster than summclust. Though I am no longer sure about if this still holds after our update to use sparse matrices with summclust 😀
@s3alfisc I don't know if you saw this, but in the newest version of fixest
, you can now access the original dataset with an exported function: https://github.com/lrberge/fixest/issues/465 and https://lrberge.github.io/fixest/reference/fixest_data.html
Hi, and thanks for a fantastic package!
@s3alfisc recently helped me with the following function, which calculates CR3 standard errors for a
fixest
object.The CR3 SE can be estimated using a Jackknife, see Equation 18 of MacKinnon et al (2023).
The summclust package provides support for these SEs, but it doesn't work on fixest objects that are estimated with unit specific trends.
@s3alfisc suggested posting here, in case its useful to others, or could be incorporated into
fixest
in the future.Thanks so much!