lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

Speed of sandwich (HC2 or HC3) for larger datasets #386

Closed kennchua closed 1 year ago

kennchua commented 1 year ago

I understand that sandwich does not officially support fixest for the HC2/HC3 vcov types. One workaround, though, is to specify the fixed effects as factor variables within the feols formula.

For example, the following works:

library('tidyverse')
library('fixest')
library('sandwich')
library('nycflights13')
data("ChickWeight", package = "datasets") # small dataset
data("flights", package = "nycflights13") # larger dataset

lm_fit <- lm(formula = weight ~ Time + factor(Diet), data = ChickWeight)

crse_hc2 <- sqrt(diag(sandwich::vcovCL(lm_fit, cluster = ~ Chick, type = "HC2")))

feols_fit_hc2 <- feols(weight ~ Time + factor(Diet), data = ChickWeight,
                       vcov = \(x) sandwich::vcovCL(x, cluster = ~ Chick, type = "HC2"))

# compare the lm-based and feols-based standard errors side by side
rbind(crse_hc2, se(feols_fit_hc2))

However, I noticed that with larger datasets the sandwich computation becomes very slow or does not finish at all, for both lm and feols fits.

lm_fit <- lm(arr_delay ~ log(distance) + factor(month) + factor(carrier),
             data = flights) # arr_delay has missing values

crse_hc2 <- sqrt(diag(sandwich::vcovCL(lm_fit, cluster = flights |> 
                                         drop_na(arr_delay) |> # arr_delay has missing values
                                         pull(month), 
                                       type = "HC2"))) # takes too long to run

feols_fit <- feols(arr_delay ~ log(distance) + factor(month) + factor(carrier),
                   data = flights)

sandwich::vcovCL(feols_fit, cluster = (flights |> drop_na(arr_delay) |> pull(month)),
                 type = "HC2") # takes too long to run

Is there a solution to computing small-sample cluster-robust standard errors (HC2, HC3 etc.) for bigger datasets?

I've tried looking into other packages like dfadjust (HC2) and summclust (CRV3), and they seem to compute faster than sandwich. As a workaround, I pass the vcov matrix generated by these packages to feols.
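The workaround of handing fixest a covariance matrix computed elsewhere can be sketched as follows. This is a minimal sketch, assuming that the `vcov` argument of `summary()` for fixest objects accepts a plain covariance matrix (the fixest documentation describes this, but check your installed version). `V_ext` and `fit_ext` are hypothetical names, and the sandwich matrix only stands in for one produced by dfadjust or summclust.

```r
library(fixest)
library(sandwich)

data("ChickWeight", package = "datasets")

# Same specification estimated with lm and with feols
lm_fit    <- lm(weight ~ Time + factor(Diet), data = ChickWeight)
feols_fit <- feols(weight ~ Time + factor(Diet), data = ChickWeight)

# Externally computed covariance matrix (stand-in for dfadjust / summclust output)
V_ext <- sandwich::vcovCL(lm_fit, cluster = ~ Chick, type = "HC2")

# Assumption: a matrix passed as `vcov` is used as-is by fixest
fit_ext <- summary(feols_fit, vcov = V_ext)

# Standard errors reported by fixest next to those implied by the matrix
rbind(fixest = se(fit_ext), external = sqrt(diag(V_ext)))
```

The matrix must have one row and column per estimated coefficient, in the same order as `coef(feols_fit)`, which is why the fixed effects are kept as `factor()` terms in both models.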

Thank you!

s3alfisc commented 1 year ago

Hi @kennchua, for fast CRV3 inference (cluster-robust and HC3), check either sandwich::vcovJK(), which is identical to the CRV3 estimator if you choose center = 'estimate', or summclust::vcov_CR3J(), which also implements the CRV3 estimator as a jackknife. One implementation may be faster than the other depending on the context; I have not yet systematically benchmarked the two against each other. I am not aware of a fast implementation of HC2/CRV2, although Niccodemi & Wansbeek have had some ideas. I hope this helps!
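To make the "CRV3 as a jackknife" idea concrete, here is a minimal base-R sketch of the leave-one-cluster-out computation, centered at the full-sample estimate. It is not the code used by sandwich or summclust, which are considerably more efficient; the names `cw`, `beta_loo`, `vcov_cr3` and `se_cr3` are hypothetical, and refitting `lm()` once per cluster is only practical when the number of clusters is small.

```r
data("ChickWeight", package = "datasets")
cw <- as.data.frame(ChickWeight)
cw$Chick <- as.character(cw$Chick)   # plain cluster ids

fml <- weight ~ Time + factor(Diet)
beta_hat <- coef(lm(fml, data = cw)) # full-sample estimate

clusters <- unique(cw$Chick)
G <- length(clusters)

# Re-estimate the model leaving out one cluster at a time;
# the result is a (coefficients x clusters) matrix
beta_loo <- sapply(clusters, function(g) {
  coef(lm(fml, data = cw[cw$Chick != g, ]))
})

# Deviations from the full-sample estimate (beta_hat is recycled down each column)
dev <- beta_loo - beta_hat

# Jackknife covariance: (G - 1) / G times the sum of outer products of the deviations
vcov_cr3 <- (G - 1) / G * tcrossprod(dev)
se_cr3 <- sqrt(diag(vcov_cr3))
se_cr3
```

With many clusters or a large design matrix, the packaged implementations mentioned above avoid refitting the whole model for every cluster, which is where their speed comes from.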

Edit:

Note that sandwich::vcovJK() is currently only available in the development version, which you can install via install.packages("sandwich", repos = "https://R-Forge.R-project.org").

kennchua commented 1 year ago

Thank you @s3alfisc. I will look into these solutions. Appreciate the help and the great work on summclust::vcov_CR3J!