s3alfisc / summclust

R module for cluster specific information (as in the Stata summclust module)
https://s3alfisc.github.io/summclust/
Other
6 stars 2 forks source link

Sparse implementation of `vcov_CR3J` #19

Closed kylebutts closed 1 year ago

kylebutts commented 1 year ago
  1. Adds the option sparse to vcov_CR3J. This is substantially faster (see reprex), but (1) requires fixest::sparse_model_matrix to be merged (https://github.com/lrberge/fixest/pull/418) and (2) results in different beta_jack. This is fine for the vcov, but makes summclust not match Stata. All tests passed and I added a set of tests to make sure sparse/dense versions match each other.

  2. Added a little speedup to partial-leverage. In R, you basically never want to use solve(A) %*% b. It's an order of magnitude slower than Matrix::solve(A, b).

  3. Made two modifications to existing tests. For CR3J_vs_HC3, I dropped nlswork to 100 rows. It's much faster and effectively the same test. For the plot and the summary, I added a quietly function so it doesn't print out a bunch of results and open the plot pane when testing.


vcov_CR3J sparse example:

library(fixest)
library(Matrix)
library(summclust)
library(devtools)
#> Loading required package: usethis
# load_all()
# devtools::check(vignettes = FALSE, env_vars = c())

nlswork <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta")

# drop NAs at the moment
nlswork <- nlswork[, c(
  "ln_wage", "grade", "age", "birth_yr",
  "union", "race", "msp", "ind_code"
)]
nlswork <- na.omit(nlswork)

# 30 clusters
nlswork$group <- (seq_len(nrow(nlswork)) - 1) %% 30 + 1

lm_fit <- lm(
  ln_wage ~ 0 + factor(union) + factor(msp) + factor(race),
  data = nlswork
)
feols_fit <- feols(
  ln_wage ~ 0 + i(union) + i(msp) + i(race),
  data = nlswork
)
obj <- feols_fit

vcovCR3_lm <- vcov_CR3J(lm_fit, ~group, type = "CRV3")
sqrt(diag(vcovCR3_lm))[c("factor(union)0", "factor(union)1", "factor(msp)1", "factor(race)2", "factor(race)3")]
#> factor(union)0 factor(union)1   factor(msp)1  factor(race)2  factor(race)3 
#>    0.004831951    0.006739998    0.006764424    0.006009833    0.028693133

tictoc::tic("Sparse")
vcovCR3_sparse <- vcov_CR3J(feols_fit, ~group, type = "CRV3", sparse = TRUE)
sqrt(diag(vcovCR3_sparse))
#>    union::0    union::1      msp::1     race::2     race::3 
#> 0.004831951 0.006739998 0.006764424 0.006009833 0.028693133
tictoc::toc()
#> Sparse: 0.15 sec elapsed

tictoc::tic("Dense")
vcovCR3_dense <- vcov_CR3J(feols_fit, ~group, type = "CRV3", sparse = FALSE)
sqrt(diag(vcovCR3_dense))
#>    union::0    union::1      msp::1     race::2     race::3 
#> 0.004831951 0.006739998 0.006764424 0.006009833 0.028693133
tictoc::toc()
#> Dense: 1.989 sec elapsed

Created on 2023-06-10 with reprex v2.0.2


solve(A) %*% b vs. solve(A, b)

n <- 1000
A <- matrix(runif(n^2), nrow = n, ncol = n)
b <- runif(n)

bench::mark(
  as.numeric(solve(A) %*% b),
  as.numeric(solve(A, b)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as.numeric(solve(A) %*% b)    405ms    410ms      2.42      23MB    5.65 
#> 2 as.numeric(solve(A, b))       126ms    126ms      7.88    7.67MB    0.876

Created on 2023-06-10 with reprex v2.0.2

s3alfisc commented 1 year ago

Hi @kylebutts , thanks so much for this PR. The speed gains look amazing. I have taken a first look at the code, and I think it looks excellent, and I will be more than happy to accept this PR once the sparse model matrix makes it into fixest.

I have a few comments, which I am listing here (but I will also comment in the PR). Some of these, I might just fix myself as I go through the code line by line - in case I do this, I will mark it as done (I hope that's ok with you?).

Other general comments I had:

I'll play around with the code a little bit more and might add to this list. Anyways, thanks a lot - this is a really exciting PR! =)

Best, Alex

kylebutts commented 1 year ago

MPinv from VCA package produces the same results as MASS::ginv. It passes all the tests and should work now. However, still using sparse produces different results on summclust.

The reason I pushed onto this PR is that vcov_CR3J.fixest now works fully without densifying (may be a made up word) matrices