lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors #418

Open kylebutts opened 1 year ago

kylebutts commented 1 year ago

I've added a sparse_model_matrix implementation.

Some development notes:

  1. The code started from what we had. I reworked the way the fixed-effect matrices are generated so that they (i) work with time-varying slopes and (ii) work when sparse_model_matrix is passed a formula instead of a fixest object. I use fixef_terms to get the fe_vars and slope_var_list names; then prepare_df gives me the variables (with ^ interactions). From there, building the sparse matrix follows approximately what you wrote:
library(fixest)
sp = sparse_model_matrix(
  wt ~ hp | cyl + cyl^gear + cyl[mpg], 
  mtcars,
  type = "fixef",
  collin.rm = FALSE
)
sp
#> 32 x 14 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 14 column names 'cyl::4', 'cyl::6', 'cyl::8' ... ]]
#>                                           
#>  [1,] . 1 .  .   21.0  .   . . . . 1 . . .
#>  [2,] . 1 .  .   21.0  .   . . . . 1 . . .
#>  [3,] 1 . . 22.8  .    .   . 1 . . . . . .
#>  [4,] . 1 .  .   21.4  .   . . . 1 . . . .
#>  [5,] . . 1  .    .   18.7 . . . . . . 1 .
#>  [6,] . 1 .  .   18.1  .   . . . 1 . . . .
#>  [7,] . . 1  .    .   14.3 . . . . . . 1 .
#>  [8,] 1 . . 24.4  .    .   . 1 . . . . . .
#>  [9,] 1 . . 22.8  .    .   . 1 . . . . . .
#> [10,] . 1 .  .   19.2  .   . . . . 1 . . .
#> [11,] . 1 .  .   17.8  .   . . . . 1 . . .
#> [12,] . . 1  .    .   16.4 . . . . . . 1 .
#> [13,] . . 1  .    .   17.3 . . . . . . 1 .
#> [14,] . . 1  .    .   15.2 . . . . . . 1 .
#> [15,] . . 1  .    .   10.4 . . . . . . 1 .
#> [16,] . . 1  .    .   10.4 . . . . . . 1 .
#> [17,] . . 1  .    .   14.7 . . . . . . 1 .
#> [18,] 1 . . 32.4  .    .   . 1 . . . . . .
#> [19,] 1 . . 30.4  .    .   . 1 . . . . . .
#> [20,] 1 . . 33.9  .    .   . 1 . . . . . .
#> [21,] 1 . . 21.5  .    .   1 . . . . . . .
#> [22,] . . 1  .    .   15.5 . . . . . . 1 .
#> [23,] . . 1  .    .   15.2 . . . . . . 1 .
#> [24,] . . 1  .    .   13.3 . . . . . . 1 .
#> [25,] . . 1  .    .   19.2 . . . . . . 1 .
#> [26,] 1 . . 27.3  .    .   . 1 . . . . . .
#> [27,] 1 . . 26.0  .    .   . . 1 . . . . .
#> [28,] 1 . . 30.4  .    .   . . 1 . . . . .
#> [29,] . . 1  .    .   15.8 . . . . . . . 1
#> [30,] . 1 .  .   19.7  .   . . . . . 1 . .
#> [31,] . . 1  .    .   15.0 . . . . . . . 1
#> [32,] 1 . . 21.4  .    .   . 1 . . . . . .

colnames(sp)
#>  [1] "cyl::4"        "cyl::6"        "cyl::8"        "cyl[[mpg]]::4"
#>  [5] "cyl[[mpg]]::6" "cyl[[mpg]]::8" "cyl^gear::4_3" "cyl^gear::4_4"
#>  [9] "cyl^gear::4_5" "cyl^gear::6_3" "cyl^gear::6_4" "cyl^gear::6_5"
#> [13] "cyl^gear::8_3" "cyl^gear::8_5"
  2. When no data argument is passed, the original data is used. When na.rm = TRUE, all rows with NAs in the original regression are dropped. If data is passed (even if it's the same data), only the NAs in the particular part of the model matrix are dropped (e.g. if type = "lhs" is used, then only NAs on the LHS are dropped). Alternatively, I could update the regression object, but I decided against it since that would cause confusion, e.g. "why are rows in y being dropped? I don't have any NAs there".

  3. You can pass a formula directly to sparse_model_matrix without even estimating, for type %in% c("lhs", "rhs", "fixef"), but if you want na.rm = TRUE or collin.rm = TRUE, then the model needs to be estimated. This is useful as a "sparse matrix factory" function.

  4. I didn't implement subset. I can, but the code seems complicated.

  5. I copied all the tests from model.matrix and added a few more pertaining to fixed effects.

  6. I didn't dive into getting poly() to work when data is passed. I could add a rough warning whenever "poly" appears in the formula and data is passed?

  7. There is also the option combine which, when FALSE, returns a named list with a sparse matrix for each element in type. (A sketch of this and of the NA behavior in point 2 follows these notes.)
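
A minimal sketch of points 2 and 7 (not run here; the na.rm, data, and combine arguments and the expected row counts follow the notes above):

library(fixest)

mt = mtcars
mt$wt[1] = NA                    # introduce an NA on the RHS
est = feols(mpg ~ wt | cyl, mt)  # row 1 is dropped during estimation

# no data passed: rows with NAs anywhere in the original regression are dropped
m1 = sparse_model_matrix(est, type = "lhs", na.rm = TRUE)
# nrow(m1) should be 31

# data passed (even the same data): only NAs in the requested part are dropped,
# so the LHS keeps all 32 rows since mpg has no NAs
m2 = sparse_model_matrix(est, data = mt, type = "lhs", na.rm = TRUE)
# nrow(m2) should be 32

# combine = FALSE: a named list with one sparse matrix per element of type
sp = sparse_model_matrix(est, type = c("rhs", "fixef"), combine = FALSE)
# sp should hold separate "rhs" and "fixef" matrices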

Examples

library(fixest)
est = feols(mpg ~ drat | cyl, mtcars)

sparse_model_matrix(est, type = "lhs")
#> 32 x 1 sparse Matrix of class "dgCMatrix"
#>        mpg
#>  [1,] 21.0
#>  [2,] 21.0
#>  [3,] 22.8
#>  [4,] 21.4
#>  [5,] 18.7
#>  [6,] 18.1
#>  [7,] 14.3
#>  [8,] 24.4
#>  [9,] 22.8
#> [10,] 19.2
#> [11,] 17.8
#> [12,] 16.4
#> [13,] 17.3
#> [14,] 15.2
#> [15,] 10.4
#> [16,] 10.4
#> [17,] 14.7
#> [18,] 32.4
#> [19,] 30.4
#> [20,] 33.9
#> [21,] 21.5
#> [22,] 15.5
#> [23,] 15.2
#> [24,] 13.3
#> [25,] 19.2
#> [26,] 27.3
#> [27,] 26.0
#> [28,] 30.4
#> [29,] 15.8
#> [30,] 19.7
#> [31,] 15.0
#> [32,] 21.4

sparse_model_matrix(est, type = c("rhs", "fixef"))
#> 32 x 4 sparse Matrix of class "dgCMatrix"
#>       drat cyl::4 cyl::6 cyl::8
#>  [1,] 3.90      .      1      .
#>  [2,] 3.90      .      1      .
#>  [3,] 3.85      1      .      .
#>  [4,] 3.08      .      1      .
#>  [5,] 3.15      .      .      1
#>  [6,] 2.76      .      1      .
#>  [7,] 3.21      .      .      1
#>  [8,] 3.69      1      .      .
#>  [9,] 3.92      1      .      .
#> [10,] 3.92      .      1      .
#> [11,] 3.92      .      1      .
#> [12,] 3.07      .      .      1
#> [13,] 3.07      .      .      1
#> [14,] 3.07      .      .      1
#> [15,] 2.93      .      .      1
#> [16,] 3.00      .      .      1
#> [17,] 3.23      .      .      1
#> [18,] 4.08      1      .      .
#> [19,] 4.93      1      .      .
#> [20,] 4.22      1      .      .
#> [21,] 3.70      1      .      .
#> [22,] 2.76      .      .      1
#> [23,] 3.15      .      .      1
#> [24,] 3.73      .      .      1
#> [25,] 3.08      .      .      1
#> [26,] 4.08      1      .      .
#> [27,] 4.43      1      .      .
#> [28,] 3.77      1      .      .
#> [29,] 4.22      .      .      1
#> [30,] 3.62      .      1      .
#> [31,] 3.54      .      .      1
#> [32,] 4.11      1      .      .

sparse_model_matrix(
  mpg ~ i(vs) | gear^cyl, data = mtcars, 
  type = c("rhs", "fixef"),
  collin.rm = FALSE
)
#> 32 x 10 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 10 column names 'vs::0', 'vs::1', 'gear^cyl::3_4' ... ]]
#>                          
#>  [1,] 1 . . . . . 1 . . .
#>  [2,] 1 . . . . . 1 . . .
#>  [3,] . 1 . . . 1 . . . .
#>  [4,] . 1 . 1 . . . . . .
#>  [5,] 1 . . . 1 . . . . .
#>  [6,] . 1 . 1 . . . . . .
#>  [7,] 1 . . . 1 . . . . .
#>  [8,] . 1 . . . 1 . . . .
#>  [9,] . 1 . . . 1 . . . .
#> [10,] . 1 . . . . 1 . . .
#> [11,] . 1 . . . . 1 . . .
#> [12,] 1 . . . 1 . . . . .
#> [13,] 1 . . . 1 . . . . .
#> [14,] 1 . . . 1 . . . . .
#> [15,] 1 . . . 1 . . . . .
#> [16,] 1 . . . 1 . . . . .
#> [17,] 1 . . . 1 . . . . .
#> [18,] . 1 . . . 1 . . . .
#> [19,] . 1 . . . 1 . . . .
#> [20,] . 1 . . . 1 . . . .
#> [21,] . 1 1 . . . . . . .
#> [22,] 1 . . . 1 . . . . .
#> [23,] 1 . . . 1 . . . . .
#> [24,] 1 . . . 1 . . . . .
#> [25,] 1 . . . 1 . . . . .
#> [26,] . 1 . . . 1 . . . .
#> [27,] 1 . . . . . . 1 . .
#> [28,] . 1 . . . . . 1 . .
#> [29,] 1 . . . . . . . . 1
#> [30,] 1 . . . . . . . 1 .
#> [31,] 1 . . . . . . . . 1
#> [32,] . 1 . . . 1 . . . .

Created on 2023-05-17 with reprex v2.0.2

kylebutts commented 1 year ago

BTW, I probably need to make a few more changes to make it faster, so I'll push and circle back when I think it's in a good state!

kylebutts commented 1 year ago

@lrberge I think we are good to merge!

Some dev notes for the HC2/HC3 implementation:

  1. Small-sample adjustments are a bit awkward with HC2/HC3 standard errors (since they are themselves a form of small-sample adjustment). vcov_hetero_internal multiplies by N / (N - 1) (cluster.adj), which isn't needed for HC2/HC3. This makes vcov(est, "hc2", ssc = ssc(adj = FALSE)) not match sandwich::vcovHC(est, "HC2"), which is not a good default IMO. To fix this, instead of making the adjustment in vcov_hetero_internal, I adjust the scores before getting to that point, multiplying them by sqrt(n / (n - 1)) in the scores section of the code. The reason I think this is the best option is that, in the scores section of vcov.fixest, I have to update the scores for HC2/HC3 anyway, so it's natural to make the HC1 adjustment to the scores at that point as well (see lines 757-794 in R/VCOV.R). A reference sketch of the HC2/HC3 formulas follows these notes.

  2. I don't think this code does anything. The attribute ss_adj is not added by any vcov function:

    ss_adj = attr(vcov_noAdj, "ss_adj")
    if(!is.null(ss_adj)){
        if(is.function(ss_adj)){
            ss_adj = ss_adj(n = n, K = K)
        }
        attr(vcov_noAdj, "ss_adj") = NULL
    }
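
For reference, here's the textbook HC2/HC3 computation for a plain lm() fit (a sketch against base R and sandwich, not the fixest internals); the 1/(1 - h_ii) weighting is itself the small-sample correction, with no extra n/(n - 1) factor:

library(sandwich)

fit = lm(mpg ~ hp + wt, mtcars)
X = model.matrix(fit)
e = resid(fit)
h = hatvalues(fit)
bread = solve(crossprod(X))

# HC2 weights squared residuals by 1/(1 - h); HC3 by 1/(1 - h)^2
vcov_hc2 = bread %*% crossprod(X, X * (e^2 / (1 - h))) %*% bread
vcov_hc3 = bread %*% crossprod(X, X * (e^2 / (1 - h)^2)) %*% bread

all.equal(vcov_hc2, vcovHC(fit, type = "HC2"), check.attributes = FALSE) # TRUE
all.equal(vcov_hc3, vcovHC(fit, type = "HC3"), check.attributes = FALSE) # TRUE
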
grantmcdermott commented 1 year ago

@kylebutts This looks very interesting. Can you provide some quick benchmarks/comparisons to show practical gains over the current approach? Ofc we'd expect memory allocations to decrease significantly, but I'm curious to see speed implications too. Thanks.

kylebutts commented 1 year ago

@grantmcdermott I think memory usage will be about the same in most cases. model.matrix.fixest returns just a single vector for the fixed effects (not the matrix of 0/1 dummies), which has the same memory footprint as a sparse matrix of the 0/1 dummies. The only difference is with i(), where the sparse version saves memory.

The creation step is just slightly slower than Matrix::sparse.model.matrix (tens of milliseconds), but it (1) has more features (collin.rm) and (2) allows full use of fixest formula semantics. The slow step is actually adding colnames to the matrix (pasting strings together is surprisingly slow).

The real advantage is not the speed at which you can generate the matrix but the uses afterwards. For example, https://github.com/s3alfisc/summclust/pull/19#issue-1751016605 has a 13x speedup.

Here's a quick example of the speedups of avoiding dense matrices:

library(data.table)
library(fixest)
library(Matrix)

# Test -------------------------------------------------------------------------

N = 10000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(1, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]

# Estimate FE model

model = feols(y ~ x | id, df)

X = fixest:::sparse_model_matrix(
  y ~ x | id, df, type = c("rhs", "fixef"), 
  combine = TRUE, collin.rm = FALSE
)
X_dense = as.matrix(X)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 2.3 GiB

bench::mark(
  Matrix::crossprod(X, df$y),
  Matrix::crossprod(X_dense, df$y),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 Matrix::crossprod(X, df$y)          285µs   322µs   2620.       493KB     2.20
#> 2 Matrix::crossprod(X_dense, df$y)    488ms   499ms      2.00    89.8KB     0
s3alfisc commented 1 year ago

Just to chime in here - there are indeed large performance advantages for the cluster jackknife CRV3 estimator as suggested by MNW (2023) if X is sparse. And if I read the MNW paper correctly, I think that many people will be interested in using the CRV3 estimator in the future =) Unfortunately, it does not play too well with (arbitrary) fixed effects, so one has to bite the bullet and create the design matrix of all dummies in most cases.

Here is a quick implementation of a part of the algorithm, which works both for sparse and dense matrices:

library(Matrix)

N <- 100000
k <- 100
X <- factor(sample(1:k, N, TRUE))
df <- data.frame(X = X)
mf <- model.frame(~X, df)
X <- sparse.model.matrix(mf)
beta <- rnorm(ncol(X))
y <- X %*% beta + rnorm(N)
cluster <- sample(letters, N, TRUE)

length(unique(cluster))
# 26 clusters

y_dense <- as.matrix(y)
X_dense <- as.matrix(X)

cluster_jackknife <- function(X, y, cluster){

  XX <- crossprod(X)
  Xy <- t(X) %*% y

  cluster_groups <- unique(cluster)
  XgXg_list <- lapply(cluster_groups, function(g) crossprod(X[cluster == g, ]))
  Xgyg_list <- lapply(cluster_groups, function(g) t(X[cluster == g, ]) %*% y[cluster == g])

  # compute leave-one-cluster-out coefs: (X'X - Xg'Xg)^{-1} (X'y - Xg'yg)
  beta_jack <- lapply(seq_along(cluster_groups), function(g)
    MASS::ginv(as.matrix(XX - XgXg_list[[g]])) %*% (Xy - Xgyg_list[[g]]))
  beta_jack
}

microbenchmark::microbenchmark(
  cluster_jackknife(X_dense, y_dense, cluster),
  cluster_jackknife(X, y, cluster), 
  times = 1
)

# cluster_jackknife(X_dense, y_dense, cluster)
# cluster_jackknife(X, y, cluster)
# min        lq      mean    median        uq       max neval
# 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250     1
# 385.1208  385.1208  385.1208  385.1208  385.1208  385.1208     1

So in this case, the sparse version is around 4-5x faster. The larger k and the number of clusters, the larger the speed gains will be.

Of course, you could argue that I could just create the model matrix myself (at the cost of some performance overhead) - but another great advantage is that this PR creates the full model matrix of dummies, which will allow me to support all sorts of fixest syntax I currently don't know how to handle in fwildclusterboot and summclust (e.g. varying slopes).

kylebutts commented 1 year ago

Figured out a faster way to implement this using the blockwise formula for the projection matrix: https://en.wikipedia.org/wiki/Projection_matrix#Blockwise_formula
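
The idea, in a minimal one-way FE sketch (base R only, not the PR's actual code): write X = [D W] with D the fixed-effect dummies, so P = P_D + M_D W (W' M_D W)^{-1} W' M_D. The diagonal is then 1/(group size) plus a term involving only the within-demeaned covariates, and the dummy matrix never needs to be formed:

W = as.matrix(mtcars[, c("hp", "wt")])
g = mtcars$cyl

# FE block: diag(P_D) = 1 / group size for one-way fixed effects
h_fe = 1 / ave(rep(1, nrow(W)), g, FUN = sum)

# covariate block: within-demean W (this is M_D W), then take
# diag(W_tilde (W_tilde' W_tilde)^{-1} W_tilde')
W_tilde = apply(W, 2, function(x) x - ave(x, g))
h_x = rowSums((W_tilde %*% solve(crossprod(W_tilde))) * W_tilde)

h = h_fe + h_x

# matches the hat values from the dense dummy regression
all.equal(unname(h), unname(hatvalues(lm(mpg ~ hp + wt + factor(cyl), mtcars)))) # TRUE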

Before (using approx to speed it up):

library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#> 
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#> 
#>     shift

# devtools::load_all()

# Test -------------------------------------------------------------------------

N = 100000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)

# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)

tic()
P_ii_orig = hatvalues(model, exact = FALSE, p = 500)
toc()
#> 12.39 sec elapsed

Created on 2023-06-17 with reprex v2.0.2

After (solving exactly):

library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#> 
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#> 
#>     shift

# devtools::load_all()

# Test -------------------------------------------------------------------------

N = 100000
var_fes = 10
var_eps = 4

# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))

# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)

# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)

tic()
P_ii_orig = hatvalues(model, exact = TRUE)
toc()
#> 1.1 sec elapsed

Created on 2023-06-17 with reprex v2.0.2

lrberge commented 5 months ago

Very cool PR! Nice work! I'll try to merge it next week.

kylebutts commented 5 months ago

@lrberge I put a couple of things in this one PR. Want me to rebase onto main? I'll also adjust the work-with-data.table commits based on your changes.

kylebutts commented 1 month ago

Just found a bug (column names are not correct for factor()). I suspect the fix will be to deparse the terms?

library(fixest)
est = feols(
  mpg ~ factor(am) + disp | vs, mtcars
)
fixest::sparse_model_matrix(est, collin.rm = FALSE)
#> 32 x 3 sparse Matrix of class "dgCMatrix"
#>       0 1  disp
#>  [1,] . 1 160.0
#>  [2,] . 1 160.0
#>  [3,] . 1 108.0
#>  [4,] 1 . 258.0
#>  [5,] 1 . 360.0
#>  [6,] 1 . 225.0
#>  [7,] 1 . 360.0
#>  [8,] 1 . 146.7
#>  [9,] 1 . 140.8
#> [10,] 1 . 167.6
#> [11,] 1 . 167.6
#> [12,] 1 . 275.8
#> [13,] 1 . 275.8
#> [14,] 1 . 275.8
#> [15,] 1 . 472.0
#> [16,] 1 . 460.0
#> [17,] 1 . 440.0
#> [18,] . 1  78.7
#> [19,] . 1  75.7
#> [20,] . 1  71.1
#> [21,] 1 . 120.1
#> [22,] 1 . 318.0
#> [23,] 1 . 304.0
#> [24,] 1 . 350.0
#> [25,] 1 . 400.0
#> [26,] . 1  79.0
#> [27,] . 1 120.3
#> [28,] . 1  95.1
#> [29,] . 1 351.0
#> [30,] . 1 145.0
#> [31,] . 1 301.0
#> [32,] . 1 121.0
fixest::sparse_model_matrix(est, collin.rm = TRUE)
#> 32 x 1 sparse Matrix of class "dgCMatrix"
#>        disp
#>  [1,] 160.0
#>  [2,] 160.0
#>  [3,] 108.0
#>  [4,] 258.0
#>  [5,] 360.0
#>  [6,] 225.0
#>  [7,] 360.0
#>  [8,] 146.7
#>  [9,] 140.8
#> [10,] 167.6
#> [11,] 167.6
#> [12,] 275.8
#> [13,] 275.8
#> [14,] 275.8
#> [15,] 472.0
#> [16,] 460.0
#> [17,] 440.0
#> [18,]  78.7
#> [19,]  75.7
#> [20,]  71.1
#> [21,] 120.1
#> [22,] 318.0
#> [23,] 304.0
#> [24,] 350.0
#> [25,] 400.0
#> [26,]  79.0
#> [27,] 120.3
#> [28,]  95.1
#> [29,] 351.0
#> [30,] 145.0
#> [31,] 301.0
#> [32,] 121.0
model.matrix(est)
#>                     factor(am)1  disp
#> Mazda RX4                     1 160.0
#> Mazda RX4 Wag                 1 160.0
#> Datsun 710                    1 108.0
#> Hornet 4 Drive                0 258.0
#> Hornet Sportabout             0 360.0
#> Valiant                       0 225.0
#> Duster 360                    0 360.0
#> Merc 240D                     0 146.7
#> Merc 230                      0 140.8
#> Merc 280                      0 167.6
#> Merc 280C                     0 167.6
#> Merc 450SE                    0 275.8
#> Merc 450SL                    0 275.8
#> Merc 450SLC                   0 275.8
#> Cadillac Fleetwood            0 472.0
#> Lincoln Continental           0 460.0
#> Chrysler Imperial             0 440.0
#> Fiat 128                      1  78.7
#> Honda Civic                   1  75.7
#> Toyota Corolla                1  71.1
#> Toyota Corona                 0 120.1
#> Dodge Challenger              0 318.0
#> AMC Javelin                   0 304.0
#> Camaro Z28                    0 350.0
#> Pontiac Firebird              0 400.0
#> Fiat X1-9                     1  79.0
#> Porsche 914-2                 1 120.3
#> Lotus Europa                  1  95.1
#> Ford Pantera L                1 351.0
#> Ferrari Dino                  1 145.0
#> Maserati Bora                 1 301.0
#> Volvo 142E                    1 121.0

Created on 2024-05-09 with reprex v2.1.0