Boottest() cannot grab information after using `lapply` or `purrr::map` with regressions. Workaround obstructs presentation with modelsummary.

s3alfisc / fwildclusterboot

Fast Wild Cluster Bootstrap Inference for Regression Models / OLS in R. Additionally, R port to WildBootTests.jl via the JuliaConnectoR.

https://s3alfisc.github.io/fwildclusterboot/

GNU General Public License v3.0

23 stars 4 forks source link

Boottest() cannot grab information after using `lapply` or `purrr::map` with regressions. Workaround obstructs presentation with modelsummary. #42

Closed michaeltopper1 closed 2 years ago

michaeltopper1 commented 2 years ago

I've been trying to implement the bootstrapped standard errors on my regressions. However, to save code, I frequently use lapply or purrr::map to do this iteratively. However, I've noticed that this causes problems with the boottest function. Specifically, it cannot grab the correct model objects when using lapply or purrr::map. See the below example:

library(tidyverse)
library(fixest)
library(fwildclusterboot)
library(modelsummary)

## creating a list of data
cars_list <- list(mtcars, mtcars)

## iterating through regressions
cars_output <- map(cars_list,~feols(mpg ~ cyl, data = .))

## this cannot be evaluated - it believes object `.` is the data to be used
map(cars_output, ~boottest(., param = "cyl", clustid = "gear", B = 333))

There is a workaround to this. I can do the following:

## Note that I am creating a named list here
cars_list <- dplyr::lst(mtcars, mtcars)

## iterating through the regressions - giving the call$data the name mtcars
cars_output <- imap(cars_list,~{
  model <- feols(mpg ~ cyl, data = .x)
  model$call$data <- as.symbol(.y) 
  model})

x <- map(cars_output, ~boottest(., param = "cyl", clustid = "gear", B = 333))

While this workaround DOES produce the desired results, exporting them to modelsummary gives errors when it comes to goodness-of-fit statistics. See here:

## produces only coefficient estimates and standard errors, no goodness-of-fit statistics
msummary(x, estimate = "{estimate} ({p.value})", 
         statistic = "[{conf.low}, {conf.high}]", output = "data.frame")

I am wondering if this function can be updated so that this sort of framework can be supported.

s3alfisc commented 2 years ago

Hi Michael, thanks for reporting this! I am fairly certain that the first error is due to issues with the environment from which boottest() tries to fetch the data. I will take a closer look this evening! Best, Alex

s3alfisc commented 2 years ago

Ok, it seems like I will have to take a deeper look at the section on environments in Advanced R for a "quick fix".

Alternatively, I might fully replace the call to model.frame (which causes the troubles above) and work with model.matrix instead. I thought about doing this anyways, as working with model.matrices would allow to call boottest() based on objects of type fixest with fixest formula sugar (e.g. i() or ^, which boottest() currently does not support). It would also simplify the pre-processing code by a lot.

The main caveat is that its not straightforward to fetch the dependent variable without a model.frame for models of type lm - one workaround would be to re-create the dependent variable by using predict() and residuals methods as Y <- predict(obj) + residuals(obj). Unfortunately, there is no predict method available for lfe::felm, so I'd have to come up with another solution. For fixest, there is no problem - I can simply use fixest:::model.matrix.fixest(, type = "lhs"). Also, it would be much easier to add support for the WRE after IV-estimation via feols().

Beyond, not relying on model.frame would force me to change the way how cluster variables are passed to boottest() - they either would have to be included in the model.matrix, in which case users could specify the clusters as a character vector or formula. If not part of the model.matrix, I would have to ask users to pass it to boottest() as a vector of length N. Currently, if the cluster variable is not part of the model formula, I add it to the formula of the model call and evaluate the model.frame on the updated formula.

Anyways, I am not sure if I managed to explain myself clearly 😄 - the main conclusion is that I'll have to think all of this through. If there's a quick fix, I will update the package during the week, but it is likelier that I will move the pre-processing from to model.matrices over the weekend.

michaeltopper1 commented 2 years ago

Ah I see. Thank you for the quick response and great explanation. In the meantime, I'll use my messy workaround. Looking forward to the changes!

s3alfisc commented 2 years ago

Hi Michael,

Here's a quick workaround for your problem using rlang's !! operator and lapply - the main difference to your solution is that the code below updates the data part of the call to lm before the call is evaluated, not after:

library(rlang)
library(modelsummary)
library(fwildclusterboot)

res <- 
lapply(list(quote(mtcars), quote(mtcars)), function(x){
  fit_uneval <- rlang::expr(lm(mpg ~ cyl, data = !!x))
  fit <- eval(fit_uneval)
  boottest(fit, param = "cyl", clustid = "gear", B = 333)
})

msummary(res, estimate = "{estimate} ({p.value})", 
         statistic = "[{conf.low}, {conf.high}]", output = "data.frame")

part      term         statistic          Model 1          Model 2
1 estimates 1*cyl = 0 modelsummary_tmp1   -2.876 (0.000)   -2.876 (0.000)
2 estimates 1*cyl = 0 modelsummary_tmp2 [-3.188, -2.307] [-3.188, -2.307]
3       gof  Num.Obs.                                 32               32
4       gof        R2                              0.726            0.726
5       gof   R2 Adj.                              0.717            0.717
6       gof       AIC                              169.3            169.3
7       gof       BIC                              173.7            173.7
8       gof  Log.Lik.                            -81.653          -81.653

Best, Alex

michaeltopper1 commented 2 years ago

Thanks for this work around Alex! However, while the boottest works great using this method, I can't seem to get the goodness-of-fit statistics to work when I copy your code. My packages (except fwildclusterboot) are all up to date. Is this something that could be fixed with fwildclusterboot version 0.8 rather than 0.7? While I would gladly update to version 0.8, my computer (Mac) won't let me as I get the classic "There is a binary source available but the source version is later" error.

s3alfisc commented 2 years ago

I don't think that your problem will be fixed by updating to fwildclusterboot 0.8, unfortunately. 0.8 is mostly about integrating WildBootTests.jl and reorganizing some code. Nevertheless, I think the binaries for Mac should be on CRAN by now? If installation from CRAN does not work, you can try to install a compiled version (for mac) from r-universe by running

install.packages('fwildclusterboot', repos ='https://s3alfisc.r-universe.dev')

By the way, I am almost done moving from model.frame() to model.matrix() methods, and by now I am less optimistic that just doing that will let your map() example run out of the box. I'll let you know once I manage to fix this :)

s3alfisc commented 2 years ago

See this issue: https://github.com/kosukeimai/MatchIt/issues/53

s3alfisc commented 2 years ago

Bug seems to be specific to fixest::feols(), which evaluates in custom environment:

iter1 <- 
  lapply(
 list(mtcars, mtcars), function(x){
   fit <- feols(mpg ~ cyl, data = x)
 }
)

iter2 <- 
  lapply(iter1, function(x){
    boottest(x, param = "cyl", clustid = "gear", B = 333)
})
# Error in eval(model$call$data, envir) : object 'x' not found
iter2 <- 
  lapply(iter1, function(x){
    sandwich::vcovCL(x, cluster = ~gear)
  })
# Error in eval(model$call$data, envir) : object 'x' not found

but everything runs fine if feols() is replaced with lm() or felm()

iter1 <- 
  lapply(
 list(mtcars, mtcars), function(x){
   fit <- lm(mpg ~ cyl, data = x)
 }
)

fixest::feols() evaluates in a custom environment:

library(fixest)
library(lfe)

feols_fit <- feols(mpg ~ cyl, data = mtcars)
lm_fit <- lm(mpg ~ cyl, data = mtcars)
felm_fit <- felm(mpg ~ cyl, data = mtcars)

environment(terms(feols_fit))
# <environment: 0x0000029cd30926a8>
environment(terms(lm_fit))
# <environment: R_GlobalEnv>
environment(terms(felm_fit))
# <environment: R_GlobalEnv>

s3alfisc commented 2 years ago

Solution is described here: the issue was - as expected - an incorrect environment used with expand.model.frame. I have fixed the issue already in the code and will (hopefully) release a new version of fwildclusterboot by tomorrow evening (EU time) 😃

s3alfisc commented 2 years ago

With the latest commit ace208ebf5265a06f2d6a64bf7e020165ac9c827, it is possible to use boottest() with fixest::feols() and purrr::map() / lapply(). You should be able to install the latest version from r-universe within the next hour (given all unit tests pass, they run for a while).

Unfortunately, the resulting list of objects of type boottest is still not working with modelsummary::msummary():

library(tidyverse)
library(fixest)
library(fwildclusterboot)
library(modelsummary)

## creating a list of data
cars_list <- list(mtcars, mtcars)

## iterating through regressions
cars_output <- lapply(cars_list, \(x) feols(mpg ~ cyl, data = x))

## this cannot be evaluated - it believes object `.` is the data to be used
boot_list = lapply(cars_output, \(x) boottest(x, param = "cyl", clustid = "gear", B = 333))

msummary(model = boot_list, 
         estimate = "{estimate} ({p.value})", 
         statistic = "[{conf.low}, {conf.high}]")

Instead, you can simply add desired columns produced via fwildclusterboot::boottest() to the modelsummary() output to cars_output:

library(tibble)
rows <- tribble(~term,          ~OLS,  ~Logit,
                'cyl, boottest p-value', boot_list[[1]]$p_val,   boot_list[[1]]$p_val,
                'cyl, boottest conf.int', paste(round(boot_list[[1]]$conf_int,3)), paste(round(boot_list[[2]]$conf_int),3))
attr(rows, 'position') <- c(5,6)

modelsummary(cars_output, add_rows = rows)

michaeltopper1 commented 2 years ago

Beautiful! Thanks for this update. Although it cannot work with modelsummary::msummary, your work around is a nice and simple solution that I frequently use to customize tables. Thank you again and very happy that I will now be able to wildclusterboot my standard errors with lapply and fixest!

s3alfisc commented 2 years ago

Hi, 'msummary' is actually just a shortcut for 'modelsummary", as far as I know :)