Inference using fixest's aggregate command

jtorcasso commented 1 year ago

Fixest's aggregate command may be an efficient alternative for inference in cases when marginaleffects takes a long time. Consider the MWE below:

library(etwfe)

data("mpdta", package="did")

mod = etwfe(
    fml = lemp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal
)

emfx(mod)

aggregate(mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))

emfx returns an estimate of -0.05062703 with std. error of 0.01249979, and aggregate returns the same exact values in this case.

grantmcdermott commented 1 year ago

This is great, thanks Jake.

The problem with aggregate is that it doesn't necessarily work well for non-linear families.

library(etwfe)

data("mpdta", package="did")
mpdta$emp = exp(mpdta$lemp)

pmod = etwfe(
    fml = emp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal,
    family = "poisson"
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and eight others have been removed because of collinearity (see $collin.var).

emfx(pmod)
#> 
#>     Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)
#>  .Dtreat mean(TRUE) - mean(FALSE)    TRUE    -28.6       18.4 -1.55     0.12
#>  2.5 % 97.5 %
#>  -64.6   7.49
#> 
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

aggregate(pmod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))
#>        Estimate Std. Error   t value     Pr(>|t|)
#> ATT -0.05247523 0.01206096 -4.350834 1.356206e-05

^{Created on 2023-04-26 with reprex v2.0.2}

But it's definitely an attractive option for big linear models. Let me think about some internal ode logic.

s3alfisc commented 1 year ago

Hi both - one other big advantage of moving to fixest::aggregate() is that it will make it very straightforward for me to implement wild cluster bootstrap support, as I basically copied it into boot_aggregate to support fixest::sunab() =)

s3alfisc commented 1 year ago

Hi @jtorcasso, here's a first attempt of wild cluster bootstrap and etwfe interoperatibility via fwildclusterboot::boot_aggregate(), using @kylebutts fixest clone that introduces the sparse_model_matrix method. It's very buggy, so using other arguments of boot_aggregate than used in the example / non-fixest objects might fail.

library(devtools)
install_github("https://github.com/s3alfisc/fwildclusterboot/tree/etwfe-support")
# this should install kyle's fork of fixest, if not, do it manually
#install_github("https://github.com/kylebutts/fixest/tree/sparse-matrix")

library(etwfe)
library(fwildclusterboot)
data("mpdta", package="did")

mod = etwfe(
  fml = lemp ~ lpop,
  tvar  = year,
  gvar = first.treat,
  data = mpdta,
  vcov = ~countyreal, 
  ssc = fixest::ssc(adj = FALSE, cluster.adj = FALSE)
)

emfx(mod)
# Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)    S  2.5 %
# .Dtreat mean(TRUE) - mean(FALSE)    TRUE  -0.0506     0.0124 -4.08   <0.001 14.4 -0.075
# 97.5 %
# -0.0263

aggregate(
  mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$")
)
# Estimate Std. Error   t value     Pr(>|t|)
# ATT -0.05062703  0.0124121 -4.078845 5.267553e-05

boot_aggregate(
  mod,
  B = 999, 
  agg = c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"), 
  clustid = ~countyreal, 
  ssc = boot_ssc(adj = FALSE, cluster.adj = FALSE)
)
# Run the wild bootstrap: this might take some time...(but 
# hopefully not too much time =) ). 
# |======================================================| 100%     
#        Estimate   t value    Pr(>|t|)     [0.025%     0.975%]
# [1,] -0.05062703 -4.078845 0.001001001 -0.07420141 -0.02634166
# Warning message:
#   Matrix inversion failure: Using a generalized inverse instead. Check the produced
# t-statistic, does it match the one of your regression package (under the same
# small sample correction)? If yes, this is likely not something to worry about.

Note that the non-bootstrapped t-statistics from fwildclusterboot and fixest match exactly, as it should be =) Also, note that using solve() fails, but using a generalized inverse seems to get the job done.

Best, Alex

poetlarsen commented 9 months ago

Hi, I am following up on this thread to see if there were any updates or recommendations to aggregate estimates for event studies more efficiently. I'm working with several datasets, one of which is nearly 50 million observations and 35 periods, and the current marginaleffects setup with emfx takes so long its impossible to generate meaningful summary results. (This is driven by the SE estimation). I need SEs for plots, so turning off vcov isn't really an option. Any help or suggestions would be greatly appreciated!

kylebutts commented 9 months ago

@poetlarsen Bootstrap is your friend on this one with emfx(mod, vcov = FALSE)

s3alfisc commented 9 months ago

Or just using "aggregate" as described above should also work, unless you run a Poisson regression?

grantmcdermott commented 9 months ago

+1 to @kylebutts and @s3alfisc's comments.

OTOH—and apart from the non-linear family issue—the other tricky thing with the aggregate.fixest approach is that there isn't a straightforward way to recover an event study path using the function's regex logic. The issue being that regex doesn't understand ordering in a way that allows it to easily group relative timing among the relevant cohorts. At least, not with the the standard etwfe specification, since we don't specify relative timing up front as part of the model formula. (If anyone has an idea of how to get around this, I'm all ears!)

Summarising, I think we can use aggregate.fixest to quickly recover the "simple" ATT, as well as the "group" and "calendar" effects. But not the "event" type. Here's some sample code that generalizes @jtorcasso's original example that should work on any linear model, regardless of the input data and variable specifics.

## estimate your model
mod = etwfe(...)

## Simple ATT

# emfx(mod) ## for comparison
aggregate(mod, c("ATT" = "^\\.Dtreat(?:(?!_dm$).)*$"))

# For more complex ATTs we'll need the input gvar and tvar

gvar = attr(mod, "etwfe")[["gvar"]]
tvar = attr(mod, "etwfe")[["tvar"]]

## Group ATTs

# emfx(mod, "group") ## for comparison
aggregate(
  mod,
  paste0("(", gvar, "::[[:digit:]]+)(?:(?!_dm$).)*$")
)

## Calendar ATTs

# emfx(mod, "calendar") ## for comparison
aggregate(
  mod,
  paste0("(", tvar, "::[[:digit:]]+)((?:(?!_dm$).)*$)") 
)

## Event ATTs (??)

# emfx(mod, "event")) ## For comparison
## ??
# aggregate(
#   mod,
#   paste0("(", gvar, ".*", tvar, ")((?:(?!_dm$).)*$)")
# )

Considering all of this, @poetlarsen it might be worth exploring whether another of the "modern" estimators isn't better suited to your use case. The fastest option for that many observations is almost certainly going to be sunab, although I suspect that did2s will give a good account itself too.

grantmcdermott / etwfe

Inference using fixest's aggregate command #33