Open jtorcasso opened 1 year ago
This is great, thanks Jake.
The problem with aggregate
is that it doesn't necessarily work well for non-linear families.
library(etwfe)
data("mpdta", package="did")
mpdta$emp = exp(mpdta$lemp)
pmod = etwfe(
fml = emp ~ lpop,
tvar = year,
gvar = first.treat,
data = mpdta,
vcov = ~countyreal,
family = "poisson"
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and eight others have been removed because of collinearity (see $collin.var).
emfx(pmod)
#>
#> Term Contrast .Dtreat Estimate Std. Error z Pr(>|z|)
#> .Dtreat mean(TRUE) - mean(FALSE) TRUE -28.6 18.4 -1.55 0.12
#> 2.5 % 97.5 %
#> -64.6 7.49
#>
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
aggregate(pmod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))
#> Estimate Std. Error t value Pr(>|t|)
#> ATT -0.05247523 0.01206096 -4.350834 1.356206e-05
Created on 2023-04-26 with reprex v2.0.2
But it's definitely an attractive option for big linear models. Let me think about some internal ode logic.
Hi both - one other big advantage of moving to fixest::aggregate()
is that it will make it very straightforward for me to implement wild cluster bootstrap support, as I basically copied it into boot_aggregate to support fixest::sunab()
=)
Hi @jtorcasso, here's a first attempt of wild cluster bootstrap and etwfe interoperatibility via fwildclusterboot::boot_aggregate()
, using @kylebutts fixest
clone that introduces the sparse_model_matrix
method. It's very buggy, so using other arguments of boot_aggregate
than used in the example / non-fixest objects might fail.
library(devtools)
install_github("https://github.com/s3alfisc/fwildclusterboot/tree/etwfe-support")
# this should install kyle's fork of fixest, if not, do it manually
#install_github("https://github.com/kylebutts/fixest/tree/sparse-matrix")
library(etwfe)
library(fwildclusterboot)
data("mpdta", package="did")
mod = etwfe(
fml = lemp ~ lpop,
tvar = year,
gvar = first.treat,
data = mpdta,
vcov = ~countyreal,
ssc = fixest::ssc(adj = FALSE, cluster.adj = FALSE)
)
emfx(mod)
# Term Contrast .Dtreat Estimate Std. Error z Pr(>|z|) S 2.5 %
# .Dtreat mean(TRUE) - mean(FALSE) TRUE -0.0506 0.0124 -4.08 <0.001 14.4 -0.075
# 97.5 %
# -0.0263
aggregate(
mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$")
)
# Estimate Std. Error t value Pr(>|t|)
# ATT -0.05062703 0.0124121 -4.078845 5.267553e-05
boot_aggregate(
mod,
B = 999,
agg = c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"),
clustid = ~countyreal,
ssc = boot_ssc(adj = FALSE, cluster.adj = FALSE)
)
# Run the wild bootstrap: this might take some time...(but
# hopefully not too much time =) ).
# |======================================================| 100%
# Estimate t value Pr(>|t|) [0.025% 0.975%]
# [1,] -0.05062703 -4.078845 0.001001001 -0.07420141 -0.02634166
# Warning message:
# Matrix inversion failure: Using a generalized inverse instead. Check the produced
# t-statistic, does it match the one of your regression package (under the same
# small sample correction)? If yes, this is likely not something to worry about.
Note that the non-bootstrapped t-statistics from fwildclusterboot
and fixest
match exactly, as it should be =) Also, note that using solve()
fails, but using a generalized inverse seems to get the job done.
Best, Alex
Hi, I am following up on this thread to see if there were any updates or recommendations to aggregate estimates for event studies more efficiently. I'm working with several datasets, one of which is nearly 50 million observations and 35 periods, and the current marginaleffects setup with emfx takes so long its impossible to generate meaningful summary results. (This is driven by the SE estimation). I need SEs for plots, so turning off vcov isn't really an option. Any help or suggestions would be greatly appreciated!
@poetlarsen Bootstrap is your friend on this one with emfx(mod, vcov = FALSE)
Or just using "aggregate" as described above should also work, unless you run a Poisson regression?
+1 to @kylebutts and @s3alfisc's comments.
OTOH—and apart from the non-linear family issue—the other tricky thing with the aggregate.fixest
approach is that there isn't a straightforward way to recover an event study path using the function's regex logic. The issue being that regex doesn't understand ordering in a way that allows it to easily group relative timing among the relevant cohorts. At least, not with the the standard etwfe
specification, since we don't specify relative timing up front as part of the model formula. (If anyone has an idea of how to get around this, I'm all ears!)
Summarising, I think we can use aggregate.fixest
to quickly recover the "simple" ATT, as well as the "group" and "calendar" effects. But not the "event" type. Here's some sample code that generalizes @jtorcasso's original example that should work on any linear model, regardless of the input data and variable specifics.
## estimate your model
mod = etwfe(...)
## Simple ATT
# emfx(mod) ## for comparison
aggregate(mod, c("ATT" = "^\\.Dtreat(?:(?!_dm$).)*$"))
# For more complex ATTs we'll need the input gvar and tvar
gvar = attr(mod, "etwfe")[["gvar"]]
tvar = attr(mod, "etwfe")[["tvar"]]
## Group ATTs
# emfx(mod, "group") ## for comparison
aggregate(
mod,
paste0("(", gvar, "::[[:digit:]]+)(?:(?!_dm$).)*$")
)
## Calendar ATTs
# emfx(mod, "calendar") ## for comparison
aggregate(
mod,
paste0("(", tvar, "::[[:digit:]]+)((?:(?!_dm$).)*$)")
)
## Event ATTs (??)
# emfx(mod, "event")) ## For comparison
## ??
# aggregate(
# mod,
# paste0("(", gvar, ".*", tvar, ")((?:(?!_dm$).)*$)")
# )
Considering all of this, @poetlarsen it might be worth exploring whether another of the "modern" estimators isn't better suited to your use case. The fastest option for that many observations is almost certainly going to be sunab, although I suspect that did2s will give a good account itself too.
Fixest's aggregate command may be an efficient alternative for inference in cases when
marginaleffects
takes a long time. Consider the MWE below:emfx
returns an estimate of-0.05062703
with std. error of0.01249979
, andaggregate
returns the same exact values in this case.