grantmcdermott / etwfe

Extended two-way fixed effects
https://grantmcdermott.com/etwfe/
Other
50 stars 11 forks source link

emfx() error when using offset in Poisson #28

Closed mariofiorini closed 1 year ago

mariofiorini commented 1 year ago

Hi Scott, thanks for the great package. I am using etwfe in a Poisson regression with an exposure variable. etwfe() works but emfx() returns an error message. Example using mpdta data and the lpop variable as the exposure.

data("mpdta", package = "did")
mpdta$emp = exp(mpdta$lemp)

etwfe(
  emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
  family = "poisson", offset = ~log(lpop)
) |>
  emfx()

Error: Unable to compute predicted values with this model. You can try to supply a different dataset to the `newdata` argument. If
  this does not work, you can file a report on the Github Issue Tracker:
  https://github.com/vincentarelbundock/marginaleffects/issues

It looks like a problem with the marginaleffects package. Let me know if there is a simple workaround or if I am mistaken.

grantmcdermott commented 1 year ago

Scott? :-)

Yeah, looks like an upstream issue with marginaleffects. I'm not at my computer rn to check, but have you tried passing a vanilla fixest::fepois() model with an offset to marginaleffects::slopes()? If that causes a problem, then we should file an issue over there. Please try and lmk.

mariofiorini commented 1 year ago

Hi Grant, I am really sorry, I don't know where 'Scott' came from! Following your suggestion I tried

library(fixest)
library(marginaleffects)

data(trade)
gravity_pois_offset = fepois(data = trade, Euros ~ 1 | Origin , offset = ~log(dist_km))
marginaleffects::slopes(gravity_pois_offset)

and got the error message

Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and
  return a single numeric value or a numeric vector of length 0, with no missing value.
In addition: Warning messages:
1: Unable to extract a variance-covariance matrix from this model. 
2: In formals(fun) : argument is not a function

If I run it without the offset it works

gravity_pois_nooffset = fepois(data = trade, Euros ~ 1 | Origin )
marginaleffects::slopes(gravity_pois_3)

  Term Contrast Estimate
 Origin  BE - AT 54414051
 Origin  BE - AT 54414051
 Origin  BE - AT 54414051
 Origin  BE - AT 54414051
 Origin  BE - AT 54414051
--- 536540 rows omitted. See ?avg_slopes and ?print.marginaleffects --- 
 Origin  SE - AT -4449567
 Origin  SE - AT -4449567
 Origin  SE - AT -4449567
 Origin  SE - AT -4449567
 Origin  SE - AT -4449567
Prediction type:  response 
Columns: rowid, type, term, contrast, estimate, predicted, predicted_hi, predicted_lo, Euros, Origin 

Warning message:
Unable to extract a variance-covariance matrix from this model. 

Note that when using the mpdta data set I got an error message even without the offset, but maybe there I am making a mistake?

fit.no.offset <- fepois(fml = emp ~ 1 | year , data = mpdta)
marginaleffects::slopes(fit.no.offset)

Error in `[[<-.data.frame`(`*tmp*`, "type", value = "response") : 
  replacement has 1 row, data has 0
grantmcdermott commented 1 year ago

Great, thanks for double checking. Given that this is an upstream issue, I'm going to close it here. Please file your previous example on the marginaleffects issue page; I'm sure Vincent will get to the bottom of it soon.

Feel free to reopen here if I'm missing something, or if the problem persists after any updates to marginaleffects.

grantmcdermott commented 1 year ago

PS. Please hyperlink to this issue when you file your issue on the marginaleffects repo (just paste the URL at the top of this page somewhere in the body of your issue text). That way, these two issues will be linked and anyone who runs into the same problem will be able to follow the chain to any fix.

levialtringer commented 1 year ago

Hi there,

I am also wrestling with the implementation of an exposure variable within etwfe.


Background:

I am investigating the impact of wildlife management interventions on reported wildlife strike metrics via a staggered DiD identification strategy. My data is a balanced panel of 24 US airports (14 that receive treatment and 11 that never receive treatment) over the 2005-2021 period. The count of wildlife strikes at airport i in year t is a direct function of the number of aircraft movements that occurred at that airport--i.e., exposure. I am hoping to implement nonlinear DiD with an offset term so that I can model "wildlife strike rates" as opposed to "wildlife strike counts".

Here is my data:

> head(df)
  airport year strikes log_movements treated treatment_year treat
1     BAF 2005       0      11.01602       1           2009     0
2     BAF 2006       0      11.12964       1           2009     0
3     BAF 2007       4      11.05263       1           2009     0
4     BAF 2008       6      10.98798       1           2009     0
5     BAF 2009       6      10.99484       1           2009     1
6     BAF 2010       6      11.03463       1           2009     1

Errors in etwfe with offset:

Like mariofiorni, I am unable to utilize offset and recover ATTs.

  1. No offset included in the model:
    mod <- etwfe(fml = strikes ~ 1,
             tvar = year,
             gvar = treatment_year,
             vcov = ~airport,
             family = "poisson",
             data = df)
    emfx(mod)
> mod <- etwfe(fml = strikes ~ 1,
+              tvar = year,
+              gvar = treatment_year,
+              vcov = ~airport,
+              family = "poisson",
+              data = df)
The variables '.Dtreat:treatment_year::2008:year::2006', '.Dtreat:treatment_year::2008:year::2007' and forty-three others have been removed because of collinearity (see $collin.var).
> emfx(mod)

    Term                 Contrast .Dtreat Estimate Std. Error    z Pr(>|z|)  2.5 % 97.5 %
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE     13.1       6.99 1.87   0.0619 -0.652   26.8

Prediction type:  response 
Columns: type, term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
  1. offset included as an etwfe option:
    mod <- etwfe(fml = strikes ~ 1,
             tvar = year,
             gvar = treatment_year,
             vcov = ~airport,
             family = "poisson",
             data = df, 
             offset = ~log_movements)
    emfx(mod)
> mod <- etwfe(fml = strikes ~ 1,
+              tvar = year,
+              gvar = treatment_year,
+              vcov = ~airport,
+              family = "poisson",
+              data = df, 
+              offset = ~log_movements)
The variables '.Dtreat:treatment_year::2008:year::2006', '.Dtreat:treatment_year::2008:year::2007' and forty-three others have been removed because of collinearity (see $collin.var).
> emfx(mod)
Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and return a
  single numeric value or a numeric vector of length 0, with no missing value.
In addition: Warning message:
In formals(fun) : argument is not a function
  1. offset() included in model formula:
    mod <- etwfe(fml = strikes ~ offset(log_movements),
             tvar = year,
             gvar = treatment_year,
             vcov = ~airport,
             family = "poisson",
             data = df)
> mod <- etwfe(fml = strikes ~ offset(log_movements),
+              tvar = year,
+              gvar = treatment_year,
+              vcov = ~airport,
+              family = "poisson",
+              data = df)
Error in str2lang(x) : <text>:1:94: unexpected input
1: strikes  ~  .Dtreat : i(treatment_year, i.year, ref = 0, ref2 = 2005) / offset(log_movements)_

Simple TWFE approach via fixes::fepois() and marginaleffects::avg_slopes():

Here I attempt the approach recommended by grantmcdermott above.

  1. No offset included in the model:
    mod <- fixest::fepois(strikes ~ treat | airport + year, df, se = "cluster")
    mod
    marginaleffects::avg_slopes(mod)
> mod <- fixest::fepois(strikes ~ treat | airport + year, df, se = "cluster")
> mod
Poisson estimation, Dep. Var.: strikes
Observations: 408 
Fixed-effects: airport: 24,  year: 17
Standard-errors: Clustered (airport) 
      Estimate Std. Error t value Pr(>|t|) 
treat 0.262385   0.168873 1.55374  0.12025 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1,892.0   Adj. Pseudo R2: 0.766339
           BIC:  4,030.4     Squared Cor.: 0.901086
> marginaleffects::avg_slopes(mod)

  Term Contrast Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
 treat    1 - 0     7.85       5.74 1.37    0.172 -3.41   19.1

Prediction type:  response 
Columns: type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high
  1. offset included as a fepois option:
    mod <- fixest::fepois(strikes ~ treat | airport + year, df, se = "cluster", offset = ~log_movements)
    mod
    marginaleffects::avg_slopes(mod)
> mod <- fixest::fepois(strikes ~ treat | airport + year, df, se = "cluster", offset = ~log_movements)
> mod
Poisson estimation, Dep. Var.: strikes
Observations: 408 
Offset: log_movements 
Fixed-effects: airport: 24,  year: 17
Standard-errors: Clustered (airport) 
      Estimate Std. Error t value Pr(>|t|) 
treat 0.252704   0.202431 1.24834   0.2119 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -2,038.2   Adj. Pseudo R2: 0.748657
           BIC:  4,322.8     Squared Cor.: 0.885169
> marginaleffects::avg_slopes(mod)
Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and return a
  single numeric value or a numeric vector of length 0, with no missing value.
In addition: Warning message:
In formals(fun) : argument is not a function
  1. offset() included in model formula:
    mod <- fixest::fepois(strikes ~ treat + offset(log_movements) | airport + year, df, se = "cluster")
    mod
    marginaleffects::avg_slopes(mod)
> mod <- fixest::fepois(strikes ~ treat + offset(log_movements) | airport + year, df, se = "cluster")
> mod
Poisson estimation, Dep. Var.: strikes
Observations: 408 
Offset: log_movements 
Fixed-effects: airport: 24,  year: 17
Standard-errors: Clustered (airport) 
      Estimate Std. Error t value Pr(>|t|) 
treat 0.252704   0.202431 1.24834   0.2119 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -2,038.2   Adj. Pseudo R2: 0.748657
           BIC:  4,322.8     Squared Cor.: 0.885169
> marginaleffects::avg_slopes(mod)

          Term Contrast Estimate Std. Error   z Pr(>|z|)     2.5 %   97.5 %
 log_movements    dY/dX 0.00e+00         NA  NA       NA        NA       NA
 treat            1 - 0 9.11e-05   8.26e-05 1.1     0.27 -7.07e-05 0.000253

Prediction type:  response 
Columns: type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high 

Conclusions

############### Coefficient Estimate and Marginal Effects via FE Poisson with Offset ###############

  Estimate Std. Error t value Pr(>|t|) 

treat 0.252704 0.202431 1.24834 0.2119

      Term Contrast Estimate Std. Error   z Pr(>|z|)     2.5 %   97.5 %

log_movements dY/dX 0.00e+00 NA NA NA NA NA treat 1 - 0 9.11e-05 8.26e-05 1.1 0.27 -7.07e-05 0.000253


Does anybody know why the following models

fixest::fepois(strikes ~ treat | airport + year, df, se = "cluster") fixest::fepois(strikes ~ treat + offset(log_movements) | airport + year, df, se = "cluster")


produce nearly identical coefficient estimates for `treat` but marginal effects that are so different?

I have to admit, I'm not sure exactly what is going on with the `offset()` in the calculation of the marginal effects. Maybe I've missed an important step? Of course, the TWFE approach allows one to employ the coefficient estimate on `treat` as ATT without calculating marginal effects (notwithstanding the TWFE research that's been produced over the last 5 years). However, this shows that calculating the ATT via Wooldridge (2022) and `etwfe` might not be so straightforward when employing an offset term for exposure.

All of that being said, a practical solution to my problem is to simply model wildlife strike counts and control for aircraft movements by including it as a traditional covariate. Alternatively, I could calculate wildlife strike rates--i.e., strikes divided by aircraft movements--and use that as my dependent variable in a linear model. Even so, it would be helpful to figure out what's going on here... 🤷‍♂️ 
mariofiorini commented 1 year ago

Hi @grantmcdermott , I installed marginaleffects dev version following @vincentarelbundock feedback on Error when using marginaleffects with fepois+offset. Now

data(trade)
gravity_pois_offset = fepois(data = trade, Euros ~ 1 | Origin , offset = ~log(dist_km))
marginaleffects::slopes(gravity_pois_offset)

works. However,

data("mpdta", package = "did")
mpdta$emp = exp(mpdta$lemp)

etwfe(
  emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
  family = "poisson", offset = ~log(lpop)
) |>
  emfx()

Still returns an error

The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).
Error: The function supplied to the `transform_pre` argument must accept two numeric vectors
  of predicted probabilities of length 0, and return a single numeric value or a numeric
  vector of length 0, with no missing value.
In addition: Warning message:
In formals(fun) : argument is not a function

Almost as if it was using the older version, since this is the same error in Error when using marginaleffects with fepois+offset. I am probably missing something obvious. In terms of loaded packages I have

marginaleffects_0.10.0.9012 etwfe_0.2.0 Cheers, Mario

P.S. Again, without offset it works

etwfe(
    emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
    family = "poisson"
  ) |>
    emfx()

The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

    Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE    -28.9       20.6 -1.41     0.16 -69.2   11.4

Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo 
vincentarelbundock commented 1 year ago

I’m not exactly sure what the solution is, but I investigated a bit and found a few things.

Important: This uses the Github version of marginaleffects.

library(did)
library(etwfe)
library(fixest)
library(marginaleffects)

data("mpdta", package = "did")
mpdta$emp <- exp(mpdta$lemp)

mod <- etwfe(
  emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
  family = "poisson", offset = ~log(lpop))

Internally, marginaleffects uses the get_modeldata() function to extract the fit data. Unfortunately, we don’t get the offset column, which means that fixest breaks when trying to make predictions:

marginaleffects:::get_modeldata(mod) |> head()
#      emp .Dtreat year first.treat
# 866 4729   FALSE 2003        2007
# 841 4175   FALSE 2004        2007
# 842 4189   FALSE 2005        2007
# 819 4351   FALSE 2006        2007
# 827 4853    TRUE 2007        2007
# 937  148   FALSE 2003        2007

predict(mod, newdata = marginaleffects:::get_modeldata(mod))
# Error in predict.fixest(mod, newdata = marginaleffects:::get_modeldata(mod)): 
# Predict can't be applied to this estimation because the offset (..2) cannot be evaluated for the new data. 
# Use a formula for the offset in the first estimation to avoid this.

The reason why we don’t get the offset column, is that insight::find_offset() does not return useful info:

insight::find_offset(mod)
# [1] "..2"

I suspect that the ..2 comes from etwfe, but I’m not sure how to ensure that etwfe passes along all the required columns to make predictions on this kind of model.

grantmcdermott commented 1 year ago

Thanks all, bug confirmed.

Interesting it picks up the correct offset directly from the model object.

> mod$offset |> head()
[1] 1.774403 1.774403 1.774403 1.774403 1.774403 0.803067

But not when passing through the call environment.

> mod$call$offset |> head()
Error in x[seq_len(n)] : object of type 'symbol' is not subsettable
> mod$call$offset
..2

(The ..<something> notation is referencing a fixest macro, which is how it reconciles fomulas in that call environment.)

I don't have time to look right now, but I'll try to troubleshoot and fix later this evening.

P.S. @mariofiorini It won't fix your issue outright, but I see you're using 0.2.0 of etwfe. Please install the latest version 0.3.1 from CRAN, since it aligns some new features with the latest marginaleffects release.

mariofiorini commented 1 year ago

thanks @grantmcdermott . Will update etwfe. Hope fixing it is not too much work.

etiennebacher commented 1 year ago

In this case, the offset name can be obtained with mod$model_info$offset:

library(did)
library(etwfe)
library(fixest)
library(marginaleffects)

data("mpdta", package = "did")
mpdta$emp <- exp(mpdta$lemp)

mod <- etwfe(
    emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
    family = "poisson", offset = ~log(lpop))
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

mod$model_info$offset
#> [1] "log(lpop)"

Not sure if this is the case for all fixest objects though, just putting it here if it helps. And this should probably be fixed in insight.

vincentarelbundock commented 1 year ago

Thanks @etiennebacher, this is helpful.

I just pushed a commit to insight which now allows this:

library(did)
library(etwfe)
library(fixest)

packageVersion("insight")
# [1] '0.19.0.9'

data("mpdta", package = "did")
mpdta$emp <- exp(mpdta$lemp)

mod <- feols(lemp ~ 1 | year, offset = ~lpop, data = mpdta)
insight::find_offset(mod)
# [1] "lpop"

mod <- etwfe(
    emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
    family = "poisson", offset = ~log(lpop))
insight::find_offset(mod)
# [1] "lpop"

However, note that we still can’t extract the offset data itself because etwfe modifies the mpdta dataset under the hood before it fits the model, and doesn’t appear to retain the offset column. Maybe etwfe can use (the dev version of) insight to figure out what the offset is called and retain it in the data frame it feeds to fixest.

insight::get_data(mod, additional_variables = TRUE) |> head()
#      emp .Dtreat year first.treat
# 866 4729   FALSE 2003        2007
# 841 4175   FALSE 2004        2007
# 842 4189   FALSE 2005        2007
# 819 4351   FALSE 2006        2007
# 827 4853    TRUE 2007        2007
# 937  148   FALSE 2003        2007
grantmcdermott commented 1 year ago

Hi all,

Sorry for the slow response on my side. Day job and sick kids keeping me very busy. Needless to say, I appreciate everyone weighing in during my absence.

I think @vincentarelbundock has the gist of it and I can add a catch to the emfx() internals to pull out the offset if a user provides it. But I think this might worth a patch higher up (i.e., either marginaleffects or insight). Consider the following "manual" implementation of ETWFE. Note that I'm using regular OLS to highlight that the problem isn't related to the model family being Poisson. UPDATE: Oops, sorry mis-copied the offset in my haste last night. It should be lpop rather than log(lpop). With this correction, the manual implementation passes through correctly. Updated code below.

library(fixest)
library(marginaleffects)

data("mpdta", package = "did")

mpdta2 = mpdta |>
  transform(
    .Dtreat = year >= first.treat & first.treat != 0
  )

## Linear with offset --

mod_lin = feols(
  lemp ~ .Dtreat:i(first.treat, i.year, ref = 0, ref2 = 2003) |
    first.treat + year,
  data = mpdta2,
  vcov = ~countyreal,
  offset = ~lpop
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

avg_slopes(mod_lin, variable = ".Dtreat")
#> 
#>     Term     Contrast Estimate Std. Error     z Pr(>|z|)    2.5 %   97.5 %
#>  .Dtreat TRUE - FALSE -0.00555    0.00154 -3.59   <0.001 -0.00858 -0.00253
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high

## Poisson with offset --

mod_pois = fepois(
  lemp ~ .Dtreat:i(first.treat, i.year, ref = 0, ref2 = 2003) |
    first.treat + year,
  data = mpdta2,
  vcov = ~countyreal,
  offset = ~lpop
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

avg_slopes(mod_pois, variable = ".Dtreat")
#> 
#>     Term     Contrast Estimate Std. Error    z Pr(>|z|)    2.5 %   97.5 %
#>  .Dtreat TRUE - FALSE -0.00573    0.00163 -3.5   <0.001 -0.00893 -0.00252
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high

Created on 2023-03-12 with reprex v2.0.2

P.S. I'm using the dev version of both marginaleffects and insight.

sessionInfo ```r sessionInfo() #> R version 4.2.2 (2022-10-31) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Arch Linux #> #> Matrix products: default #> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.21.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] marginaleffects_0.11.0.9000 fixest_0.11.1 #> #> loaded via a namespace (and not attached): #> [1] Rcpp_1.0.10 dreamerr_1.2.3 compiler_4.2.2 #> [4] R.methodsS3_1.8.2 R.utils_2.12.2 tools_4.2.2 #> [7] digest_0.6.31 evaluate_0.20 lifecycle_1.0.3 #> [10] nlme_3.1-160 checkmate_2.1.0 R.cache_0.16.0 #> [13] lattice_0.20-45 rlang_1.0.6 reprex_2.0.2 #> [16] cli_3.6.0 rstudioapi_0.14 yaml_2.3.7 #> [19] xfun_0.36 fastmap_1.1.0 withr_2.5.0 #> [22] styler_1.9.0 knitr_1.42 generics_0.1.3 #> [25] fs_1.6.0 vctrs_0.5.2 grid_4.2.2 #> [28] glue_1.6.2 data.table_1.14.9 rmarkdown_2.20 #> [31] Formula_1.2-5 purrr_1.0.1 magrittr_2.0.3 #> [34] backports_1.4.1 htmltools_0.5.4 insight_0.19.0.9 #> [37] numDeriv_2016.8-1.1 sandwich_3.0-2 zoo_1.8-11 #> [40] R.oo_1.25.0 ```

Created on 2023-03-11 with reprex v2.0.2

vincentarelbundock commented 1 year ago

Ha! I just pushed a fix to insight that allows ~log(pop). Glad to see it working now. But when I try with etwfe, insight finds the column in the get_data() output:

library(etwfe)
data("mpdta", package = "did")
mpdta$emp <- exp(mpdta$lemp)
mod <- etwfe(
    emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
    family = "poisson", offset = ~log(lpop))

insight::find_offset(mod)
# [1] "lpop"

insight::find_variables(mod)
# $response
# [1] "emp"
# 
# $conditional
# [1] ".Dtreat" "year"   
# 
# $cluster
# [1] "first.treat" "year"

insight::get_data(mod, additional_variables = TRUE) |> head()
# Warning: Could not recover model data from environment. Please make sure your
#   data is available in your workspace.
#   Trying to retrieve data from the model frame now.
#      emp .Dtreat year first.treat
# 866 4729   FALSE 2003        2007
# 841 4175   FALSE 2004        2007
# 842 4189   FALSE 2005        2007
# 819 4351   FALSE 2006        2007
# 827 4853    TRUE 2007        2007
# 937  148   FALSE 2003        2007
grantmcdermott commented 1 year ago

Thanks @vincentarelbundock, really appreciate you taking a look on the backend for me :-)

So, this is proving trickier to debug than I first thought. I'm now thinking that the problem isn't with emfx() per se, but rather that the etwfe/fixest call environment is masked from marginaleffects in general. Here's a simple example. Key takeaway is at the end; we can't run avg_slopes(mod) even though it is (nominally) just a fixest object.

library(etwfe)
library(fixest)
library(marginaleffects)

data("mpdta", package = "did")

## etwfe version
mod = etwfe(
  lemp ~ 1, tvar=year, gvar=first.treat, data=mpdta,
  vcov = ~countyreal,
  offset = ~lpop
) 

## "manual" version
mod2 = feols(
  lemp ~ .Dtreat:i(first.treat, i.year, ref = 0, ref2 = 2003) |
    first.treat + year,
  data = mpdta |>
    transform(
      .Dtreat = year >= first.treat & first.treat != 0
    ),
  vcov = ~countyreal,
  offset = ~lpop
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

## same coef result for both models
all.equal(mod$coeftable, mod2$coeftable)
#> [1] TRUE

## not true for the call
all.equal(mod$call, mod2$call)
#> [1] "target, current do not match when deparsed"

## avg_slopes() only works directly on the manual version...

avg_slopes(mod, variable = ".Dtreat")
#> Warning in formals(fun): argument is not a function
#> Error: The function supplied to the `comparison` argument must accept two
#>   numeric vectors of predicted probabilities of length 0, and return a
#>   single numeric value or a numeric vector of length 0, with no missing
#>   value.

avg_slopes(mod2, variable = ".Dtreat")
#> 
#>     Term     Contrast Estimate Std. Error     z Pr(>|z|)    2.5 %   97.5 %
#>  .Dtreat TRUE - FALSE -0.00555    0.00154 -3.59   <0.001 -0.00858 -0.00253
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high

Created on 2023-03-12 with reprex v2.0.2

vincentarelbundock commented 1 year ago

A hack might be to store the data as an attribute, and then emfx would pass the data explicitly to the newdata argument in slopes. We only need access to the environment if we have to guess where the data is, not when the user feeds it to us explicitly.

grantmcdermott commented 1 year ago

A hack might be to store the data as an attribute, and then emfx would pass the data explicitly to the newdata argument in slopes. We only need access to the environment if we have to guess where the data is, not when the user feeds it to us explicitly.

True. I've resisted doing this because I don't want to create large model objects. But maybe it isn't a practical constraint.

Another option might be passing envir = mod$call_env through ... in avg_slopes() and co. (With an implicit assumption that this gets passed further up to insight::get_data?) I have to start prepping dinner, but will mull on this a bit more and see if I have a reasonable implementation for a PR.

grantmcdermott commented 1 year ago

Hi all, I just pushed a commit to this repo (https://github.com/grantmcdermott/etwfe/commit/f095ae57670f271e8f7f272d8f74166821baf1b4) as well as a PR to the insight repo (https://github.com/easystats/insight/pull/759) that fixes the offset error.

In order to test, please make sure that you grab the development version of both packages.

install.packages(
   c("etwfe", "insight"), 
   repos = c("https://grantmcdermott.r-universe.dev", "https://easystats.r-universe.dev")
)

Here's a MWE example with these latest updates.

library(etwfe)

data("mpdta", package = "did")
mpdta$emp = exp(mpdta$lemp)

omod = etwfe(
  lemp ~ 1, tvar=year, gvar=first.treat, data=mpdta,
  vcov = ~countyreal,, family = "poisson",
  offset = ~lpop
) 
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and three others have been removed because of collinearity (see $collin.var).

omod |> emfx()
#> 
#>     Term                 Contrast .Dtreat Estimate Std. Error    z Pr(>|z|)
#>  .Dtreat mean(TRUE) - mean(FALSE)    TRUE  -0.0492      0.014 -3.5   <0.001
#>    2.5 %  97.5 %
#>  -0.0767 -0.0217
#> 
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

Created on 2023-04-29 with reprex v2.0.2

Note for posterity: I don't think that adding lpop as an offset to this particular model will affect the response variable because lpop is constant within the level of the fixed effects (in particular, first.treat). In other words, you will get the same result with and without this offset. I'd appreciate it if someone could test with their own data and report back.

grantmcdermott commented 1 year ago

@mariofiorini @levialtringer Do you two mind testing the updated offset behaviour with your own datasets? (As per my previous comment, just make sure to grab the dev version of both insight and etwfe.)

I'd like to submit an update to CRAN as soon as possible, partly because of an unrelated bug fix for heterogeneous ATTs. So it would be good to confirm that this offset behaviour is working too.

mariofiorini commented 1 year ago

Hi @grantmcdermott , I can confirm that the update seems to have fixed the problem. FYI, I have also tested it on simple simulated data against the imputation method of Borusyak, Jaravel, and Spiess "Revisiting Event Study Designs: Robust and Efficient Estimation" (R&R Review of Economic Studies) and they are numerically equivalent. The advantage of etwfe is that you get the standard error directly, especially in the non-linear case. Happy to make an R markdown version of the comparison if you think it's useful.

grantmcdermott commented 1 year ago

Huzzah, thanks for confirming @mariofiorini. I can't promise that I'll include the Rmd as a formal vignette (though I might). But I'd love to see it all the same. Please ping me when it's ready.

grantmcdermott commented 1 year ago

P.S. These updates are in v0.3.2 of etwfe, which has just been released on CRAN.

(Not sure when insight will be updated, but they adopt a pretty frequent release cycle. So my guess is soon.)