Enhancement: customize coefficients to aggregate

mikenguyen13 commented 1 year ago

Hi, I am running stacked DID using feols. I am interested in aggregating my coefficients. However, the aggregate function is only for sunab. I was wondering if you could add a way for users to customize the names of the coefficients to be aggregated. Say in my case, I just want rel_year_1, rel_year_2, rel_year_3.

stacked <- feols(c(log1p_n_filing, log1p_n_publication)~`rel_year_-3` + `rel_year_-2` + rel_year_0 + rel_year_1 + rel_year_2 + rel_year_3|company_name + df + year,vcov = ~ company_name + df,data = stacked_data)

and here is my result


OLS estimation, Dep. Var.: log1p_n_filing
Observations: 338,079 
Fixed-effects: country: 15,  industry_category: 22,  company_name: 5,818,  df: 16,  year: 18
Standard-errors: Clustered (company_name & df) 
               Estimate Std. Error   t value Pr(>|t|)    
`rel_year_-3` -0.000119   0.000636 -0.186431 0.854605    
`rel_year_-2` -0.000177   0.000558 -0.316697 0.755838    
rel_year_0     0.000844   0.000832  1.014279 0.326532    
rel_year_1     0.002375   0.000933  2.545001 0.022419 *  
rel_year_2     0.001403   0.001123  1.248830 0.230866    
rel_year_3     0.001398   0.000964  1.450110 0.167615    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.03212     Adj. R2: 0.824622
                Within R2: 8.454e-5`

kylebutts commented 1 year ago

Hi @mikenguyen13, two options here:

You can pool them in estimation (rel_year_1_2_or_3 or some better name).
You can use the delta method to get standard errors on the average terms, see https://github.com/vincentarelbundock/marginaleffects for a very simple way

marginaleffects::hypotheses(stacked, "mean(c(rel_year_1, rel_year_2, rel_year_3))")

mikenguyen13 commented 1 year ago

Hi @kylebutts , Thank you for the pointer. I am trying the second option, but I don't think I can specify the mean inside the hypotheses function. I tried to do it with my code. And I received this error:

``> marginaleffects::hypotheses(stacked, "mean(c(rel_year_1, rel_year_2, rel_year_3))") Error: Assertion failed. One of the following must apply:

checkmate::check_string(hypothesis): Must comply to pattern '='
checkmate::check_choice(hypothesis): Must be element of set
{'pairwise','reference','sequential','revpairwise','revreference','revsequential'}, but is
'mean(c(rel_year_1, rel_year_2, rel_year_3))'
checkmate::check_numeric(hypothesis): Must be of type 'numeric', not 'character'
checkmate::check_matrix(hypothesis): Must be of type 'matrix', not 'character'
checkmate::check_null(hypothesis): Must be NULL``

Then, I tried to do it with the fixest package example, and I received the same error:

``> res_sa20 = feols(y ~ x1 + sunab(year_treated, year) | id + year, base_stagg)

marginaleffects::hypotheses(res_sa20, "mean(c(year::-9 , year::-8)") Error: Assertion failed. One of the following must apply:

checkmate::check_string(hypothesis): Must comply to pattern '='

checkmate::check_choice(hypothesis): Must be element of set

{'pairwise','reference','sequential','revpairwise','revreference','revsequential'}, but is

'mean(c(year::-9 , year::-8)'

checkmate::check_numeric(hypothesis): Must be of type 'numeric', not 'character'

checkmate::check_matrix(hypothesis): Must be of type 'matrix', not 'character'

checkmate::check_null(hypothesis): Must be NULL``

From the marginaleffects package, I saw that I could only specify contrast or comparison. I couldn't find where I could specify the average effect. I hope you can provide some suggestions here.

kylebutts commented 1 year ago

Sorry, I was off on my syntax on specifying the null. You need a = 0 in there. Here's a reprex:

library(fixest)
library(marginaleffects)
es = feols(
  y ~ i(time_to_treatment, ref = c(-1, -1000)) | id + year, 
  base_stagg
)
hypotheses(
  es, 
  hypothesis = "mean(c(`time_to_treatment::1`, `time_to_treatment::2`, `time_to_treatment::3`)) = 0"
)
#> 
#>                                                                                 Term
#>  mean(c(`time_to_treatment::1`, `time_to_treatment::2`, `time_to_treatment::3`)) = 0
#>  Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
#>     -2.08      0.484 -4.31   <0.001 15.9 -3.03  -1.14
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

^{Created on 2023-08-09 with reprex v2.0.2}

mikenguyen13 commented 1 year ago

Thank you so much. It works perfectly. However, I have another question. From my understanding, the aggregate and hypotheses functions should give identical results. And from the fixest package documentation, the aggregate function does give the estimate and compare with the true effect. But when I have a sanity check with the hypotheses function by taking the mean of the treated period coefficients, I have a different result of 0.33, when in fact, the att should be - 1. I hope you can help if I misunderstand anything.

I use your exact code as above and add the one below

aggregate(es, c("att" = "treatment::[^-]"))
# Estimate Std. Error   t value    Pr(>|t|)
# att -1.362488  0.4981501 -2.735096 0.007454516
with(base_stagg, mean(treatment_effect_true[time_to_treatment >= 0]))
# [1] -1
hypotheses(
  es, 
  hypothesis = "mean(c(
  `time_to_treatment::0`,
  `time_to_treatment::1`, `time_to_treatment::2`, `time_to_treatment::3`, `time_to_treatment::4`, `time_to_treatment::5`, `time_to_treatment::6`, `time_to_treatment::7`, `time_to_treatment::8`
  )) = 0"
)
# Term
# mean(c(\n  `time_to_treatment::0`,\n  `time_to_treatment::1`, `time_to_treatment::2`, `time_to_treatment::3`, `time_to_treatment::4`, `time_to_treatment::5`, `time_to_treatment::6`, `time_to_treatment::7`, `time_to_treatment::8`\n  )) = 0
# Estimate Std. Error     z Pr(>|z|)   S  2.5 % 97.5 %
#   0.331      0.558 0.594    0.552 0.9 -0.762   1.42
# 
# Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

lrberge commented 1 year ago

(I'm on vacations and ready to hike so I have to be brief [sorry I'm in Europe 😉].)

1) Thanks a lot Kyle for chiming in! 2) aggregate is different from hypotheses. It's done a la Sun & Abraham and takes into account the population distribution for the timings. Not the same in H.

mikenguyen13 commented 1 year ago

Thank you so much for your swift response. Since "aggregate" is different from "hypotheses," going back to my original question: do you have any suggestions for an approach to be able to specify which coefficients to aggregate?

lrberge commented 1 year ago

Sure, summary has an 'agg' argument, so you just need to write the "right" one (note to self: I should add the same argument to feols). The syntax is a bit "weird" at first sight b/c it uses regular expressions. In your example it would be:

summary(stacked, agg = c("att" = "rel_year_[0123]"))

Can't give a reprex sorry, I've no computer! But I hope it works.

mikenguyen13 commented 1 year ago

Yes, that works wonders. Thank you so much!

kylebutts commented 1 year ago

For the sake of writing it down, the mean itself of rel_year_1, rel_year_2, and rel_year_3 is -0.33, but your testing the difference between 0 and -0.33 which is 0.33. So aggregate is still different (as @lrberge pointed out), but it's not that different :-)

lrberge / fixest

Enhancement: customize coefficients to aggregate #434