[BUG] Different SE's on main regressors using time series operator in absorb() vs as "regular" control in main line

hieronymusBusch commented 3 months ago

Stata version: 18.0 16jul2024
OS: Windows 10

Hi, this is my first report so I am sorry for any mistakes / misunderstandings.

We (Daniele Girardi & I) are building on reghdfe for our lpdid (local projections difference-in-differences) command. reghdfe is a great help and speeds up our command immensely. We are grateful for your contribution to the community!

Since our outcome is first-differenced, we want to give our users some guidelines on how to enter control variables (i.e. enter them first-differenced in most cases). The most intuitive way to do this in Stata is to simply add the "D." operator to a control variable. However, this is where we noticed that the standard errors on the main regressor are sensitive to where and how such a control is specified.

Expected Behaviour:

We would expect the point estimates and standard errors on the main regressors to be the same regardless of how users enter their first-differenced control variable. Specifically, the SE on the main regressor should not depend on whether users (a) first-difference the control variable manually before running the command and then include it either in absorb() or in the main line, or (b) skip the manual step and apply the "D." operator in either place instead.

Actual Behaviour:

Standard errors differ when the "D." operator is used inside the absorb() option. Point estimates remain identical. The difference is small in the example provided here, but it was more notable in one of our empirical applications. It is too large to be explained by rounding.

Output / Example:

I report the standard error on the main variable "treat" using different specifications. In this example, we add linear time trends to the diff-in-diff estimation. First-differencing a group-specific linear time trend is equivalent to adding group FE.
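To spell out that equivalence, here is a short derivation. The symbol \gamma_g (the trend slope of group g) is notation introduced here, not from the report, and the step assumes time increases by exactly one unit between consecutive observations within a group, as it does in the example below.

```latex
\Delta(\gamma_g t) = \gamma_g t - \gamma_g (t - 1) = \gamma_g
```

The differenced trend is a constant within each group, which is exactly what a group fixed effect captures.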

clear *
set seed 12345
set obs 1000

gen y = uniform()
gen group = floor(_n/50)
bys group: gen time = _n
gen treat = 0
replace treat = 1 if group>10 & time>15

xtset group time

* using the Stata-provided regress command
regress D.y treat i.group i.time
regress D.y treat D.c.time#i.group i.time
*> SE = .05779

* using reghdfe mirroring regress command from above
reghdfe D.y treat , absorb(group time)
reghdfe D.y treat D.c.time#i.group , absorb(time)
*> SE = .05779 (same as above)

* entering controls with D. in absorb()
reghdfe D.y treat , absorb(D.c.time#i.group time)
*> SE = .0578218 (different to above)

NilsEnevoldsen commented 3 months ago

I don't know why, but degrees of freedom are computed slightly differently when both D.c.time#i.group and i.time are in absorb(). It feels like that should represent 68 absorbed degrees of freedom, not 69, but I'm not confident.

You can adjust e(V) to account for this.

reghdfe D.y treat D.c.time#i.group i.time, noabsorb
di e(df_m)
di e(df_a)
di e(df_r)

reghdfe D.y treat D.c.time#i.group, absorb(i.time)
di e(df_m)
di e(df_a)
di e(df_r)

reghdfe D.y treat i.time, absorb(D.c.time#i.group)
di e(df_m)
di e(df_a)
di e(df_r)

reghdfe D.y treat, absorb(D.c.time#i.group i.time)
di e(df_m)
di e(df_a)
di e(df_r)

matrix V = e(V)
matrix V = V * e(df_r) / (e(df_r)+1)
erepost V = V
estimates replay
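A brief sketch of why the rescaling factor above has that form. It assumes the default (unadjusted) VCE, where the variance estimate is proportional to 1/df_r, and that the correct residual degrees of freedom are one more than reported. RSS, X, and the "adj" subscript are notation introduced here for illustration.

```latex
\hat V = \frac{\mathrm{RSS}}{df_r}\,(X'X)^{-1},
\qquad
\hat V_{\mathrm{adj}} = \frac{\mathrm{RSS}}{df_r + 1}\,(X'X)^{-1}
                      = \hat V \cdot \frac{df_r}{df_r + 1},
\qquad
\mathrm{se}_{\mathrm{adj}} = \mathrm{se}\,\sqrt{\frac{df_r}{df_r + 1}}
```

Under these assumptions the adjusted standard error is slightly smaller than the reported one, which matches the direction of the gap between .0578218 and .05779. With robust or clustered VCEs the small-sample factor takes a different form, so this exact rescaling would not carry over unchanged.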

hieronymusBusch commented 3 months ago

Thank you for the comment! Without venturing into the code of reghdfe, my best guess is then that this behaviour originates in how Stata generates e(sample) when using time series operators (see also https://www.statalist.org/forums/forum/general-stata-discussion/general/1756480-complete-sample-sizes-and-e-sample-with-lags-and-leads).

sergiocorreia / reghdfe

[BUG] Different SE's on main regressors using time series operator in absorb() vs as "regular" control in main line #286