sergiocorreia / reghdfe

Linear, IV and GMM Regressions With Any Number of Fixed Effects
http://scorreia.com/software/reghdfe/
MIT License
219 stars 57 forks source link

Constant in model can affect standard error estimation for normal coefficients #225

Closed joachim-gassen closed 3 years ago

joachim-gassen commented 3 years ago

Hi there:

First of all: Thank you for this amazing package! As you are pretty well aware it has become the workhorse for panel model estimation for many people. This is also why I have been working on a blog post about how its standard errors can be reconciled with some of the prominent R packages in this field (fixest, lfe, sandwich, plm).

Doing so I noticed that the twoway clustered standard errors of reghdfe for fixed effect panel models differ at times from the ones of fixest and lfe, even if samples are completely balanced and the same degree of freedom adjustments are applied. Most of the time (like in the example below) the differences are very minor but they can also become sizable on small samples (I encountered more than 10 %).

Here is my guess why it happens: You are by default estimating a constant intercept with standard error even when fixed effects are present in the model. At times the resulting VCE becomes non-positive semi-definite, causing the application of the Cameron, Gelbach & Miller (2011, p. 241 f.) fix while it would not been necessary to apply this fix if all the fixed effects would have been projected out. This causes changes also in the SE of the independent variable of interest. When I estimate your model with the undocumented noconstant option, then the SE are identical with the ones of {fixest} and {lfe} for balanced panels and regular fixed effects (with the range of numerical precision). Am I making sense here? Thank you for pointing out anything that I might have missed.

Assuming that my hunch is correct: I do not see why reporting a constant for fixed effect models has become the default, in particular as it is having this side effect. In any case it would have helped me to see a warning or something similar in the documentation.

See below for an example based on the Petersen data set. You can see a somewhat more complete discussion in my blog post in case you are interested.

Thank you again for your great work!

Joachim

. update all
(contacting http://www.stata.com)

Update status
    Last check for updates:  01 Mar 2021
    New update available:    none         (as of 01 Mar 2021)
    Current update level:    03 Feb 2020  (what's new)

Possible actions

    Do nothing; all files are up to date.

. which reghdfe
/Users/joachim/Library/Application Support/Stata/ado/plus/r/reghdfe.ado
*! version 5.9.0 03jun2020

. webuse set https://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/
(prefix now "https://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se")

. webuse test_data.dta, clear

. reghdfe y x, absorb(firmid year) cluster(firmid year)
(MWFE estimator converged in 2 iterations)
Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller 
> applied.

HDFE Linear regression                            Number of obs   =      5,000
Absorbing 2 HDFE groups                           F(   1,      9) =    1068.27
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.6506
                                                  Adj R-squared   =     0.6109
Number of clusters (firmid)  =        500         Within R-sq.    =     0.1913
Number of clusters (year)    =         10         Root MSE        =     1.4051

                           (Std. Err. adjusted for 10 clusters in firmid year)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9700493   .0296793    32.68   0.000       .90291    1.037189
       _cons |   .0300277   .0001965   152.79   0.000     .0295831    .0304723
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
      firmid |       500         500           0    *|
        year |        10          10           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. estat vce

Covariance matrix of coefficients of reghdfe model

        e(V) |          x       _cons 
-------------+------------------------
           x |  .00088086             
       _cons |  5.833e-06   3.862e-08 

. reghdfe y x, absorb(firmid year) cluster(firmid year) noconstant
(MWFE estimator converged in 2 iterations)

HDFE Linear regression                            Number of obs   =      5,000
Absorbing 2 HDFE groups                           F(   1,      9) =    1068.29
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.6506
                                                  Adj R-squared   =     0.6109
Number of clusters (firmid)  =        500         Within R-sq.    =     0.1913
Number of clusters (year)    =         10         Root MSE        =     1.4051

                           (Std. Err. adjusted for 10 clusters in firmid year)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9700493    .029679    32.68   0.000     .9029107    1.037188
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
      firmid |       500         500           0    *|
        year |        10          10           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. estat vce

Covariance matrix of coefficients of reghdfe model

        e(V) |          x 
-------------+------------
           x |  .00088084 
sergiocorreia commented 3 years ago

Hi Joachim,

Thanks a lot for your PR! Also, your blog post was extremely useful in understanding the problem.

Why it does is beyond me, given that this constant cannot be interpreted in a meaningful way without diving into the internals of the fixed effect structure.

For several years I resisted adding _cons as I thought it was a completely useless piece of information, and just complicated my code. But the margins command didn't work without it, so eventually I added it. According to readme.md:

  • version 5.0 29jun2018:
    • Added support for margins postestimation command.
    • Added _cons row to output table, so the intercept is reported (as in regress/xtreg/areg). The noconstant option disables this, but doing so might make the output of margin incorrect.

As I mention in the help file, this should be a purely aesthetic option, but it turns out it might not be under multi-way clustering if the variance matrices end up not being positive-semidefinite (which is more likely in small sample sizes).

Assuming that my hunch is correct: I do not see why reporting a constant for fixed effect models has become the default, in particular as it is having this side effect. In any case it would have helped me to see a warning or something similar in the documentation.

Agree. To fix this, I removed the side effect and now it shouldn't matter to the results if you run with noconstant or not.

In particular, the mata reghdfe_fix_psd() function will only work on the submatrix that excludes _cons. This means that the entire matrix might not be positive-semidefinite due to the _cons row/col. However, we shouldn't care about this, as this should only matter if you are running a test that involves a regressor together with cons. For instance:

reghdfe price weight length
test weight = length // this is correct
test weight = _cons // this might be problematic

One last thing: I'm in the middle of a major update to reghdfe, so the fix was pushed into the reghdfe6 branch instead of the master branch. Could you check it out to see if things now work on your side? To install you need to uninstall+reinstall ftools and reghdfe as follows:

* Install ftools (remove program if it existed previously)
cap ado uninstall ftools
net install ftools, from("https://raw.githubusercontent.com/sergiocorreia/ftools/groupreg/src/")

* Install reghdfe 6.x
cap ado uninstall reghdfe
net install reghdfe, from("https://raw.githubusercontent.com/sergiocorreia/reghdfe/reghdfe6/src/")

Many thanks again! Sergio

sergiocorreia commented 3 years ago

Also, as reference, this statalist thread goes into a bit more detail into why margins needs _b[_cons]:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1450935-how-to-add-support-for-margins-in-a-user-written-command?p=1451168#post1451168

joachim-gassen commented 3 years ago

Hi Sergio:

Unbelievable! Thank you for coming back to me so quickly. I just verified that your fix is perfectly addressing the problem. See:

reghdfe_fixest_fixed

These are differences between the R fixest package and reghdfe for the new version version 6.0.5 01Mar2021. All within numerical precision range. Super! Thanks for pointing out that the margins command needs the constant to be there. This makes sense.

One minor thing (not sure whether it matters): The table that Stata returns after running reghdfe now does not report a standard error for the constant anymore. This makes sense to me as the standard error for the constant term cannot be calculated without running into the problems outlined above but it might confuse people. Is the SE needed for the margin command or does this simply operate on _b[]? Also, I tested test x = _cons and this returns an F-test although I am not entirely sure how it can do that given that there is no V.

Thank you again for responding so quickly. Let me know when there is something else I can do to help.

Cheers!

Joachim

sergiocorreia commented 3 years ago

Hi Joachim,

Glad it works on your side!

These are differences between the R fixest package and reghdfe for the new version version 6.0.5 01Mar2021. All within numerical precision range. Super! Thanks for pointing out that the margins command needs the constant to be there. This makes sense.

I wonder if these differences would go away if you increase the tolerance, with e.g. "tol(1e-14)" instead of the default of 1e-8 (and same for fixest). Otherwise, it might help to pinpoint other discrepancies

One minor thing (not sure whether it matters): The table that Stata returns after running reghdfe now does not report a standard error for the constant anymore. This makes sense to me as the standard error for the constant term cannot be calculated without running into the problems outlined above but it might confuse people. Is the SE needed for the margin command or does this simply operate on _b[]? Also, I tested test x = _cons and this returns an F-test although I am not entirely sure how it can do that given that there is no V.

Great catch. I just added an update that does the update in two steps:

  1. Fix the non-positive-semidefiniteness of V (e.g. V = fix_psd(V))
  2. Do the same fix, but for the submatrix of V that excludes the _cons column/row (i.e. the minor of the matrix excluding the last row and last column), and replace this fix inside V (e.g. V[1..K, 1..K] = fix_psd(backup_V[1..K, 1..K]))

This means that the results "where it matters" (regressors other than _cons) are the same, but we are still able to see SEs for _cons. This seems to work and not confuse readers, but I'm still a bit on the fence on how "correct" this approach is

joachim-gassen commented 3 years ago

Hi Sergio:

Thank you and apologies for the delay in coming back to you. I checked the results for tol(2.220446e-12) which is the minimum amount that the fixed.tol parameter allows for my machine. The results look very similar (differences within +-5e-9 with some slightly larger values for small samples). For me this seems consistent with R using 64bit floats and the handshake between R and Stata causing additional noise. In other words: Good enough for me ;-)

For what it is worth, I like the approach that you took. While it generates two standard errors for one model that depend on different Vs, it avoids the unnecessary psd fix for the main coefficients and makes margin() happy. Of course an alternative would be not to report a constant by default and instead to require a constant option for situations wehere users want to use margins with a consistent V but this is a mere design choice.

Thank you again for providing feedback so quickly and for maintaining this super important package so well!

Joachim

sergiocorreia commented 3 years ago

No problem at all, and Im very happy that the program is being validated externally, and that so many options now exist in R/Py/Julia.