Open EBoisseauSierra opened 5 months ago
`cov_cluster` in `stats.sandwich_covariance` uses

```python
nobs, k_params = xu.shape
```

so it assumes the number of parameters is the same as the number of exog columns, and uses that in the small-sample correction:

```python
if use_correction:
    cov_c *= (n_groups / (n_groups - 1.) *
              ((nobs - 1.) / float(nobs - k_params)))
```
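For a sense of magnitude, a rough illustration using the numbers from the reghdfe example in this issue (the absorbed-coefficient count 30 + 1 + 12 = 43 is taken from that output):

```python
# How much the small-sample correction differs when the coefficients
# absorbed by fixed effects are not counted in k_params.
n_groups, nobs = 12, 18835   # clusters and observations from the example
k_exog = 3                   # hours, tenure, const -> what xu.shape reports
k_absorbed = 43              # coefficients absorbed by demeaning (30 + 1 + 12)

def correction(df_resid):
    return n_groups / (n_groups - 1.) * (nobs - 1.) / df_resid

print(correction(nobs - k_exog))               # ~1.0910, what cov_cluster uses
print(correction(nobs - k_exog - k_absorbed))  # ~1.0935, counting absorbed df
```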
It's a bug, kind of, given that we don't have an official way for users to adjust df, ddof (#2799).
But this might also be a bug for constrained estimation, `fit_constrained`.
(Counting params and df_xxx is still a mess, #1723 #1664; absorb/ddof adds another case for differences across definitions.)
When I checked absorb in the past, I never looked at robust cov_types. So there might be more problems than just in cluster.
This might also be a bug in models with extra params (negative binomial, zip, ...). Maybe not: `xu` should be the moment conditions in the general case, and in the exactly identified case the number of parameters is equal to the number of moment conditions.
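To spell that out: for a linear model the stacked moment conditions are the score contributions

$$g_i(\hat\beta) = x_i \hat u_i = x_i\,(y_i - x_i'\hat\beta),$$

so `xu` has one column per moment condition, and that count coincides with the number of parameters only in the exactly identified case.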
Update: `cov_white_simple` and the panel cov also use `k_params` this way for the small-sample correction.
Possible solution: define and return `df_resid` in `_get_sandwich_arrays`; use the model attribute if available, and `xu.shape` for the tuple arg.
Aside: I wasn't sure whether robust standard errors are theoretically the same with and without absorb.
Proofs of equivalence for the main cov_types are in Ding, Peng. 2021. "The Frisch–Waugh–Lovell Theorem for Standard Errors." Statistics & Probability Letters 168 (January): 108945. https://doi.org/10.1016/j.spl.2020.108945.
Basu has a summary of Ding in section 6.1, including adjustments to the degrees-of-freedom corrections. However, this equivalence does not hold for variants like HC3, or for variants of cluster-robust and HAC; see section 6.2 in Basu, Deepankar. 2023. "The Yule–Frisch–Waugh–Lovell Theorem." arXiv. https://doi.org/10.48550/arXiv.2307.00369.
Thank you for your rapid feedback. I'm glad to see I was on the right track.
Sorry I didn't find #2799 — it indeed overlaps with this issue. Would you like me to close this one, then?
I can see #2799 is attached to the 0.14 milestone. Is there some timeline attached to it?
> Possible solution: define and return `df_resid` in `_get_sandwich_arrays`; use the model attribute if available, and `xu.shape` for the tuple arg.
I'd be happy to submit a PR with this. However, while I believe I could work my way around the codebase, I'm out of my depth on the stats theory and would still require some guidance. Is that something that you'd be interested in?
A PR should not be difficult, because fixing the df part can be done with changes only in the sandwich_covariance module,
i.e. replace all `(nobs - k_params)` by `df_resid` and get the `df_resid` as an additional return from `_get_sandwich_arrays`. Then a user modifying `df_resid` should work, and it could be treated just as a bugfix.
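Roughly, as a sketch (untested; the actual function bodies and signatures in `sandwich_covariance` differ in detail):

```python
# sketch only -- the names mirror statsmodels.stats.sandwich_covariance,
# but the real code is more involved
def _get_sandwich_arrays(results, cov_type=''):
    ...  # existing code that builds xu and hessian_inv
    # new part: prefer the (user-adjustable) model/results attribute;
    # fall back to xu.shape, which also covers a plain (xu, hess) tuple arg
    df_resid = getattr(results, 'df_resid', None)
    if df_resid is None:
        nobs, k_params = xu.shape
        df_resid = nobs - k_params
    return xu, hessian_inv, df_resid

# then in cov_cluster (and the other sandwich helpers), replace
# (nobs - k_params) by df_resid in the small-sample correction:
#     cov_c *= (n_groups / (n_groups - 1.) *
#               ((nobs - 1.) / float(df_resid)))
```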
Adding a `ddof` keyword to the regression models requires more work and would be a separate PR. I thought I already had a PR for it, but I can't find it. Also, that will need decisions about what to do in cases like HC3 or CR3, where the standard errors are not the same after absorbing.
Bug(?)
I want to update the degrees of freedom of my model, to account for fixed effects I have demeaned beforehand.
Doing so updates the df in the summary, but doesn't update the t-stats or confidence intervals as I would have expected.
Context
I am transitioning out of Stata, and would like to replicate the following OLS with 3+ fixed effects (possibly high-dimensional):
stdout
```
. reghdfe ln_wage hours tenure, absorb(age union occ_code) vce(cluster ind_code)
(dropped 1 singleton observations)
(MWFE estimator converged in 5 iterations)

HDFE Linear regression                            Number of obs   =     18,835
Absorbing 3 HDFE groups                           F(   2,     11) =      60.43
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.3413
                                                  Adj R-squared   =     0.3397
                                                  Within R-sq.    =     0.0693
Number of clusters (ind_code) =         12        Root MSE        =     0.3793

                               (Std. err. adjusted for 12 clusters in ind_code)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |  -.0006646   .0014636    -0.45   0.659    -.0038859    .0025566
      tenure |   .0276442   .0025217    10.96   0.000     .0220939    .0331944
       _cons |   1.671454   .0620259    26.95   0.000     1.534936    1.807972
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
         age |        30           0          30     |
       union |         2           1           1     |
    occ_code |        13           1          12    ?|
-----------------------------------------------------+
? = number of redundant parameters may be higher
```

I can do this by creating dummies for each fixed effect, and the results are consistent with Stata:
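Something along these lines (illustrative only; it assumes the data is loaded in a DataFrame `df`):

```python
import statsmodels.formula.api as smf

# one dummy per level of each fixed-effect dimension via C(),
# with standard errors clustered on ind_code
res_dummies = smf.ols(
    "ln_wage ~ hours + tenure + C(age) + C(union) + C(occ_code)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["ind_code"]})
print(res_dummies.summary())
```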
stdout
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                 5.347e+11
Date:                Thu, 23 May 2024   Prob (F-statistic):           2.31e-63
Time:                        15:46:42   Log-Likelihood:                -8444.3
No. Observations:               18836   AIC:                         1.698e+04
Df Residuals:                   18790   BIC:                         1.734e+04
Df Model:                          45
Covariance Type:              cluster
=======================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept         1.6619      0.038     43.558      0.000       1.578       1.746
<… dummy variable for each dimension of the fixed effects omitted …>
hours            -0.0007      0.001     -0.454      0.659      -0.004       0.003
tenure            0.0276      0.003     10.962      0.000       0.022       0.033
==============================================================================
Omnibus:                     1577.253   Durbin-Watson:                   1.040
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5415.689
Skew:                           0.400   Prob(JB):                         0.00
Kurtosis:                       5.502   Cond. No.                     2.92e+04
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
[2] The condition number is large, 2.92e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```

However, this isn't efficient for high-dimensional fixed effects.
Instead, I want to demean the variables beforehand, as `reghdfe` does. The algorithm has already been ported to Python — cf. `pyhdfe`. I understand, though, that the degrees of freedom must be updated in the OLS model to account for those absorbed in the fixed effects.
Code sample
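A minimal sketch of the workflow (the `pyhdfe` calls follow its documented API; the final df adjustment is the step in question):

```python
import pyhdfe
import statsmodels.api as sm

# demean outcome and regressors within the three fixed-effect dimensions
# (drop_singletons=False keeps rows aligned with df for the cluster groups;
#  reghdfe itself drops singletons)
algorithm = pyhdfe.create(df[["age", "union", "occ_code"]].values,
                          drop_singletons=False)
demeaned = algorithm.residualize(df[["ln_wage", "hours", "tenure"]].values)

y, X = demeaned[:, 0], sm.add_constant(demeaned[:, 1:])
res = sm.OLS(y, X).fit(cov_type="cluster",
                       cov_kwds={"groups": df["ind_code"]})

# account for the coefficients absorbed by the fixed effects
res.model.df_resid -= algorithm.degrees
res.model.df_model += algorithm.degrees
print(res.summary())
```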
stdout
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.069
Model:                            OLS   Adj. R-squared:                  0.067
Method:                 Least Squares   F-statistic:                     582.6
Date:                Thu, 23 May 2024   Prob (F-statistic):          3.65e-201
Time:                        15:59:21   Log-Likelihood:                -8444.4
No. Observations:               18835   AIC:                         1.698e+04
Df Residuals:                   18789   BIC:                         1.734e+04
Df Model:                          45
Covariance Type:              cluster
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8.687e-18      0.004   2.08e-15      1.000      -0.008       0.008
hours         -0.0007      0.001     -1.226      0.221      -0.002       0.000
tenure         0.0276      0.001     33.494      0.000       0.026       0.029
==============================================================================
Omnibus:                     1577.013   Durbin-Watson:                   1.040
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5414.228
Skew:                           0.400   Prob(JB):                         0.00
Kurtosis:                       5.502   Cond. No.                         9.15
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
```

Expected Output
Updating the degrees of freedom of the model simply modifies the reported `Df Residuals` and `Df Model`. However, the t-stats, confidence intervals, R², etc. are the same as if I hadn't adjusted them. I would expect them to be the same as when running the OLS with dummies.
System