statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

cluster robust inference, degrees of freedom #1201

Open josef-pkt opened 10 years ago

josef-pkt commented 10 years ago

This issue is specifically about degrees of freedom and `use_t`; the more general issue is #1099.

Cluster and panel robust standard errors: "Which ones are we talking about?"

Cameron, A. Colin, and Douglas L. Miller. "Robust inference with clustered data." Handbook of empirical economics and finance (2010): 1-28. working paper version http://www.econstor.eu/bitstream/10419/58373/1/635883198.pdf

Roughly: with large-number-of-clusters asymptotics (G clusters) and variables that don't vary within a cluster (L of them), the degrees of freedom is G - L. Stata uses G - 1. Using the t distribution with a small df works much better in the small-G case than the normal distribution. https://github.com/statsmodels/statsmodels/pull/1189#issuecomment-29026705
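A minimal sketch (not from the thread, numbers are made up) of why the df choice matters with few clusters: with G = 10, the t critical value at df = G - 1 is noticeably larger than the normal one, so normal-based confidence intervals are too narrow.

```python
# Illustration: t(G-1) vs. normal critical values for a 95% interval.
from scipy import stats

G = 10                                 # hypothetical number of clusters
t_crit = stats.t.ppf(0.975, df=G - 1)  # Stata-style df = G - 1
z_crit = stats.norm.ppf(0.975)

print(t_crit)  # -> ~2.262
print(z_crit)  # -> ~1.960
```

The ~15% wider critical value is exactly the small-G effect the quoted comment refers to.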

Not clear to me: inference for variables where we do have a lot of within variation. There we still have the number of observations per cluster (N_g or T) to provide information, with both N and T -> inf. See also Christian B. Hansen, "Asymptotic properties of a robust variance matrix estimator for panel data when T is large," Journal of Econometrics, Volume 141, Issue 2, December 2007, Pages 597-620, ISSN 0304-4076, http://dx.doi.org/10.1016/j.jeconom.2006.10.009. (http://www.sciencedirect.com/science/article/pii/S0304407606002089)

Under mixing conditions we don't need the full cluster-robust covariance (we are using HAC kernels); the convergence rate is sqrt(N*T).

Mentioned in Cameron/Miller: under difference-in-differences with few treated clusters, the degrees of freedom should be determined by the number of treated clusters, not by the total number of clusters. If we want to support this, then we need an extra option, because we don't know which clusters are treated.

The same question applies to all three versions of panel/cluster robust sandwiches that we have right now.

I will add Stata's df = G - 1 as the default, so I can finish up the unit tests. ivreg2 with cluster uses the normal distribution without a df correction (which I also get with use_t=False).

josef-pkt commented 10 years ago

As a reference: ivreg2 allows for two-way clustering http://www.stata.com/statalist/archive/2010-10/msg01037.html

Check also Petersen again, which was my reference for two-way cluster robust standard errors.

Update: my Stata packages were too old and didn't have the 2-way cluster option yet.

```
ssc install ivreg2, replace
ssc install ranktest, replace
```

josef-pkt commented 10 years ago

Still missing: where did I read about the CR1, CR2, CR3 small-sample corrections for cluster robust, corresponding to HC1, HC2, HC3?

josef-pkt commented 10 years ago

More options in Stata?

This doesn't match up with my nw-groupsum or with xtscc.

```
ivreg2 invest mvalue kstock, bw(5) cluster(year) small
```

> (6) bw(#) combined with cluster(varname) is allowed with either 1- or 2-level clustering if the data are panel data that are tsset on the time variable varname. Following Driscoll and Kraay (1998), the SEs and statistics reported will be robust to disturbances that are common to panel units and that are persistent, i.e., autocorrelated.

This uses df = T - 1 = 19, while xtscc uses G - 1.

Without small:

```
ivreg2 invest mvalue kstock, bw(5) cluster(year)
```

I get the same standard errors with the following, but I have df = G - 1 = 9, while ivreg2 has df = T - 1 = 19:

```python
>>> res_cl2t = res.get_robustcov_results(cov_type='nw-groupsum', time=time, maxlags=4, use_correction=False, use_t=True)
>>> res_cl2t.bse
array([  0.01343602,   0.04930801,  12.19034839])
```

Should I match up the df with use_correction? use_correction='hac' means we have a time series and use df = T - 1 = 19; use_correction='cluster' means we have G -> inf and use df = G - 1 = 9.

josef-pkt commented 10 years ago

to make it more "fun":

```
ivreg2 invest mvalue kstock, dkraay(5)
xtscc invest mvalue kstock, lag(4)
```

These report the same standard errors, but with the same difference in degrees of freedom, 19 versus 9, and the corresponding differences in the confidence intervals.

josef-pkt commented 10 years ago

Kit Baum is also not sure about Stata's choice of some df's: areg versus xtreg fe http://www.stata.com/statalist/archive/2010-03/msg00941.html

josef-pkt commented 7 years ago

I'm adding another small-sample correction reference here, including a df adjustment in the style of Satterthwaite (I guess).

Imbens, Guido W., and Michal Kolesár. 2015. “Robust Standard Errors in Small Samples: Some Practical Advice.” The Review of Economics and Statistics 98 (4): 701–12. doi:10.1162/REST_a_00552. (there has been a working paper in circulation for a while, but IIRC I didn't read it.)

It has something like CR2 or LZ, the cluster analog to HC2. Code in R, but under the MIT license: https://github.com/kolesarm/Robust-Small-Sample-Standard-Errors (the replication files for the published article don't have a license.)
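As a hedged sketch of what such a CR2 correction looks like (this is not the R package's code nor statsmodels code; function names and data are made up): each cluster's residuals are rescaled by (I - H_gg)^{-1/2}, the cluster analog of the HC2 leverage adjustment, before building the sandwich.

```python
# Illustrative CR2 cluster-robust covariance for OLS.
# H_gg = X_g (X'X)^{-1} X_g' is the cluster block of the hat matrix.
import numpy as np

def cr2_cov(X, resid, groups):
    """CR2 sandwich estimator (illustration only)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        Xg, ug = X[idx], resid[idx]
        Hgg = Xg @ XtX_inv @ Xg.T            # cluster block of hat matrix
        # symmetric inverse square root of (I - H_gg) via eigendecomposition
        w, V = np.linalg.eigh(np.eye(idx.sum()) - Hgg)
        w = np.clip(w, 1e-10, None)          # guard against tiny eigenvalues
        Ag = V @ np.diag(w ** -0.5) @ V.T
        sg = Xg.T @ (Ag @ ug)                # leverage-adjusted score
        meat += np.outer(sg, sg)
    return XtX_inv @ meat @ XtX_inv

# usage on synthetic data: 8 clusters of 5 observations
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 5)
X = np.column_stack([np.ones(40), rng.standard_normal(40)])
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(40)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
se = np.sqrt(np.diag(cr2_cov(X, y - X @ beta, groups)))
print(se)
```

Without the (I - H_gg)^{-1/2} factor this reduces to the plain CR0 sandwich; CR1 instead multiplies CR0 by a scalar correction such as G/(G-1).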