sergiocorreia / reghdfe

Linear, IV and GMM Regressions With Any Number of Fixed Effects
http://scorreia.com/software/reghdfe/
MIT License

Erroneous reporting of multicollinearity #153

Open truan opened 5 years ago

truan commented 5 years ago

Hi, I've noticed that reghdfe reports a multicollinearity problem when there does not seem to be one. The analysis I run is a difference-in-differences specification. When I have 3 levels of fixed effects, individual interacted with quarter, weekday, and month, i.e. reghdfe depvar treated_post, absorb(cid#yyyyqq weekday yyyymm), the estimation goes through without an issue. When I relax the 1st level of FE to be individual, i.e. reghdfe depvar treated_post, absorb(cid weekday yyyymm), I get the following warning even after I decrease the tolerance:

warning: depvar might be perfectly explained by fixed effects (tol = 1.0e-29) note: treated_post is probably collinear with the fixed effects (all values close to zero after partialling-out; tol = 1.0e-29)

This does not sound right. If I use the areg command instead (areg depvar treated_post i.weekday i.yyyymm, a(cid)), I get no issue. I also tried demeaning the variables myself and found that not all values are close to zero after partialling-out. Is there an error?

I am using your latest version available from GitHub, "version 5.2.10 29aug2018".

sergiocorreia commented 5 years ago

(Caveat emptor: The comments below are based on my limited knowledge of your dataset, and of the details of your computer and Stata version)


The warnings you got appear when the depvar or the regressors are perfectly or almost-perfectly explained by the fixed effects.

In particular, as coded here, given a variable y and the residuals e from regressing it against all the fixed effects, reghdfe will flag y if the ratio e'e / y'y is very close to zero. Specifically, I flag a variable if this ratio is below 0.1 * tolerance; I set the threshold slightly below tolerance because of what tolerance means: numerical errors are allowed up to that value (roughly speaking...).
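To make the check concrete, here is a minimal sketch in Python (illustrative only, not reghdfe's actual code; the function name and single-fixed-effect setup are my own): partial out the fixed effect by within-group demeaning, then flag the variable if e'e / y'y falls below 0.1 * tolerance.

```python
import numpy as np

def looks_absorbed(y, groups, tol=1e-8):
    """Flag y as (near-)collinear with a single fixed effect (illustrative)."""
    e = y.astype(float)
    # Partial out the fixed effect: demean y within each group.
    for g in np.unique(groups):
        mask = groups == g
        e[mask] -= e[mask].mean()
    # Ratio of residual to original sum of squares, as described above.
    ratio = (e @ e) / (y @ y)
    return ratio < 0.1 * tol

groups = np.array([0, 0, 1, 1, 2, 2])
absorbed = np.array([5.0, 5.0, 3.0, 3.0, 7.0, 7.0])  # constant within group
varying  = np.array([5.0, 1.0, 3.0, 2.0, 7.0, 4.0])  # varies within group

print(looks_absorbed(absorbed, groups))  # True: perfectly explained by the FE
print(looks_absorbed(varying, groups))   # False
```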

Now, from your comment, it seems you took the tolerance all the way down to 1e-29. The problem is that reghdfe is usually only accurate up to around 1e-14 (a bit lower or higher depending on the dataset). This number is bounded by the numerical accuracy of your computer combined with Stata's. In particular, it's bounded by the machine epsilon, around 1e-16, which you can see in Stata by typing display epsdouble().
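The same bound holds in any IEEE double-precision environment, not just Stata; in Python, for example:

```python
import sys

eps = sys.float_info.epsilon  # machine epsilon for doubles, ~2.22e-16
print(eps)

# Requesting a tolerance below eps is meaningless: increments smaller
# than eps are simply lost to rounding when added to 1.0.
print(1.0 + eps / 2 == 1.0)  # True
print(1.0 + eps == 1.0)      # False
```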

Now, the problem is that all bets are off if you use tolerances smaller than 1e-16, because Stata can't achieve that. Usually what happens is that reghdfe will run forever without converging, but in your case it seems that it converges (perhaps because of the perfect correlation??)

Further, because the criterion for when to drop a variable is linked to tolerance, the variable might not get dropped if you set very small tolerances (that's one of the reasons the warnings are hedged with a "might" and a "probably").

You can see an example of the issue here:

sysuse auto, clear
reghdfe turn, a(trunk turn)
reghdfe turn, a(trunk turn) tol(1e-100)

Finally, there's still one nagging question: why did the code converge if you set such a small tolerance? I suspect the answer lies in this line: if the error gets too close to machine tolerance, the default solver will assume convergence and stop.
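A hedged sketch of why the perfect-correlation case can still "converge" (again illustrative Python, not reghdfe's actual solver): when the variable is exactly constant within groups, one demeaning pass leaves residuals of exactly zero, so the next pass changes nothing, and an update of 0.0 satisfies any tolerance, even 1e-100.

```python
import numpy as np

groups = np.array([0, 0, 1, 1])
y = np.array([4.0, 4.0, 9.0, 9.0])  # perfectly explained by the group FE

e = y.copy()
for g in np.unique(groups):
    mask = groups == g
    e[mask] -= e[mask].mean()

# Residuals are exactly zero after one pass; a second pass would leave
# them unchanged, so the iteration-to-iteration update is 0.0 < 1e-100
# and an update-based stopping rule declares convergence.
print(e)  # [0. 0. 0. 0.]
```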

Now, what can you do? I would start with two things:

1. Use a tolerance within what your machine can actually deliver (e.g. 1e-10 or 1e-14) instead of 1e-29.

2. Run reghdfe depvar, absorb(cid weekday yyyymm) and reghdfe treated_post, absorb(cid weekday yyyymm) on their own, and check whether either reports an R2 of 1 (i.e. is perfectly explained by the fixed effects).

Best, Sergio

truan commented 5 years ago

Hi Sergio,

Thanks for your detailed answer. I tried tolerances of 1e-10 and 1e-14 (my machine epsilon is 2.220e-16) and had no luck. Next I tried your second suggestion: reghdfe depvar, absorb(cid weekday yyyymm) and reghdfe treated_post, absorb(cid weekday yyyymm). Both reported an R2 of 1. But here are some things I don't get:

  1. `reghdfe treated_post, absorb(cid#yyyyqq weekday yyyymm)` does not give me an R2 of 1. How can using a single fixed effect for each individual result in a perfect R2, while allowing a bunch of different fixed effects for the same individual, one for each quarter, results in an R2 less than 1?

  2. areg spendingamt_win_filled i.weekday i.yyyymm, absorb(cid) reports an R2 less than 1.

I am happy to come up with an MWE. How should I share it with you?

sergiocorreia commented 5 years ago

You can send it to sergio.correia@gmail.com (or sergio.a.correia@frb.gov , although the former is easier for me).

I'm also unsure what the issue could be, as I haven't heard of anything like it.

truan commented 5 years ago

Hi Sergio, I have sent you an MWE. Thanks a lot!

sergiocorreia commented 5 years ago

One question: was that dataset somehow automatically generated by a non-Stata tool? I played around a bit and I'm not entirely sure, but if you do

rename cid_of_match index

you solve the issue. It kind of feels as if the dataset appears sorted while not actually being sorted. Also, having a float as the sort index is a bad idea, but that should be unrelated. Note that you also solve the issue if you delete the variable instead of renaming it.

Now, reghdfe depends on ftools, which tries to speed some things up if it finds that the dataset is already sorted. So I'm guessing the problem comes from this.

Finally, this problem somehow made Stata end up with all missing values, so when it computed the ratio above it got zero. I added a check in a new version of reghdfe that will raise an error if there are missing values in the partialled-out variables, to prevent this from happening again.

All in all, I'm not sure whether the issue is the dataset being incorrectly marked as sorted, or something else, perhaps in ftools. The update to reghdfe should prevent any cases of this failing silently in the future, but it would be good to know the underlying cause.

truan commented 5 years ago

Hi Sergio,

Thanks for looking into this and adding a check. The data I sent you was generated by Stata. I sorted on the variable cid_of_match in one of the steps when I transformed my real data into the toy dataset; I had this sorting step in my actual analysis to generate some variables I use.

I played around a bit and found that the estimation goes through if I sort on my individual identifier cid (with either Stata's regular sort or hashsort from gtools) before running reghdfe depvar treated_post, absorb(cid weekday yyyymm), or if I generate another float individual identifier with egen group and use it as the absorbing variable in lieu of cid. Based on these observations, I don't think having a float index is the cause of this problem.

sergiocorreia commented 5 years ago

Hi Truan,

Two quick questions: what version of Stata do you use, and does the error still occur with the newest version of reghdfe?

Thanks!

truan commented 5 years ago

Hi,

I tried the latest version (version 5.3.2 30nov2018) on two different versions of Stata (Stata/MP 15.1 2-core and Stata/SE 15.1, both for Windows (64-bit x86-64)). The same problem still existed.