m-cahana opened this issue 5 years ago
Thanks for paying close enough attention to catch this. It looks like a numerical precision problem caused by collinearity on the RHS. Is it possible that some of the firms are being drawn with just one or two wells, and those wells are being placed in grids that don't have wells from other firms? That would create collinearity, though I would have thought R would be smart enough to drop the collinear FEs before proceeding with the regression. (When you run it in Stata, do you see Stata drop variables before the regression? Stata is explicit when it does this.)
Within the mock dataset I don't think this would be possible. The number of wells per firm is going to be around 1,100, and in the few iterations of mock data that I created, the smallest firm still had close to 1,000 wells. It's also very likely that grids in the mock data will contain wells from multiple firms; in the mock dataset that I created, every grid contains wells from at least 12 firms.
Stata drops the `Z:2015` interaction because of collinearity, but I would have thought R would know to make the same omission.
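For reference, a quick way to check both of those counts on df (a minimal sketch; the dplyr calls and the firm/grid column names are assumptions about the data layout):

```r
library(dplyr)

# Wells per firm: the smallest firm should still have close to 1,000 wells
df %>%
  count(firm) %>%
  summarise(min_wells = min(n))

# Firms per grid: every grid should contain wells from many different firms
df %>%
  group_by(grid) %>%
  summarise(n_firms = n_distinct(firm)) %>%
  summarise(min_firms = min(n_firms))
```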
What happens if you tell R to estimate a model that is obviously collinear (i.e., test out a relatively small fake dataset with perfectly collinear (or nearly collinear?) regressors)? Does it explicitly drop variables, or does it behave like what we're seeing above?
If you estimate a model that's obviously collinear (the fake dataset I created, for instance, had variables y, x1, and x2, with x2 equal to 5 * x1), the `felm` function does not explicitly drop variables; rather, it attempts to estimate coefficients for all RHS variables and outputs the following warning message:

`In sqrt(diag(z$STATS[[lhs]]$robustvcv)) : NaNs produced`

However, if you estimate the same model using `lm`, the x2 variable is explicitly dropped because of singularity, as we'd expect.
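A minimal sketch of that comparison, with a made-up dataset along the lines described above:

```r
library(lfe)

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 5 * x1                    # perfectly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)
fake <- data.frame(y, x1, x2)

# lm() drops x2 (its coefficient is reported as NA) because of singularity
summary(lm(y ~ x1 + x2, data = fake))

# felm() tries to estimate both coefficients and emits NaN warnings instead
summary(felm(y ~ x1 + x2, data = fake))
```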
A couple thoughts:
1) In the two original estimates you posted, the differences between the `Z:year` coefficient estimates are actually the same across estimates. Since these differences are all we care about (the level is meaningless), we're actually OK(ish). Still, this problem will be annoying if we want to graph coefficients with standard errors, since the estimates in levels will still be moving around randomly every time we re-run the model.
2) See https://cran.r-project.org/web/packages/lfe/vignettes/identification.pdf, which discusses non-obvious collinearity problems that can arise in `felm` with three dimensions of FE.
One more thing to try: `model <- felm(output ~ Z:year + year | firm + grid | 0 | firm, data = df, exactDOF=TRUE)`, which may run cleanly. (Or try the same thing with `firm` rather than `year`.)

Yeah, I think your read is right - running `model <- felm(output ~ Z:year + year | firm + grid | 0 | firm, data = df, exactDOF=TRUE)` runs cleanly, and trying to pull out different FEs like `firm` or `grid` results in consistent regression output (although with warning messages).
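For reference, here's a minimal sketch of what I mean by pulling out the FEs, using lfe's getfe() on the model above:

```r
# Estimate with year moved out of the projected FEs, as suggested
model <- felm(output ~ Z:year + year | firm + grid | 0 | firm,
              data = df, exactDOF = TRUE)

# Recover the projected-out fixed effects; the "fe" column says whether a
# row is a firm effect or a grid effect
fe <- getfe(model)
head(fe[fe$fe == "firm", ])
head(fe[fe$fe == "grid", ])
```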
However, `felm` still outputs different coefficient estimates than Stata does, and from what I can tell the differences across estimates look different as well.
I was thinking it may be more convenient for us going forward to use the `reghdfe` command in Stata to evaluate these regressions. I'm going to see if I can use the RStata package to send our dataset to Stata and return the result of a `reghdfe` regression back to R.
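Roughly what I have in mind (a sketch only; the Stata path, version, and exact reghdfe syntax below are placeholders that would still need to be verified):

```r
library(RStata)

# Placeholder path and version; these need to match the local Stata install
options("RStata.StataPath" = "/Applications/Stata/StataSE.app/Contents/MacOS/stata-se")
options("RStata.StataVersion" = 15)

# Send df to Stata, run reghdfe there, and bring the dataset back into R
stata_cmd <- "reghdfe output c.Z#i.year, absorb(firm year grid) vce(cluster firm)"
out <- stata(stata_cmd, data.in = df, data.out = TRUE)
```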
Sounds good.
When I evaluate a model multiple times using the `felm` function, I get different results, with different coefficients and standard errors. I am evaluating the following model:
model <- felm(output ~ Z:year | firm + year + grid | 0 | firm, data = df, exactDOF=TRUE)
My architecture is x86_64, and my platform is x86_64-apple-darwin15.6.0. I am using R version 3.5.1, and lfe_2.8-2.
I found this error using real data, and it persists with mock data. Here is all the code necessary to reproduce the mock dataset df:
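(A minimal sketch of a mock df with this structure; the sizes and variable definitions here are illustrative placeholders rather than the exact code:)

```r
set.seed(123)

# All sizes and variable definitions below are illustrative placeholders;
# the real df just needs output, Z, year, firm, and grid columns
n_firms        <- 20
wells_per_firm <- 1100
n_grids        <- 50
years          <- 2010:2017

df <- expand.grid(
  well = 1:wells_per_firm,
  firm = factor(1:n_firms)
)
df$grid   <- factor(sample(1:n_grids, nrow(df), replace = TRUE))
df$year   <- factor(sample(years, nrow(df), replace = TRUE))
df$Z      <- rbinom(nrow(df), 1, 0.5)
df$output <- rnorm(nrow(df))
```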
When I evaluated my `felm` model multiple times, I expected to see consistent output. Instead, evaluating the same formula multiple times produced different results. Sometimes, evaluating `model` will result in one warning message and summary; other times, evaluating the same exact `model` will result in a different warning message and summary. It may take more than two attempts to see different results, but if you evaluate `model` 10 times, you should see results differ. Note that I attempted to run this code on a Windows computer with the same R version and lfe version, and encountered the same issue. I also ran this model in Stata/SE 15.1, and results were consistent across multiple tries, but different from the `felm` output.