felm outputs different results when evaluating the same formula multiple times

m-cahana commented 5 years ago

When I evaluate a model multiple times using the felm function, different results, with different coefficients and standard errors, are outputted. I am evaluating the following model:

model <- felm(output ~ Z:year | firm + year + grid | 0 | firm, data = df, exactDOF=TRUE)

My architecture is x86_64, and my platform is x86_64-apple-darwin15.6.0. I am using R version 3.5.1, and lfe_2.8-2.

I found this error using real data, and found that it persists with mock data. Here is all the code necessary to reproduce the mock dataset df:

library(tidyverse)
library(lfe)

#===========
# set up parameters
#===========

lambda_N <- 1127.9
lambda_Z <- 1182
p <- 12/88 

firms <- seq(1,88)
grids <- seq(1,3324)
years <- seq(2005,2015)

lambda_output <- 85086

#===========
# create mock data
#===========

df <- 
    data.frame(firm=integer(),
                 Z=double(), 
                 grid=integer(), 
                 year=integer(), 
                 output = double()) 

for (i in 1:length(firms)) {
    N <- rpois(1, lambda_N)
    random_p <- sample(1:88,  1, prob=rep(1/88, 88))
    if  (random_p<=(p*88)) {
        Z <- 0
    } else {
        Z <- rpois(1, lambda_Z)
    }
    firm <- firms[i]
    print(i)

    for (well in 1:N) {
        grid <- sample(grids, 1, prob=rep(1/3324, 3324))
        year <- sample(years, 1, prob=rep(1/11, 11))
        output <- rpois(1, lambda_output) 
        row <- data.frame(firm = firm, 
            Z = Z, 
            grid = grid, 
            year = year, 
            output = output)
        df <- df %>% bind_rows(row)
    }
}

df <- 
    df %>% 
    mutate(year = as.factor(year))

When I evaluated my felm model multiple times, I expected to see consistent output. Instead, evaluating the same formula multiple times produced different results. Sometimes, evaluating model will result in the following warning message and summary:

> model <- felm(output ~ Z:year | firm + year + grid | 0 | firm, data = df, exactDOF=TRUE)
Warning message:
In sqrt(diag(z$STATS[[lhs]]$clustervcv)) : NaNs produced
> summary(model)

Call:
   felm(formula = output ~ Z:year | firm + year + grid | 0 | firm,      data = df, exactDOF = TRUE) 

Residuals:
    Min      1Q  Median      3Q     Max 
-1366.5  -194.6     1.2   192.0  1180.7 

Coefficients:
             Estimate Cluster s.e. t value Pr(>|t|)
Z:year2005 -1.902e-02           NA      NA       NA
Z:year2006 -1.766e-02           NA      NA       NA
Z:year2007 -1.034e-02           NA      NA       NA
Z:year2008 -1.842e-02           NA      NA       NA
Z:year2009 -1.220e-02           NA      NA       NA
Z:year2010 -4.081e-03           NA      NA       NA
Z:year2011 -6.388e-03           NA      NA       NA
Z:year2012 -3.385e-02           NA      NA       NA
Z:year2013  9.482e-05           NA      NA       NA
Z:year2014 -3.577e-02           NA      NA       NA
Z:year2015 -2.290e-02           NA      NA       NA

Residual standard error: 292.2 on 95813 degrees of freedom
Multiple R-squared(full model): 0.03559   Adjusted R-squared: 0.001058 
Multiple R-squared(proj model): 0.0001241   Adjusted R-squared: -0.03568 
F-statistic(full model, *iid*):1.031 on 3431 and 95813 DF, p-value: 0.1071 
F-statistic(proj model): 0.1735 on 11 and 87 DF, p-value: 0.9986

However other times, evaluating the same exact model will result in a different warning message and summary:

> model <- felm(output ~ Z:year | firm + year + grid | 0 | firm, data = df, exactDOF=TRUE)
Warning message:
In chol.default(mat, pivot = TRUE, tol = tol) :
  the matrix is either rank-deficient or indefinite
> summary(model)

Call:
   felm(formula = output ~ Z:year | firm + year + grid | 0 | firm,      data = df, exactDOF = TRUE) 

Residuals:
    Min      1Q  Median      3Q     Max 
-1366.5  -194.6     1.2   192.0  1180.7 

Coefficients:
            Estimate Cluster s.e. t value Pr(>|t|)  
Z:year2005 -0.006822     0.013123  -0.520   0.6032  
Z:year2006 -0.005463     0.013488  -0.405   0.6855  
Z:year2007  0.001857     0.011401   0.163   0.8706  
Z:year2008 -0.006221     0.015386  -0.404   0.6860  
Z:year2009        NA     0.000000      NA       NA  
Z:year2010  0.008115     0.010147   0.800   0.4239  
Z:year2011  0.005808     0.011648   0.499   0.6180  
Z:year2012 -0.021655     0.011238  -1.927   0.0540 .
Z:year2013  0.012291     0.011773   1.044   0.2965  
Z:year2014 -0.023575     0.012327  -1.913   0.0558 .
Z:year2015 -0.010704     0.007292  -1.468   0.1421  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 292.2 on 95814 degrees of freedom
Multiple R-squared(full model): 0.03559   Adjusted R-squared: 0.001069 
Multiple R-squared(proj model): 0.0001241   Adjusted R-squared: -0.03567 
F-statistic(full model, *iid*):1.031 on 3430 and 95814 DF, p-value: 0.1049 
F-statistic(proj model): 7.716 on 11 and 87 DF, p-value: 3.787e-09

It may take more than two attempts to see different results, but if you evaluate model 10 times, you should see results differ. Note that I attempted to run this code on a Windows computer with the same R version and lfe version, and encountered the same issue. I also ran this model in Stata/SE 15.1, and results were consistent across multiple tries, but different from the felm output.

kelloggrk commented 5 years ago

Thanks for paying close enough attention to catch this. It looks like a numerical precision problem caused by collinearity on the RHS. Is it possible that some of the firms are being drawn with just one or two wells, and those wells are being placed in grids that don't have wells from other firms? That would create collinearity, though I would have thought R would be smart enough to drop the collinear FEs before proceeding with the regression. (When you run it in Stata, do you see Stata drop variables before the regression? Stata is explicit when it does this.)

m-cahana commented 5 years ago

Within the mock dataset I don't think this would not be possible. The number of wells per firm is going to be around 1100, and in the few iterations of mock data that I created, the smallest firm still had close to 1000 wells. It's also very likely that grids in the mock data will contain wells from multiple firms, and in the mock dataset that I created, grids contain wells from 12 firms at the least.

Stata drops the Z:2015 interaction because of collinearity, but I would have thought R would know to make the same omission.

kelloggrk commented 5 years ago

What happens if you tell R to estimate a model that is obviously collinear (i.e., test out a relatively small fake dataset with perfectly collinear (or nearly collinear?) regressors)? Does it explicitly drop variables, or does it behave like what we're seeing above?

m-cahana commented 5 years ago

If you estimate a model that's obviously collinear (the fake dataset I created, for instance, had variables y, x1, and x2, x2 being equal to 5 * x1), the felm package does not explicitly drop variables, rather it attempts to estimate coefficients for all RHS variables and outputs the following warning message:

In sqrt(diag(z$STATS[[lhs]]$robustvcv)) : NaNs produced

However if you are to estimate the same model using the lm package, the x2 variable is explicitly dropped because of singularity, as we'd expect.

kelloggrk commented 5 years ago

A couple thoughts: 1) In the two original estimates you posted, the differences between the Z:year coefficient estimates are actually the same across estimates. Since these differences are all we care about (the level is meaningless), we're actually OK(ish). Still, this problem will be annoying if we want to graph coefficients with standard errors, since the estimates in levels will still be moving around randomly every time we re-run the model. 2) See https://cran.r-project.org/web/packages/lfe/vignettes/identification.pdf, which discusses problems that happen with felm with three dimensions of FE in which non-obvious collinearity problems can arise.

My read is that R is much more likely to handle this estimator well if we pull one of the FE dimensions into the X variables and out of the FE. For instance, you can try running model <- felm(output ~ Z:year + year | firm + grid | 0 | firm, data = df, exactDOF=TRUE), which may run cleanly. (Or try the same thing with firm rather than year.)

m-cahana commented 5 years ago

Yeah I think your read is right - running model <- felm(output ~ Z:year + year | firm + grid | 0 | firm, data = df, exactDOF=TRUE) runs cleanly, and trying to pull out different FEs like firm or grid results in consistent regression output (although with warning messages).

However felm still outputs different coefficient estimates than Stata would, and differences across estimates look different as well from what I can tell.

I was thinking it may be more convenient for us going forward to use the reghdfe command in Stata to evaluate these regressions, I'm going to try and see if I can use the RStata package to communicate our dataset to Stata and return the result of a reghdfe regression back to R.

kelloggrk commented 5 years ago

sounds good

sgaure / lfe

felm outputs different results when evaluating the same formula multiple times #11