lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

feols "inverts" large enough singular matrices #447

Closed: jnaidoo closed this issue 5 months ago

jnaidoo commented 8 months ago

Hi fixest team

I am estimating a regression where the main coefficient of interest is a binary variable, which I treat as a factor. With a large enough dataset (at least 2 million observations), I am able to get feols to spit out coefficient estimates for both levels of the factor, even though there is also a constant in the regression.

The standard errors are also very large and all the test statistics are inaccurate, so clearly this is a numerical linear algebra problem. However, base R's lm() does not have the same problem at this scale. Example below.

library(tidyverse)
library(fixest)

set.seed(270494)
n = 2000000

my_factor = sample(-1:1, size = n, replace = TRUE, prob = c(0.01, 0.34, 0.65))
df <- tibble(my_factor = my_factor) %>% 
  mutate(my_factor = as_factor(my_factor)) %>% 
  mutate(my_factor = fct_recode(my_factor, "Yes" = "1", "No" = "0", "Don't Know" = "-1")) %>% 
  mutate(Y = rnorm(n, mean = 2, sd = 4))

# we should get estimates of 2 for the intercept and 0 for the coefficient on "my_factor"
# feols will produce spurious estimates for both levels of "my_factor"
# ... despite being collinear with the constant

feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"))

# lm() correctly drops one of the levels for "my_factor"

lm(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know")) %>% summary()

Here is the output of feols. As you can see, the estimates should be 2 for the intercept and 0 for the coefficient on my_factor, but we get nonsense numbers with huge standard errors and, worse, we get them for both levels of the factor variable.

> feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"))
OLS estimation, Dep. Var.: Y
Observations: 1,979,799 
Standard-errors: IID 
             Estimate Std. Error   t value Pr(>|t|) 
(Intercept)      1616   414264.1  0.003901  0.99689 
my_factorNo     -1612   414264.1 -0.003891  0.99690 
my_factorYes    -1612   414264.1 -0.003891  0.99690 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 4.46974   Adj. R2: -0.250278

With these same data lm() correctly drops one of the levels, and returns non-spurious estimates:

> lm(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know")) %>% summary()

Call:
lm(formula = Y ~ my_factor, data = df %>% filter(my_factor != 
    "Don't Know"))

Residuals:
     Min       1Q   Median       3Q      Max 
-19.2704  -2.6967  -0.0006   2.7003  19.1789 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.996950   0.004846 412.109   <2e-16 ***
my_factorYes 0.004933   0.005982   0.825     0.41    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.997 on 1979797 degrees of freedom
Multiple R-squared:  3.436e-07, Adjusted R-squared:  -1.616e-07 
F-statistic: 0.6802 on 1 and 1979797 DF,  p-value: 0.4095

Here's the result of sessionInfo() in case it helps:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin22.4.0 (64-bit)
Running under: macOS Ventura 13.5.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /usr/local/Cellar/r/4.3.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Africa/Johannesburg
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fixest_0.11.1   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
 [7] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.4        compiler_4.3.1      Rcpp_1.0.11         tidyselect_1.2.0    scales_1.2.1       
 [6] lattice_0.21-9      R6_2.5.1            generics_0.1.3      Formula_1.2-5       munsell_0.5.0      
[11] pillar_1.9.0        tzdb_0.4.0          rlang_1.1.1         utf8_1.2.3          stringi_1.7.12     
[16] timechange_0.2.0    cli_3.6.1           withr_2.5.1         magrittr_2.0.3      grid_4.3.1         
[21] rstudioapi_0.15.0   hms_1.1.3           sandwich_3.0-2      lifecycle_1.0.3     nlme_3.1-163       
[26] vctrs_0.6.3         glue_1.6.2          numDeriv_2016.8-1.1 zoo_1.8-12          fansi_1.0.5        
[31] colorspace_2.1-0    dreamerr_1.3.0      tools_4.3.1         pkgconfig_2.0.3    

Maybe this has something to do with the in-line filtering I did. Perhaps the optimal thing to do here is nothing, and simply to accept that there are limits to numerical precision, although a problem of this size seems well within the capabilities of a typical consumer-grade computer (in my case, a 2019 Mac mini).

Despite this complaint, I want to thank you for making such a useful package! I hope raising this issue is helpful to you.

Sincerely, Jesse Naidoo

s3alfisc commented 8 months ago

Hi @jnaidoo, this looks like a multicollinearity error that fixest does not handle out of the box; it is not a numerical problem. The issue arises because you drop an entire factor level, "Don't Know", from the data set, but the level is still encoded in the factor variable:

df <- df %>% filter(my_factor != "Don't Know")
levels(df$my_factor)
# [1] "Don't Know" "No"         "Yes"       
table(df$my_factor)
# Don't Know         No        Yes 
#          0     680706    1299390 

You can drop all unused levels via the droplevels function:

df <- droplevels(df)
levels(df$my_factor)
# [1] "No"  "Yes"
feols(Y ~ my_factor, data = df)
# Standard-errors: IID 
#              Estimate Std. Error    t value  Pr(>|t|)    
# (Intercept)  2.002123   0.004847 413.040663 < 2.2e-16 ***
# my_factorYes 0.001696   0.005984   0.283435   0.77684    

and then you get sensible results. In all likelihood this is because lm() drops unused levels itself (it builds its model frame with drop.unused.levels = TRUE), while fixest::feols() does not.
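
For tidyverse pipelines like the one in the original post, the same fix can be written inline with forcats::fct_drop() (forcats is already attached per the sessionInfo above); a minimal sketch, assuming the df built in the original example:

# filter, then drop the now-empty "Don't Know" level before estimating
df %>%
  filter(my_factor != "Don't Know") %>%
  mutate(my_factor = fct_drop(my_factor)) %>%
  feols(Y ~ my_factor, data = .)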

jnaidoo commented 8 months ago

I see! I did notice that I could not reproduce the error unless I included the third level of the factor (and then filtered it out). Thank you for the helpful response!

Jesse

s3alfisc commented 8 months ago

Hi Jesse - I would actually suggest keeping this issue open, because this might indeed be undesired behavior; and if it is, @lrberge might lose track of it once the issue is closed.

lrberge commented 5 months ago

Hi everyone! Thanks Alex for answering, you pinpointed the droplevels issue.

Collinearity is a tricky topic. Collinear variables are detected, and removed on the fly, during the Cholesky decomposition. This is known to be much less numerically stable than a QR decomposition; however, it is much faster and works fine most of the time.
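
For intuition, here is a toy R sketch of that detection idea; the function chol_with_drop() and its relative tolerance rule are illustrative assumptions, not fixest's actual (C++) implementation:

# toy sketch: a Cholesky factorization of X'X that drops variable j when the
# squared norm left after projecting out the previous columns falls below a
# relative tolerance -- the exact rule fixest uses may differ
chol_with_drop <- function(XtX, tol = 1e-10) {
  k <- ncol(XtX)
  L <- matrix(0, k, k)
  dropped <- integer(0)
  for (j in seq_len(k)) {
    # residual squared norm of column j after columns 1..j-1
    d <- XtX[j, j] - sum(L[j, seq_len(j - 1)]^2)
    if (!(d > tol * XtX[j, j])) {
      # (near-)collinear: record and skip; its zero column in L
      # excludes it from all later computations
      dropped <- c(dropped, j)
      next
    }
    L[j, j] <- sqrt(d)
    for (i in seq_len(k)[-seq_len(j)]) {
      L[i, j] <- (XtX[i, j] - sum(L[i, seq_len(j - 1)] * L[j, seq_len(j - 1)])) / L[j, j]
    }
  }
  list(L = L, dropped = dropped)
}

# intercept + both dummies of a two-level factor: exactly collinear columns
yes <- rbinom(1000, 1, 0.65)
X <- cbind(intercept = 1, no = 1 - yes, yes = yes)
chol_with_drop(crossprod(X))$dropped
#> [1] 3

Whether the residual norm of the redundant column lands below or above the tolerance depends on accumulated rounding, which is why a threshold like this can produce false negatives on large data.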

In the example posted, beyond the droplevels problem, this is a typical collinearity false negative. You can monitor collinearity detection with the argument collin.tol, whose default, 1e-10, might be too low. The collin.min_norm element of the fitted object reports the smallest column norm encountered during the decomposition, i.e. the value that collin.tol would have to exceed for a variable to be removed:

est = feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"))
est$collin.min_norm
#> [1] 1.164153e-10

That is just above the default threshold. Raising collin.tol to 1e-8 does the job:

> feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"), collin.tol = 1e-8)
The variable 'my_factorYes' has been removed because of collinearity (see $collin.var).
OLS estimation, Dep. Var.: Y
Observations: 1,979,799
Standard-errors: IID
             Estimate Std. Error    t value  Pr(>|t|)
(Intercept)  2.001883   0.003507 570.832722 < 2.2e-16 ***
my_factorNo -0.004933   0.005982  -0.824718   0.40953
... 1 variable was removed because of collinearity (my_factorYes)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.99741   Adj. R2: -1.616e-7
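
The removal message points at $collin.var, so the dropped variables can also be retrieved programmatically after estimation; a short example (assuming, per the message above, that the element stores the names of the removed variables):

est2 = feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"),
             collin.tol = 1e-8)
est2$collin.var
#> [1] "my_factorYes"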

droplevels

On droplevels: I will not build it into the software, because it is too time-consuming:

# NB: df_small is a smaller subset of df (its construction is not shown here)
system.time(droplevels(df_small))
#>   user  system elapsed 
#>   0.03    0.00    0.04

system.time(feols(Y ~ my_factor, data = df_small, collin.tol = 1e-8))
#> The variable 'my_factorYes' has been removed because of collinearity (see $collin.var).
#>    user  system elapsed 
#>    0.36    0.03    0.17

It would increase estimation time by 25%!!!

Thanks all!