Hi @jnaidoo, this looks like a multicollinearity error that fixest does not seem to handle out of the box. It's not a numerical problem. The issue arises because you drop an entire factor level, "Don't Know", from the data set, but the level is still encoded in the factor variable:
df <- df %>% filter(my_factor != "Don't Know")
levels(df$my_factor)
# [1] "Don't Know" "No" "Yes"
table(df$my_factor)
# Don't Know No Yes
# 0 680706 1299390
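(For reference: df itself is never shown in this thread. A minimal simulation along these lines, generated before the filter above, would reproduce the setup; the class shares and noise level are assumptions loosely matched to the reported counts and RMSE.)
set.seed(1)
n <- 2e6
my_factor <- factor(
  sample(c("Don't Know", "No", "Yes"), n, replace = TRUE,
         prob = c(0.01, 0.34, 0.65)),   # assumed shares
  levels = c("Don't Know", "No", "Yes"))
df <- data.frame(Y = 2 + rnorm(n, sd = 4),  # true intercept 2, RMSE ~ 4
                 my_factor = my_factor)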
You can drop all unused levels via the droplevels() function:
df <- droplevels(df)
levels(df$my_factor)
# [1] "No" "Yes"
feols(Y ~ my_factor, data = df)
# Standard-errors: IID
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.002123 0.004847 413.040663 < 2.2e-16 ***
# my_factorYes 0.001696 0.005984 0.283435 0.77684
and then you get sensible results. In all likelihood, lm() employs droplevels() while fixest::feols() does not.
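(As a quick check of that guess — a sketch, assuming a df like the simulated one above, with dplyr loaded as elsewhere in the thread — lm()'s rank-revealing pivoted QR flags the redundant dummy as aliased and returns NA for it, which looks like the level being dropped:)
m <- lm(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"))
alias(m)  # reports the perfectly collinear (aliased) term
coef(m)   # one of the factor coefficients comes back as NA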
I see! I did notice that I could not reproduce the error unless I included the third level of the factor (and then filtered it out). Thank you for the helpful response!
Jesse
Hi Jesse - I would actually suggest keeping this issue open, because this might indeed be undesired behavior; and if it is, @lrberge might lose track of it once it is closed.
Hi everyone! Thanks Alex for answering, you pinpointed the issue wrt droplevels().
Collinearity is a tricky topic. Collinear variables are detected, and removed on the fly, during the Cholesky decomposition. This is known to be much less numerically stable than a QR decomposition. However, it is much faster, and it works OK most of the time.
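(A minimal base-R illustration of the general point — not fixest's internal code — is the classic Läuchli example: forming X'X squares the conditioning, so a pivoted Cholesky of X'X can misjudge the rank that a QR of X preserves. The thread's false negative is the mirror image of the failure shown here.)
eps <- 1e-8
X   <- rbind(c(1, 1), c(eps, 0), c(0, eps))  # rank 2, but barely
qr(X, tol = 1e-10)$rank                      # 2: QR of X sees both columns
XtX <- crossprod(X)                          # 1 + eps^2 rounds to exactly 1
attr(chol(XtX, pivot = TRUE), "rank")        # 1 (with a warning): rank lost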
In the example posted, beyond the droplevels problem, this is a typical collinearity false negative. You can monitor collinearity detection with the argument collin.tol, whose default, 1e-10, might be too low.
We can look at the threshold at which one variable would have been removed with the collin.min_norm element:
est = feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"))
est$collin.min_norm
#> [1] 1.164153e-10
This is just above the threshold. Raising collin.tol to 1e-8 does the job:
feols(Y ~ my_factor, data = df %>% filter(my_factor != "Don't Know"), collin.tol = 1e-8)
The variable 'my_factorYes' has been removed because of collinearity (see $collin.var).
OLS estimation, Dep. Var.: Y
Observations: 1,979,799
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.001883 0.003507 570.832722 < 2.2e-16 ***
my_factorNo -0.004933 0.005982 -0.824718 0.40953
... 1 variable was removed because of collinearity (my_factorYes)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.99741 Adj. R2: -1.616e-7
On droplevels: I will not build it into the software because it is too time-consuming:
system.time(droplevels(df_small))
#> user system elapsed
#> 0.03 0.00 0.04
system.time(feols(Y ~ my_factor, data = df_small, collin.tol = 1e-8))
#> The variable 'my_factorYes' has been removed because of collinearity (see $collin.var).
#> user system elapsed
#> 0.36 0.03 0.17
It would increase estimation time by 25%!!!
Thanks all!
Hi fixest team,

I am estimating a regression where the main coefficient of interest is a binary variable, which I treat as a factor. With a large enough dataset (at least 2 million observations), I am able to get feols to spit out coefficient estimates for both levels of the factor, even though there is also a constant in the regression. The standard errors are also very large and all the test statistics are inaccurate, so clearly this is a numerical linear algebra problem. However, the lm method of base R does not have the same problem at this scale. Example below.

Here is the output of feols. As you can see, the estimates should be 2 for the intercept and 0 for the coefficient on my_factor, but we get nonsense numbers with huge standard errors, and worse, we get them for both levels of the factor variable.

With these same data, lm() correctly drops one of the levels and returns non-spurious estimates.

Here's the result of sessionInfo() in case it helps.

Despite this complaint, I want to thank you for making such a useful package! Maybe this has something to do with the in-line filtering I did.

Perhaps the optimal thing to do here is nothing, and simply accept that there are limits to numerical precision, although it seems like this sort of problem should be well within the capabilities of a typical consumer-grade computer (in my case, a 2019 Mac mini). Anyway, I hope raising this issue is helpful to you.

Sincerely,
Jesse Naidoo