lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/
361 stars 59 forks source link

fixef.rm = "singleton" drops smaller number of observations than reghdfe #478

Closed brendaprallon closed 4 months ago

brendaprallon commented 4 months ago

Hello! I have been banging my head against a wall with this for the past couple of days and can't figure out what is going on, so here I am. My apologies in advance because my reproducible example is not minimal; I really don't know what is driving this discrepancy. @lrberge I will DM you the data; for anyone else, it can be downloaded here.

I have been trying to replicate this paper in R using fixest. The authors use Stata's reghdfe. I use fixef.rm = TRUE to drop singletons, as reghdfe does by default. However, reghdfe reports deleting a higher number of singleton observations than feols.

Here is my code in R:

# Packages
library(readstata13)
library(fixest)

# Reading data 
data <- read.dta13("fullworking.dta", convert.factors = FALSE)

# Ensuring FE columns are factors
data$year <- as.factor(data$year)
data$wac <- as.factor(data$wac)
data$sicdiv <- as.factor(data$sicdiv)
data$jobn <- as.factor(data$jobn)
data$estid <-as.factor(data$estid)

# Regression
reg <- feols(lnhourlyc ~ merit | estid^jobn + year^jobn, data = data, fixef.rm = "singleton")
reg$nobs

NOTES: 1,238 observations removed because of NA values (LHS: 1,226, RHS: 1,226, Fixed-effects: 1,238).
       62,347 fixed-effect singletons were removed (62,290 observations, breakup: 62,267/80).
[1] 838364

Here is the code in Stata:

use fullworking.dta, clear
reghdfe lnhourlyc merit, absorb(i.estid#i.jobn i.year#i.jobn)

HDFE Linear regression                            Number of obs   =    838,351
Absorbing 2 HDFE groups                           F(   1, 680622) =     618.48
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.9353
                                                  Adj R-squared   =     0.9203
                                                  Within R-sq.    =     0.0009
                                                  Root MSE        =     0.0997

------------------------------------------------------------------------------
   lnhourlyc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       merit |  -.0133461   .0005367   -24.87   0.000    -.0143979   -.0122943
       _cons |   2.668113   .0002896  9214.36   0.000     2.667545     2.66868
------------------------------------------------------------------------------

Absorbed degrees of freedom:
------------------------------------------------------+
  Absorbed FE | Categories  - Redundant  = Num. Coefs |
--------------+---------------------------------------|
   estid#jobn |    156475           0      156475     |
    year#jobn |      1496         243        1253     |
------------------------------------------------------+

As you can see, the difference is just of 13 observations, but it is there. Some final information:

> packageVersion("fixest")
[1] ‘0.11.2’
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5

Stata:

. version
version 18.0

Thank you very much, both for the attention to the annoying issue and for the fantastic package!

etiennebacher commented 4 months ago

To eliminate a potential cause, using the same parameters as reghdfe doesn't change the number of observations:

reg <- feols(
  lnhourlyc ~ merit | estid^jobn + year^jobn, data = data, fixef.rm = "singleton", 
  fixef.tol = 1e-08, 
  fixef.iter = 16000
)
reg$nobs

[1] 838364
lrberge commented 4 months ago

Hi Branda: no need to bang your head any further, the algorithms are simply different. My algo is less sophisticated than the one of reghdfe, it simply applies one pass on the data.

To be clear, imagine the following data: obs fe1 fe2
1 1 1
2 2 1
3 2 2
4 3 2
5 3 2
6 3 2
Here the first FE has one singleton (first obs.), while the second FE has no singleton. My algo does just one pass and removes the first observation, leading to: obs fe1 fe2
2 2 1
3 2 2
4 3 2
5 3 2
6 3 2

And I stop there. Note that following the removal of the first observation, now the second FE has a singleton (second obs). reghdfe's algorithm recursively removes singletons until there is no one left.

So it applies a second pass, leading to: obs fe1 fe2
3 2 2
4 3 2
5 3 2
6 3 2
And finally a third pass, ending with: obs fe1 fe2
4 3 2
5 3 2
6 3 2

Bottom line

The algorithms are different. I will make it clear in the docs that the algorithm does not apply recursion -- otherwise it creates confusion. In the future I will implement the recursive algorithm, but currently the implementation is really low level (and fast) and updating it is non trivial.

Thanks for the effort in creating the issue, that's really appreciated!

(And thanks for the words :-))

lrberge commented 4 months ago

For full disclosure, I have planned to completely overhaul the code used to prepare the fixed-effects since I have a better algorithm now. At that time I'll rewrite the singleton removal. But it's a lot of c++ work, so it won't be soon sorry.

Currently, to replicate reghdfe, unfortunately you have to remove the singletons "by hand" before the estimation. Sorry.

brendaprallon commented 4 months ago

@lrberge thank you so much for your very didactic answer! Makes perfect sense. I will make a note of that and still keep using fixest because I need to eventually bootstrap the code and the small differences don’t justify the computational time that would take on Stata. Which just makes the point again: awesome package :)

s3alfisc commented 4 months ago

Hi @lrberge, not sure if helpful or not, but @styfenschaer has implemented the iterative singleton procedure for pyfixest (in numba) and it's fairly fast =)

lrberge commented 4 months ago

@s3alfisc: thanks! Actually writing one out from scratch is easy, the problem is that I do many things at once (not just the singletons), and the tricky thing is fitting everything together.