Open grantmcdermott opened 4 years ago
After some profiling and debugging, I think I've identified the culprit section of code (part of the internal makematrix() function): https://github.com/sgaure/lfe/blob/master/R/felm.R#L187-L191
The TL;DR version is that this part of the code replaces reference levels for the FE factors with NA. However, when two or more factors are fed through the subsequent interaction() call, then any interaction that contains a reference level for at least one of the parent factors is coerced to NA too.
Example: Say we have two factor-based vectors f1 = [1,2,3,4,5] and f2 = [a,b,c,d,e] that have reference levels "1" and "a", respectively. If we interact them, then we'd ideally want only a single reference case, e.g. "1.a". But what's happening at the moment is that "1.b", "1.c", ..., "2.a", "3.a", etc. are all getting coded as the reference level too, because they contain either "1" or "a".
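To make the NA propagation concrete, here is a minimal base-R sketch. It does not call lfe; the toy data frame and the "blank out the reference level first" step are my assumptions about what the linked lines do, not a copy of them. It only shows how interaction() treats NA in a parent factor.

```r
# Hypothetical toy data: every combination of two 5-level factors (25 rows)
d <- expand.grid(f1 = factor(1:5), f2 = factor(letters[1:5]))

# Assumed behaviour: each factor's reference level is set to NA
# *before* the factors are interacted
f1_na <- d$f1
f1_na[f1_na == "1"] <- NA
f2_na <- d$f2
f2_na[f2_na == "a"] <- NA

# Any row with NA in either parent becomes NA in the interaction
n_false_ref <- sum(is.na(interaction(f1_na, f2_na)))

# Alternative: interact first, then blank out only the single reference cell
f12 <- interaction(d$f1, d$f2)
f12[f12 == "1.a"] <- NA
n_true_ref <- sum(is.na(f12))

print(c(na_first = n_false_ref, interact_first = n_true_ref))
```

With the NA step applied first, every row containing "1" or "a" is lost, rather than just the one "1.a" cell.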
The bottom line, as far as I can tell, is that we end up with a lot of "false" reference cases that later cause a bottleneck when passed to the key demeanlist() function.
PR coming shortly.
Rerunning the above example with my PR branch:
devtools::load_all('~/Documents/Projects/lfe') ## feexp branch
microbenchmark(est1(), est2(), est3(), times = 1)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> est1() 1135.9062 1135.9062 1135.9062 1135.9062 1135.9062 1135.9062 1
#> est2() 615.8741 615.8741 615.8741 615.8741 615.8741 615.8741 1
#> est3() 816.3406 816.3406 816.3406 816.3406 816.3406 816.3406 1
Created on 2020-10-07 by the reprex package (v0.3.0)
Jumps around a bit, but I'm generally seeing a 15-25x improvement for this small(ish) example.
FWIW, I've also checked the output and it's the same, both among the three models and across my branch and the CRAN version. I'd appreciate others kicking the tires, though.
Here's a gist to test the functionality with a real-world data set. Things look fine (although admittedly this is my very first time using lfe). Unfortunately, the data is person-by-year so I can't do the multiplicative syntax.
I can confirm that @grantmcdermott's fix works for me. I ran a regression with 270,000 observations, 600 clusters, and about 10,000 fixed effects on my Windows computer. The most recent version of lfe available on CRAN had a runtime of 5 minutes. Grant's updated version has a runtime of 2.85 seconds. For comparison, reghdfe has a runtime of 3.06 seconds on Stata MP-6.
Just came across this issue (thanks to @reifjulian for the prompt).
The TL;DR version is that using interaction term expansion (i.e. f1*f2, or even f1:f2) in the FE slot causes a major slowdown. The latter is faster than the former, but still significantly slower than creating the interaction outside of the felm() call.
In the reprex below, I'm using an IV regression (adapted from the docs) since that's the use-case we've been troubleshooting. But I've tested a non-IV example and the effect is the same. From my limited testing, the relative disparities also appear to increase as the data get bigger.
PS. The felm() documentation warns users not to use * expansion in the FE slot. But AFAIK this only applies in cases where both variables have not been specified as factors.
Again, the first two cases with internal expansion (especially est1) are much slower than est3, which creates the interaction outside of the felm() call.
And just to confirm that they're yielding the same output:
Created on 2020-10-05 by the reprex package (v0.3.0)
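For anyone wanting to try the workaround without the original reprex, here is a rough sketch of what "creating the interaction outside of the felm() call" can look like. This is not the IV example from this thread: the data, the variable names (x, y, f1, f2, f12) and the model are all hypothetical, and the felm() calls only run if lfe happens to be installed.

```r
# Hypothetical toy data; names are illustrative only
set.seed(123)
n <- 1000
d <- data.frame(
  x  = rnorm(n),
  f1 = factor(sample(1:10, n, replace = TRUE)),
  f2 = factor(sample(letters[1:10], n, replace = TRUE))
)
d$y <- 2 * d$x + rnorm(n)

# Build the interacted fixed effect once, outside the model call
d$f12 <- interaction(d$f1, d$f2, drop = TRUE)

# Only fit the models if lfe is available
if (requireNamespace("lfe", quietly = TRUE)) {
  m_colon <- lfe::felm(y ~ x | f1:f2, data = d)  # interaction expanded inside felm()
  m_pre   <- lfe::felm(y ~ x | f12, data = d)    # interaction created beforehand
  print(coef(m_colon))
  print(coef(m_pre))
}
```

Wrapping each call in system.time() or microbenchmark() gives a quick way to compare the two specifications on your own data.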