sgaure / lfe

Source code repository for the R package lfe on CRAN.
53 stars 18 forks source link

Interaction expansion causes major slowdown #35

Open grantmcdermott opened 4 years ago

grantmcdermott commented 4 years ago

Just came across this issue (thanks to @reifjulian for the prompt).

The TL;DR version is that using interaction term expansion --- i.e. f1*f2, or even f1:f2 --- in the FE slot causes a major slowdown. The latter is faster than the former, but still significantly slower than creating the interaction outside of the felm() call.

In the reprex below, I'm using an IV regression (adapted from the docs) since that's the use-case we've been troubleshooting. But I've tested a non-IV example and the effect is the same. From my limited testing, the relative disparities also appear to increase as the data get bigger.

PS. felm() documentation warns users not to use * expansion in the FE slot. But AFAIK this only applies in cases where both variables have not been specified as factors.

library(lfe)
#> Loading required package: Matrix
library(broom)
library(microbenchmark)

## Create the dataset (riffing off of the felm help documentation)

set.seed(141556)
n = 2e5
# Covariates
x1 = rnorm(n)
x2 = rnorm(n)
# Individuals and firms
id = factor(sample(200,n,replace=TRUE))
firm = factor(sample(130,n,replace=TRUE))
# Effects for them
id.eff = rnorm(nlevels(id))
firm.eff = rnorm(nlevels(firm))
# Left hand side
u = rnorm(n)
y = x1 + 0.5*x2 + id.eff[id] + firm.eff[firm] + u
# Example with 'reverse causation' (IV regression)
# Q and W are instrumented by x3 and the factor x4.
x3 = rnorm(n)
x4 = sample(12,n,replace=TRUE)
Q = 0.3*x3 + x1 + 0.2*x2 + id.eff[id] + 0.3*log(x4) - 0.3*y + rnorm(n,sd=0.3)
W = 0.7*x3 - 2*x1 + 0.1*x2 - 0.7*id.eff[id] + 0.8*cos(x4) - 0.2*y+ rnorm(n,sd=0.6)
# Add them to the outcome variable
y = y + Q + W

## Create interaction exansion outside of the felm call
idfirm = factor(paste0(id, '_', firm))

## Temp functions for running the models (useful for microbenchmark)

est1 = function() felm(y ~ x1 + x2 | id * firm | (Q|W ~ x3 + factor(x4)) | id)
est2 = function() felm(y ~ x1 + x2 | id + firm + id:firm | (Q|W ~ x3 + factor(x4)) | id)
est3 = function() felm(y ~ x1 + x2 | id + firm + idfirm | (Q|W ~ x3 + factor(x4)) | id)

## Benchmark

microbenchmark(est1(), est2(), est3(), times = 1)
#> Unit: milliseconds
#>    expr       min        lq      mean    median        uq       max neval
#>  est1() 15284.549 15284.549 15284.549 15284.549 15284.549 15284.549     1
#>  est2()  8296.269  8296.269  8296.269  8296.269  8296.269  8296.269     1
#>  est3()   682.677   682.677   682.677   682.677   682.677   682.677     1

Again, the first two cases with internal expansion (especially est1) are much slower than est3, which creates the interaction outside of the felm() call.

And just to confirm that they're yielding the same output:

all.equal(tidy(est1()), tidy(est2()))
#> [1] TRUE
all.equal(tidy(est1()), tidy(est3()))
#> [1] TRUE

Created on 2020-10-05 by the reprex package (v0.3.0)

grantmcdermott commented 4 years ago

After some profiling and debugging, I think I've identified the culprit section of code (part of the internal makematrix() function): https://github.com/sgaure/lfe/blob/master/R/felm.R#L187-L191

The TL;DR version is that this part of the code replaces reference levels for the FE factors with NA. However, when two or more factors are fed through the subsequent interaction() call, then any interaction that contains a reference level for at least one of the parent factors is coerced to NA too.

Example: Say we have two factor-based vectors f1 = [1,2,3,4,5] and f2 = [a,b,c,d,e] that have reference levels "1" and "a", respectively. If we interact them, then we'd ideally want only a single reference case, e.g. "1.a". But what's happening at the moment is that "1.b", "1.c''... "2a", "3a" etc are all getting coded as the reference level too because they contain either "1" or "a".

The bottom line, as far as I can tell, is that we end up with a lot of "false" reference cases that later cause a bottleneck when passed to the key demeanlist() function.

PR coming shortly.

grantmcdermott commented 4 years ago

Rerunning the above example with my PR branch:

devtools::load_all('~/Documents/Projects/lfe')  ## feexp branch

microbenchmark(est1(), est2(), est3(), times = 1)
#> Unit: milliseconds
#>    expr       min        lq      mean    median        uq       max neval
#>  est1() 1135.9062 1135.9062 1135.9062 1135.9062 1135.9062 1135.9062     1
#>  est2()  615.8741  615.8741  615.8741  615.8741  615.8741  615.8741     1
#>  est3()  816.3406  816.3406  816.3406  816.3406  816.3406  816.3406     1

Created on 2020-10-07 by the reprex package (v0.3.0)

Jumps around a bit, but I'm generally seeing a 15-25x improvement for this small(ish) example.

FWIW, I've also checked the output and it's the same, both among the three models and across my branch and the CRAN version version. I'd appreciated others kicking the tires, though.

tyleransom commented 4 years ago

Here's a gist to test the functionality with a real-world data set. Things look fine (although admittedly this is my very first time using lfe). Unfortunately, the data is person-by-year so I can't do the multiplicative syntax.

reifjulian commented 4 years ago

I can confirm that @grantmcdermott's fix works for me. I ran a regression with 270,000 observations, 600 clusters, and about 10,000 fixed effects on my Windows computer. Most recent version of lfe available on CRAN had runtime of 5 minutes. Grant's updated version has runtime of 2.85 seconds. For comparison, reghdfe has runtime of 3.06 seconds on Stata MP-6.