kylebutts / did2s

Two-stage Difference-in-Differences package following Gardner (2021)
http://kylebutts.github.io/did2s

Speedup for `did2s()` #23

Closed by etiennebacher 1 year ago

etiennebacher commented 1 year ago

Hello, thanks a lot for this package! I saw your tweet about the large performance improvement you made, and I thought there might be more to gain.

I noticed that some lines in `make_g()` run several times when once would be enough. Running these lines only once can give a ~1.5x speedup in some cases. Below is a benchmark on a made-up dataset that should be similar to df_het (but larger). The function call is the same as in the README.
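The general pattern, moving a computation that does not depend on the loop variable out of the loop, can be sketched as below. This is not the actual code of `make_g()`: the functions `slow_version()` and `fast_version()`, and the use of `crossprod()` as the repeated step, are hypothetical stand-ins chosen only to illustrate the idea.

```r
# Hypothetical illustration of hoisting loop-invariant work; not the real make_g() code.
slow_version <- function(x, weights) {
  out <- numeric(length(weights))
  for (k in seq_along(weights)) {
    xtx <- crossprod(x)  # does not depend on k, yet recomputed every iteration
    out[k] <- weights[k] * sum(diag(xtx))
  }
  out
}

fast_version <- function(x, weights) {
  xtx_trace <- sum(diag(crossprod(x)))  # computed once, before the loop
  out <- numeric(length(weights))
  for (k in seq_along(weights)) {
    out[k] <- weights[k] * xtx_trace
  }
  out
}

set.seed(1)
x <- matrix(rnorm(2000 * 50), nrow = 2000)
weights <- seq_len(20)

res_slow <- slow_version(x, weights)
res_fast <- fast_version(x, weights)

# Both versions must return the same numbers; only the amount of work differs.
stopifnot(isTRUE(all.equal(res_slow, res_fast)))
```

Wrapping both calls in `bench::mark()` would show the difference in time and allocations; how large it is depends on how expensive the repeated step is relative to the rest of the loop.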

I don't know your preferences regarding the DESCRIPTION and NEWS so either let me know or feel free to commit directly in this branch.

Benchmark

Setup

suppressPackageStartupMessages(library(did2s))

foo <- list()
for (i in 1:2500) {
    g <- paste0("Group ", sample(1:3, 1))
    start_rel <- sample(c((-20):(-1), Inf), 1)
    if (is.infinite(start_rel)) {
        rel_year <- rep(Inf, 31)
    } else {
        rel_year <- seq(from = start_rel, by = 1, length.out = 31)
    }

    foo[[i]] <- data.frame(
        unit = rep(i, 31),
        state = rep(i+10000, 31),
        group = rep(g, 31),
        rel_year = rel_year,
        dep_var = rnorm(31),
        year = 1990:2020
    ) 
    foo[[i]]$treat <- as.numeric(!is.infinite(foo[[i]]$rel_year) & foo[[i]]$rel_year > 0)
}

dat <- data.table::rbindlist(foo)

head(dat)
#>    unit state   group rel_year    dep_var year treat
#> 1:    1 10001 Group 2      -17 -0.5553937 1990     0
#> 2:    1 10001 Group 2      -16  1.6351042 1991     0
#> 3:    1 10001 Group 2      -15  0.1172385 1992     0
#> 4:    1 10001 Group 2      -14  1.3107047 1993     0
#> 5:    1 10001 Group 2      -13 -2.3333151 1994     0
#> 6:    1 10001 Group 2      -12 -0.9034240 1995     0

Before

bench::mark(
    did2s = did2s(dat,
          yname = "dep_var", first_stage = ~ 0 | state + year, 
          second_stage = ~i(treat, ref=FALSE), treatment = "treat", 
          cluster_var = "state")
)
#> Running Two-stage Difference-in-Differences
#>  - first stage formula `~ 0 | state + year`
#>  - second stage formula `~ i(treat, ref = FALSE)`
#>  - The indicator variable that denotes when treatment is on is `treat`
#>  - Standard errors will be clustered by `state`
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 did2s         13.3s    13.3s    0.0754    7.71GB     9.20

After

bench::mark(
    did2s = did2s(dat,
          yname = "dep_var", first_stage = ~ 0 | state + year, 
          second_stage = ~i(treat, ref=FALSE), treatment = "treat", 
          cluster_var = "state")
)
#> Running Two-stage Difference-in-Differences
#>  - first stage formula `~ 0 | state + year`
#>  - second stage formula `~ i(treat, ref = FALSE)`
#>  - The indicator variable that denotes when treatment is on is `treat`
#>  - Standard errors will be clustered by `state`
#> 
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 did2s         8.72s    8.72s     0.115     4.1GB     7.57
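
As a quick check of the "~1.5x" figure, the ratios can be computed from the two outputs above. The numbers are copied from the benchmark tables; each comes from a single iteration, so they are rough indications rather than precise measurements.

```r
# Figures copied from the two bench::mark() outputs above
before <- c(time_s = 13.3, mem_gb = 7.71)
after  <- c(time_s = 8.72, mem_gb = 4.1)

# Improvement factors: before divided by after
ratio <- before / after
round(ratio, 2)
```

This gives a speedup of about 1.53x in run time and a reduction of about 1.88x in allocated memory.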

As a side note, I'm not very familiar with the package or its internals, so it's nice to have tests to be sure I didn't break anything. I think it would be even better to have some checks on numerical results (even just the first coefficient) to catch changes that don't raise an error but do modify the results.
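
One possible shape for such a numerical check is sketched below. It is only an illustration, not the tests that exist in the package: the simulated panel, the variable names, and the tolerance are all assumptions. Instead of hard-coding an expected coefficient, it compares the `did2s()` point estimate with a manual two-stage calculation in base R (fit state and year fixed effects on untreated observations, subtract them from every observation, then average the adjusted outcome over treated observations).

```r
library(did2s)

# Small simulated panel (hypothetical, for illustration only):
# 30 states over 10 years; states 1-10 treated from year 5,
# states 11-20 from year 8, states 21-30 never treated.
set.seed(42)
panel <- expand.grid(state = 1:30, year = 1:10)
start <- ifelse(panel$state <= 10, 5, ifelse(panel$state <= 20, 8, Inf))
panel$treat <- as.numeric(panel$year >= start)
panel$dep_var <- panel$state / 10 + panel$year / 5 +
  2 * panel$treat + rnorm(nrow(panel))

est <- did2s(panel,
  yname = "dep_var", first_stage = ~ 0 | state + year,
  second_stage = ~ i(treat, ref = FALSE), treatment = "treat",
  cluster_var = "state"
)

# Manual two-stage point estimate in base R
panel$state_f <- factor(panel$state)
panel$year_f <- factor(panel$year)
first <- lm(dep_var ~ state_f + year_f, data = panel[panel$treat == 0, ])
adj <- panel$dep_var - predict(first, newdata = panel)
manual <- mean(adj[panel$treat == 1])

# Coefficient on the treatment indicator reported by did2s()
pkg <- unname(coef(est)[grepl("treat", names(coef(est)))])[1]

# The two point estimates should agree up to numerical tolerance
stopifnot(isTRUE(all.equal(pkg, manual, tolerance = 1e-4)))
```

In a package test suite the last line could become `testthat::expect_equal(pkg, manual, tolerance = 1e-4)`. A check like this only covers the point estimate; standard errors would need a separate reference.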

kylebutts commented 1 year ago

Thanks! I was putting off doing that b/c I think I know how to speed that up by even more; but actual progress is much better than theoretical progress!

Fair point on having better tests. At this point I know the point estimates on the df_hom data, so I don't personally need them, but that's not sustainable. I'll add some tests here in a bit.

kylebutts commented 1 year ago

Thanks for the inspiration! I sped it up some more, with much lower memory use (96da875). Tests added too (475b749).

It is far too easy to accidentally make a sparse matrix dense... 🙃
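
A minimal sketch of what "making a sparse matrix dense" costs, using the Matrix package. This is not the change made in the commit above; the matrix dimensions and the indicator-style layout (one non-zero per row, as in a fixed-effect dummy matrix) are assumptions chosen for illustration.

```r
library(Matrix)

set.seed(1)
n <- 2000
k <- 200

# Sparse indicator matrix with exactly one non-zero entry per row
S <- sparseMatrix(
  i = seq_len(n),
  j = sample(k, n, replace = TRUE),
  x = 1,
  dims = c(n, k)
)

# Converting to a base matrix stores every zero explicitly
D <- as.matrix(S)

sparse_bytes <- as.numeric(object.size(S))
dense_bytes <- as.numeric(object.size(D))
dense_bytes / sparse_bytes  # how many times larger the dense copy is

# Staying within Matrix methods keeps the result sparse
XtX <- crossprod(S)
methods::is(S, "sparseMatrix")
methods::is(XtX, "sparseMatrix")
is.matrix(D)
```

Checking the class of intermediate objects with `methods::is(x, "sparseMatrix")` is a cheap way to spot the step where a conversion to dense happens.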

library(data.table)
library(did2s)

foo <- list()
for (i in 1:2500) {
  g <- paste0("Group ", sample(1:3, 1))
  start_rel <- sample(c((-20):(-1), Inf), 1)
  if (is.infinite(start_rel)) {
    rel_year <- rep(Inf, 31)
  } else {
    rel_year <- seq(from = start_rel, by = 1, length.out = 31)
  }

  foo[[i]] <- data.frame(
    unit = rep(i, 31),
    state = rep(i + 10000, 31),
    group = rep(g, 31),
    rel_year = rel_year,
    dep_var = rnorm(31),
    year = 1990:2020
  )
  foo[[i]]$treat <- as.numeric(!is.infinite(foo[[i]]$rel_year) & foo[[i]]$rel_year > 0)
}

dat <- data.table::rbindlist(foo)

bench::mark(
  did2s = did2s(dat,
    yname = "dep_var", first_stage = ~ 0 | state + year,
    second_stage = ~ i(treat, ref = FALSE), treatment = "treat",
    cluster_var = "state"
  )
)

# Your fixes:
# expression      min   median `itr/sec` mem_alloc 
# did2s         3.88s    3.88s     0.257    4.04GB

# Update:
# expression      min   median `itr/sec` mem_alloc 
# did2s          2.8s     2.8s     0.357     438MB

etiennebacher commented 1 year ago

Wow, what an improvement in memory alloc, nice!