grantmcdermott / etwfe

Extended two-way fixed effects
https://grantmcdermott.com/etwfe/
Other
51 stars 11 forks source link

Speed up emfx #19

Closed frederickluser closed 1 year ago

frederickluser commented 1 year ago

Hey Grant,

I added now the option collapse_data to emfx, which will significantly improve running time for large data sets.

If collapse_data = T, then the data will be collapsed using data.table before computing marginal effects and then weighted by group-cohort sizes in the marginaleffects command itself. The results are (almost) identical to the case without collapsing. If ivar is not NULL in etwfe, collapsing is not possible and a warning message appears.

I updated the tinytests and the checks produced no errors. Do you think this would be a sensible merge? You're welcome to request changes.

Best, Frederic

grantmcdermott commented 1 year ago

Super, thanks Frederic. I've been too busy to give this my full attention, but I'm hoping to get around to it later this week. Speeding up the emfx calculations is something I'm definitely interested in adding. Two quick thoughts that come to mind (without have looked at the code!)

  1. Does this aggregation trick work with nonlinear models (e.g., Poisson)? It would be good to check that we aren't smuggling in some linearity assumption.
  2. We could think about adding some internal logic so that we automatically default to the aggregating shortcut for datasets larger than, say, 100k rows. Users should be notified ofc and could select out of this functionality. In this case, the default arg would be something like collapse = "auto".

Excited to take a proper look.

frederickluser commented 1 year ago

Dear Grant, the aggregation trick works also with non-linear models, I tried a poisson. I also added an collapse = "auto" option and tested it, but my fork is now at conflict with the marginaleffects changes.

Best, Frederic

grantmcdermott commented 1 year ago

@frederickluser We've got some conflicts that need to be resolved due to the compatibility updates with marginaleffects 0.9.0 (#20). Feel free to have a crack at these, otherwise I'll try as soon as I get a chance. Goal is to submit to CRAN by the end of the day (Pacific Time).

grantmcdermott commented 1 year ago

Hi @frederickluser I've resolved most of the conflicts, but there are bunch of test failures on the test_emfx_xvar.R file that I'm unable to resolve.

Consider the simple_known object and what I get for the (nominally equivalent) function call of emfx(x1, xvar = "xvar"). I'm focusing on the estimate values to abstract from some of the other things that need to be fixed (e.g., ordering by xvar and some minor printing changes that stem from the marginaleffects update).

> simple_known[, c("estimate", "xvar")]
      estimate xvar
1 -0.004358929    1
2 -0.077394771    2
3 -0.121365807    3

> as.data.frame(emfx(x3, xvar = "xvar"))[, c("estimate", "xvar")] |> dplyr::arrange(xvar)
     estimate xvar
1 -0.01221530    1
2 -0.06043789    2
3 -0.09087307    3

Do you know why we're getting different estimates here? I'm using the same seed as the original file (set.seed(123)), so I don't think it can be that.

Can you please check against some known output on your side. I'm worried that the test results are changing through these different PRs...

frederickluser commented 1 year ago

Hey Grant. I messed up the check somehow, sorry.

The problem was that the collapse_data="auto" now switched to collapse_data=FALSE for this small sample data. But the simple_known was ran with collapse_data=T. I checked now both cases and the results are the same as before. For the simplest solution for the test_emfx_xvar.R function, I just overwrote the default to collapse_data = T.

The difference you've shown before is just the marginal difference between collapsing and using the full data. I hope that solves the problem.

grantmcdermott commented 1 year ago

Excellent. Thanks for fixing and for this nice pull request!

grantmcdermott commented 1 year ago

Heads-up that I've tweaked these functionality a bit (e.g. in https://github.com/grantmcdermott/etwfe/commit/ea3bb3e0e8ac534d6f3613b792eba2392b27e28a). Key user-level changes are: