Consider the following example: I cannot get the last line to return anything. I'm working in models with many predictor variables, and I'd like to collapse across those to get the marginal probabilities for one variable in each of the outcomes. Fitting a multinomial model with one predictor works fine with emmeans, but R hangs until I abort if I try to do it with 21 variables.
This happens because your computer does not have enough memory to hold the whole problem at once, so it has to page things in and out of memory, and that is why it seems to take forever. If I read your example correctly, you have 21 three-level predictors in this model and a three-level multinomial response. EMMs are based on a reference grid consisting of all possible combinations of the predictors, and there are 3^21 such combinations -- about 10.5 billion. So the reference-grid object includes a data.frame with 10.5 billion rows and 21 columns, and a matrix of linear coefficients with 10.5 billion rows and roughly three times as many columns as there are regression coefficients. Each of those elements consumes 8 bytes of memory. And there are other, smaller objects, plus your data set, plus whatever else is in your workspace, plus temporary storage needed in function calls.
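To make the scale concrete, here is a quick back-of-the-envelope calculation (my arithmetic, using the 8-byte figure above):

# Back-of-the-envelope memory estimate for the reference grid alone,
# assuming 8 bytes per element as described above
k <- 21                  # number of three-level predictors
rows <- 3^k              # 10,460,353,203 combinations
cols <- 21               # one grid column per predictor
rows * cols * 8 / 2^40   # the data.frame alone: roughly 1.6 TiB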
One thing I might suggest is to fit a model with most of the nuisance variables, then use predictions from that model as an offset in your main model. This obviously isn't the same model as you fitted, but at least you're accounting for the nuisance variables, and you get a reference grid of manageable size.
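For illustration only, a minimal sketch of what that offset idea might look like for a binomial logistic model, with hypothetical variable names (the multinomial case is more involved, since there is a separate linear predictor for each non-reference response level):

library(emmeans)
# Step 1: fit a model with only the nuisance predictors (hypothetical names)
nuis_fit <- glm(y ~ V3 + V4 + V5, family = binomial, data = dat)
# Step 2: carry its linear predictor into the main model as an offset
dat$nuis_lp <- predict(nuis_fit, type = "link")
main_fit <- glm(y ~ x1 + x2 + offset(nuis_lp), family = binomial, data = dat)
# The reference grid now spans only x1 and x2 (with the offset covariate
# held at its average), so its size is manageable
emmeans(main_fit, ~ x1)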
Yes, you're reading the code correctly. Thanks for the explanation. What would the code for that offset look like? I'm also surprised that a model that isn't that large causes such big problems. Is there an easier way to simplify these reference grids, so their size isn't exponential in the number of predictors?
I am traveling and unable to answer further in detail. It would not be at all straightforward to modify the code design, as the reference grid is fundamental to these techniques.
It may in principle be possible to pre-eliminate some nuisance factors as the reference grid is built, assuming they don't interact with other factors, but it would require a major redesign of the ref_grid() function, which already has complexities with tracking unused factor levels, multivariate responses, etc. I will think about this, but don't hold your breath that any remedy will be forthcoming anytime soon.
Here's an appetizer:
> emmeans(m2, ~ x | y, non.nuis = c("x","y"))
y = d:
x prob SE df lower.CL upper.CL
a 0.436 0.0230 86 0.390 0.482
b 0.333 0.0225 86 0.288 0.377
c 0.231 0.0192 86 0.193 0.269
y = e:
x prob SE df lower.CL upper.CL
a 0.436 0.0230 86 0.390 0.482
b 0.333 0.0225 86 0.288 0.377
c 0.231 0.0192 86 0.193 0.269
y = f:
x prob SE df lower.CL upper.CL
a 0.436 0.0230 86 0.390 0.482
b 0.333 0.0225 86 0.288 0.377
c 0.231 0.0192 86 0.193 0.269
Results are averaged over the levels of: 20 nuisance factors
Confidence level used: 0.95
> ref_grid(m2, non.nuis = c("x","y"))
'emmGrid' object with variables:
x = a, b, c
y = multivariate response levels: d, e, f
Nuisance factors that have been collapsed by averaging:
V1(3), V2(3), V3(3), V4(3), V5(3), V6(3), V7(3), V8(3), V9(3), V10(3), V11(3), V12(3), V13(3), V14(3), V15(3), V16(3), V17(3), V18(3), V19(3), V20(3)
The user interface requires specifying a list of nuisance factors (or its complement, as is done here).
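For example, with focal factors x and y and nuisance factors V1-V20, either specification below should be equivalent (a sketch using the model from above):

ref_grid(m2, nuisance = paste0("V", 1:20))   # list the nuisance factors...
ref_grid(m2, non.nuis = c("x", "y"))         # ...or list everything that is NOT nuisance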
The key insight is that a factor that doesn't interact with anything produces the same average predictions for any combination of the other factors. That makes it possible to average over each such factor one at a time, without ever constructing the full reference grid explicitly.
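Here is a small check of that insight using a built-in dataset (my example, not the package's code): with an additive model, the marginal average over one factor differs from any single-level prediction by a constant, so each non-interacting factor can be collapsed independently.

# Additive model: no interaction between wool and tension
fit  <- lm(breaks ~ wool + tension, data = warpbreaks)
grid <- expand.grid(wool    = levels(warpbreaks$wool),
                    tension = levels(warpbreaks$tension))
grid$pred <- predict(fit, grid)
# Marginal means for wool, averaging over all tension levels...
full <- tapply(grid$pred, grid$wool, mean)
# ...differ from the predictions at any single tension by the same constant:
sub <- grid$pred[grid$tension == "L"]
full - sub   # identical for both wool levels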
I need to clean up some parts of the code before I push it to GitHub, but once it's there, you can install and use it. Cleaning up will include testing for reference grids that would exceed a certain (user-specifiable) size, and raising an error with an informative message. This issue convinces me that such a check is necessary, because without it, people can get into the same memory-paging dilemma you experienced. I'm not ready, at least not yet, to use the nuisance-factor code by default, because it limits what kinds of averaging are done and I'm not 100% certain it's bulletproof.
Plus, I guess I already have a counterexample where it doesn't work right:
> emmeans(m2, ~ x | y, non.nuis = c("x","y", "V3", "V12", "V15", "V18"))
y = d:
x prob SE df lower.CL upper.CL
a 0.333 1.87e-10 86 0.333 0.333
b 0.333 1.84e-10 86 0.333 0.333
c 0.333 2.29e-10 86 0.333 0.333
y = e:
x prob SE df lower.CL upper.CL
a 0.333 1.87e-10 86 0.333 0.333
b 0.333 1.84e-10 86 0.333 0.333
c 0.333 2.29e-10 86 0.333 0.333
y = f:
x prob SE df lower.CL upper.CL
a 0.333 1.87e-10 86 0.333 0.333
b 0.333 1.84e-10 86 0.333 0.333
c 0.333 2.29e-10 86 0.333 0.333
Results are averaged over the levels of: 16 nuisance factors, V3, V12, V15, V18
Confidence level used: 0.95
This should have yielded exactly the same estimates...
I think I have it now. When there is a multivariate model (including multinomial), the X matrix gets expanded in blocks, and we have to do special bookkeeping and work through each multivariate response. So now I have:
> emmeans(m2, ~ x | y, non.nuis = "x")
y = d:
x prob SE df lower.CL upper.CL
a 0.361 0.0242 86 0.313 0.410
b 0.248 0.0230 86 0.202 0.293
c 0.171 0.0198 86 0.132 0.210
y = e:
x prob SE df lower.CL upper.CL
a 0.309 0.0232 86 0.263 0.355
b 0.560 0.0264 86 0.508 0.613
c 0.551 0.0264 86 0.499 0.604
y = f:
x prob SE df lower.CL upper.CL
a 0.330 0.0238 86 0.282 0.377
b 0.192 0.0207 86 0.151 0.233
c 0.278 0.0238 86 0.230 0.325
Results are averaged over the levels of: 20 nuisance factors
Confidence level used: 0.95
Note that now we have separate results for each y, whereas before we didn't, and that is explained entirely by the bookkeeping issue.
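As a toy illustration of that block structure (my sketch of the general idea, not the package's actual internals): with two non-reference response levels, the working model matrix is block-diagonal, one copy of X per level, and the nuisance averaging has to be applied within each block separately.

X <- matrix(1:6, nrow = 2)   # toy model matrix: 2 grid rows, 3 coefficients
kronecker(diag(2), X)        # block-diagonal expansion: one block per response level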
If we cut back on the nuisance factors, we now get reasonable results:
> emmeans(m2, ~ x | y, non.nuis = c("x", "V3", "V16"))
y = d:
x prob SE df lower.CL upper.CL
a 0.359 0.0240 86 0.312 0.407
b 0.246 0.0229 86 0.201 0.292
c 0.170 0.0197 86 0.131 0.209
y = e:
x prob SE df lower.CL upper.CL
a 0.308 0.0230 86 0.262 0.353
b 0.557 0.0263 86 0.505 0.610
c 0.549 0.0262 86 0.496 0.601
y = f:
x prob SE df lower.CL upper.CL
a 0.333 0.0235 86 0.286 0.380
b 0.196 0.0206 86 0.155 0.237
c 0.281 0.0234 86 0.235 0.328
Results are averaged over the levels of: 18 nuisance factors, V3, V16
Confidence level used: 0.95
These results are close, but not identical, to the previous ones -- and for an interesting reason. The default method for multivariate models is to compute the reference grid on the latent scale and then post-process it to display probabilities. So in the first example, we averaged over the 20 nuisance variables on the latent scale, then converted to probabilities. In the second one, we averaged over 18 nuisance variables on the latent scale, then converted to probabilities, then averaged over two more variables on the probability scale. I can verify that we get exactly the same results if everything's on the latent scale:
> emmeans(m2, ~ x * y, non.nuis = "x", mode = "latent")
x y emmean SE df lower.CL upper.CL
a d 0.08315 0.0700 86 -0.056 0.2223
b d -0.18752 0.0850 86 -0.356 -0.0186
c d -0.55144 0.0941 86 -0.738 -0.3644
a e -0.07451 0.0726 86 -0.219 0.0697
b e 0.62903 0.0717 86 0.486 0.7716
c e 0.61904 0.0723 86 0.475 0.7629
a f -0.00864 0.0718 86 -0.151 0.1341
b f -0.44151 0.0902 86 -0.621 -0.2621
c f -0.06760 0.0828 86 -0.232 0.0970
Results are averaged over the levels of: 20 nuisance factors
Results are given on the log (not the response) scale.
Confidence level used: 0.95
> emmeans(m2, ~ x * y, non.nuis = c("x", "V3", "V16"), mode = "latent")
x y emmean SE df lower.CL upper.CL
a d 0.08315 0.0700 86 -0.056 0.2223
b d -0.18752 0.0850 86 -0.356 -0.0186
c d -0.55144 0.0941 86 -0.738 -0.3644
a e -0.07451 0.0726 86 -0.219 0.0697
b e 0.62903 0.0717 86 0.486 0.7716
c e 0.61904 0.0723 86 0.475 0.7629
a f -0.00864 0.0718 86 -0.151 0.1341
b f -0.44151 0.0902 86 -0.621 -0.2621
c f -0.06760 0.0828 86 -0.232 0.0970
Results are averaged over the levels of: 18 nuisance factors, V3, V16
Results are given on the log (not the response) scale.
Confidence level used: 0.95
So an interesting byproduct of this nuisance-variable processing is that it makes it easy to do the averaging on the latent scale, which in my opinion is more desirable.
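A tiny numeric check (my example, with the logistic function standing in for the multinomial link) of why the order of averaging and transforming matters:

latent <- c(-0.5, 0.8)   # two latent-scale predictions to be averaged
plogis(mean(latent))     # average on the latent scale, then transform: ~0.537
mean(plogis(latent))     # transform first, then average: ~0.534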
Wow, this is all incredible. Thank you @rvlenth for the detailed explanation and for hopping on this quickly. I think it's a great development, because it allows people to fit a larger multinomial model and isolate significant predictors instead of looking at a giant book of crosstabs, which is how many folks look at polling data currently. This model-based approach, with the ability to look at marginal probabilities, makes it easier for one to look at many variables at once—and not get fooled by illusory correlations one might see in traditional banner books.
Let me know if you need anyone to review or test code—glad to help here in any way that I can.
Thanks. I'm pretty pleased that I was able to figure this out, as it is definitely worthwhile to not waste so much storage on nuisance factors -- and in your case, to make it even possible to deal with them.
In terms of testing, what would help is if, over the next couple of weeks, you could try this out on other datasets from your recent work -- especially more complex ones, e.g., models involving interactions among some primary factors. (By the way, declared nuisance factors that interact are silently tossed out of the nuisance pool.) Also try models with few enough variables that you can compare results with and without the nuisance-factor provisions. And of course let me know if any issues come up. Obviously this involves installing the latest update from GitHub; let me know if that is problematic and I can build a source package for you.
Also, I want to mention that for observational data, you may find it more appropriate to weight the averaging of nuisance factors according to observed proportions rather than equally. This is provided via the argument wt.nuis = "prop", versus the default wt.nuis = "equal" (and in emmeans(), via its weights argument). Actually, anything besides wt.nuis = "equal" results in the same proportional averaging; and since the averaging is done one factor at a time, this corresponds to weights = "outer" in emmeans().
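In sketch form, based on the arguments just described:

# Equal weights (the default) vs. weights proportional to observed frequencies
ref_grid(m2, non.nuis = c("x", "y"), wt.nuis = "equal")
ref_grid(m2, non.nuis = c("x", "y"), wt.nuis = "prop")
# In emmeans(), weights is passed through to ref_grid() as wt.nuis
emmeans(m2, ~ x | y, non.nuis = c("x", "y"), weights = "outer")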
My own testing will be on other univariate and multivariate models, such as ordinal responses, and on cases where the model formula contains function calls (e.g., y ~ ... + factor(v7)) or dataset references (e.g., y ~ ... + dat$v3), as these are the kinds of things that can mess up identifying the nuisance predictors.
Thanks for lighting a fire under me to get this done!
Awesome, I can do that testing! What GitHub branch should I install to test?
Just the main one. I do not do sophisticated git things...
Thanks
Russ
Please note that since your most recent comment, I have committed an update that makes some minor repairs to the code. They affect situations with function calls in the model formula, like factor(z) or poly(x1, x2, degree = 3), where the code was not correctly detecting whether a factor interacts with other factors. If you don't have stuff like that in the models you are testing, it shouldn't matter whether you reinstall with the most recent push.
I think this issue is resolved, so am closing it. You did raise an important question that led to a significant enhancement to the package -- the nuisance specification and related matters. Thanks!
Yes, definitely, thank you! A post about this new functionality (and why to use it instead of banner books) is the next one I'm working on for my blog, so hopefully that helps get the word out about this great feature.
I put together a little simulation, in which there are correlated variables but only two predict the actual outcome. The full code to reproduce the analysis is below, but first see the results:
> # get predicted probabilities -------------------------------------------------
> emmeans(m1, ~ y | a1, non.nuis = "a1")
a1 = 1:
y prob SE df lower.CL upper.CL
1 0.228686 0.072293 74 0.084639 0.37273
2 0.770477 0.072292 74 0.626432 0.91452
3 0.000837 0.000594 74 -0.000348 0.00202
a1 = 2:
y prob SE df lower.CL upper.CL
1 0.091781 0.035036 74 0.021970 0.16159
2 0.904481 0.035294 74 0.834157 0.97480
3 0.003738 0.002499 74 -0.001241 0.00872
Results are averaged over the levels of: 14 nuisance factors
Confidence level used: 0.95
> # compare to table function, standard crosstabs:
> with(dat, prop.table(table(a1, y), 1))
y
a1 1 2 3
1 0.35496183 0.62213740 0.02290076
2 0.14495798 0.74789916 0.10714286
>
> emmeans(m1, ~ y | b1, non.nuis = "b1")
b1 = 1:
y prob SE df lower.CL upper.CL
1 0.442692 0.099328 74 0.244778 0.640607
2 0.557155 0.099317 74 0.359262 0.755048
3 0.000153 0.000172 74 -0.000190 0.000496
b1 = 2:
y prob SE df lower.CL upper.CL
1 0.144150 0.048531 74 0.047450 0.240850
2 0.854208 0.048591 74 0.757389 0.951028
3 0.001642 0.001076 74 -0.000502 0.003785
b1 = 3:
y prob SE df lower.CL upper.CL
1 0.036880 0.019599 74 -0.002172 0.075932
2 0.946052 0.023007 74 0.900210 0.991893
3 0.017069 0.010821 74 -0.004493 0.038630
Results are averaged over the levels of: 14 nuisance factors
Confidence level used: 0.95
> # compare to table function, standard crosstabs:
> with(dat, prop.table(table(b1, y), 1))
y
b1 1 2 3
1 0.53500000 0.46000000 0.00500000
2 0.21331316 0.74281392 0.04387292
3 0.05035971 0.71223022 0.23741007
>
These differ because there are correlated predictors, and we are parceling that out when using emmeans with non.nuis, correct? I'm just making sure that non.nuis is working like we think it should. Note that the definition of y is just "as a1 and b1 increase, so does y," so the results make sense. Double-checking before I make a post about it!
library(nnet) # for multinomial logistic regression
library(emmeans) # for predicted probabilities
library(tidyverse) # for data cleaning
set.seed(1839)
n <- 1000
# simulating data --------------------------------------------------------------
## define correlation matrix for a variables
sigma_a <- matrix(rep(.5, 9), nrow = 3)
diag(sigma_a) <- 1
## and for b variables
sigma_b <- matrix(rep(.2, 25), nrow = 5)
diag(sigma_b) <- 1
## and for c variables
sigma_c <- matrix(rep(.3, 49), nrow = 7)
diag(sigma_c) <- 1
## then create these variables, joining them into one data frame
dat <- mvtnorm::rmvnorm(n, sigma = sigma_a) %>%
as_tibble() %>%
set_names(paste0("a", 1:ncol(.))) %>%
bind_cols({
mvtnorm::rmvnorm(n, sigma = sigma_b) %>%
as_tibble() %>%
set_names(paste0("b", 1:ncol(.)))
}) %>%
bind_cols({
mvtnorm::rmvnorm(n, sigma = sigma_c) %>%
as_tibble() %>%
set_names(paste0("c", 1:ncol(.)))
})
## generate outcome, y, which is only predicted by a1 and b1
dat$y <- .5 * dat$a1 + .7 * dat$b1 + rnorm(n)
## turn all columns into discrete variables
dat <- dat %>%
mutate(
y = factor(cut(y, breaks = 3, labels = FALSE)),
across(-y, ~ factor(cut(.x, breaks = sample(2:5, 1), labels = FALSE)))
)
# fit model --------------------------------------------------------------------
m1 <- multinom(y ~ ., dat)
m1_tests <- MASS::dropterm(m1, test = "Chisq")
m1_tests
# get predicted probabilities -------------------------------------------------
emmeans(m1, ~ y | a1, non.nuis = "a1")
# compare to table function, standard crosstabs:
with(dat, prop.table(table(a1, y), 1))
emmeans(m1, ~ y | b1, non.nuis = "b1")
# compare to table function, standard crosstabs:
with(dat, prop.table(table(b1, y), 1))
I'm not sure I fully understand the objectives here, but my impression is that we are comparing apples and oranges. Recall that by default, emmeans() computes equally-weighted averages, and you seem to be comparing those with marginal averages in the population, which are not equally weighted. So I think the comparisons will be more sensible if you add the arguments weights = "outer", wt.nuis = "prop" to the emmeans() call. (Actually, weights gets passed to ref_grid() as wt.nuis, and anything not matching "equal" is interpreted as "prop".)
But why are we not comparing the results of emmeans(m1, ~ y | a1 * b1, weights = "outer")?
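In sketch form, the apples-to-apples versions of those comparisons would be something like:

# Weight the nuisance averaging by observed frequencies to match the crosstabs
emmeans(m1, ~ y | a1, non.nuis = "a1", weights = "outer")
with(dat, prop.table(table(a1, y), 1))
# And look at both focal predictors together, as suggested above
emmeans(m1, ~ y | a1 * b1, weights = "outer")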