vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference.
https://marginaleffects.com

Support: `mlogit` #159

Closed: vincentarelbundock closed this issue 2 years ago

vincentarelbundock commented 2 years ago

The predict() method of the mlogit package seems to behave weirdly. We need one prediction per row per level of the outcome variable. How can we get those?

library(mlogit)
library(AER)
data("TravelMode", package = "AER")
dat <- TravelMode[-c(2, 7, 9),]
mod <- mlogit(choice ~ wait + gcost | income + size, dat)
predict(mod, newdata = head(dat))
  ## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0

library(mlogit)
library(AER)

data("TravelMode", package = "AER")
dat <- TravelMode[-c(2, 7, 9),]
mod <- mlogit(choice ~ wait + gcost | income + size, dat)

dat2 <- dat
dat2$gcost <- dat2$gcost + 10

predict(mod)
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0

predict(mod, newdata = dat)
#> Error in predict.mlogit(mod, newdata = dat): the number of rows of the data.frame should be a multiple of the number of alternatives

predict(mod, newdata = dat2)
#> Error in predict.mlogit(mod, newdata = dat2): the number of rows of the data.frame should be a multiple of the number of alternatives
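
For comparison, a sketch of the same call on the full TravelMode data (not run here), which keeps the row count a multiple of the number of alternatives and should return one predicted probability per row per level of the outcome:

library(mlogit)
library(AER)
data("TravelMode", package = "AER")
mod_full <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
# 8 rows = 2 individuals x 4 alternatives, so the row count is a valid multiple
predict(mod_full, newdata = head(TravelMode, 8))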
NorbertSch101 commented 2 years ago

Hi,

I just wanted to open an issue to ask whether the marginaleffects functions could be adapted to mlogit model objects (MNL, nested, heteroscedastic, random-effects MNL, probit). However, I saw this post, so I guess this has already been approached.

Apparently I got lucky, and the predict function does what it is expected to do. If you estimate and predict the model with the full TravelMode data, everything is fine. But due to your (alternative-wise) deletion, the choice set is unbalanced for the first three individuals. So the predict function apparently does not work when the choice set is unbalanced.

Maybe the fitted function is useful and allows an adaptation for marginaleffects?

I am currently working on missing data and just tried predict and fitted on my own datasets, deletion procedures, and discrete-choice models. The result is the same: while predict does not work on my alternative-wise deleted choice data, fitted does.
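
A minimal illustration of this predict()/fitted() asymmetry, using the TravelMode example from this issue (a sketch based on the behavior described above; fitted() is limited to the estimation sample):

library(mlogit)
library(AER)
data("TravelMode", package = "AER")
dat <- TravelMode[-c(2, 7, 9), ]  # alternative-wise deletion: unbalanced choice sets
mod <- mlogit(choice ~ wait + gcost | income + size, dat)
head(fitted(mod))              # works: fitted probability of the chosen alternative
# predict(mod, newdata = dat)  # fails: rows must be a multiple of the number of alternatives
# see ?fitted.mlogit for the argument that returns probabilities for all alternatives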

Best regards,

N

vincentarelbundock commented 2 years ago

Thanks for the comment and investigation. Would you be able to show me a replicable example of this? Everything I tried fails:

library(mlogit)
library(AER)

data("TravelMode", package = "AER")
dat <- TravelMode[-c(2, 7, 9),]
mod <- mlogit(choice ~ wait + gcost | income + size, dat)

dat2 <- dat
dat2$gcost <- dat2$gcost + 10

predict(mod)
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0

predict(mod, newdata = dat)
#> Error in predict.mlogit(mod, newdata = dat): the number of rows of the data.frame should be a multiple of the number of alternatives

predict(mod, newdata = dat2)
#> Error in predict.mlogit(mod, newdata = dat2): the number of rows of the data.frame should be a multiple of the number of alternatives
NorbertSch101 commented 2 years ago

Hey!

I do not know precisely, but I think mlogit is widely used, and it would be a good move if your package could deal with mlogit objects. So if you are interested in integrating mlogit into marginaleffects, I would appreciate it if we could work together on that. Although I am new to GitHub, I have worked with mlogit for three years, so I have some experience with how the package works and how its objects are organized.

The ggeffects package already has a predict routine for mlogit. That routine relies on a data-grid function which organizes a data matrix so that some variables are held constant (at the mean, for example) while the variables whose marginal effects one wants to investigate are varied. Yet the ggeffects predict function does not work, because the required data-grid function fails to manage the data matrix properly for mlogit. If you are interested, I can upload a workspace and a script for demonstration purposes. I have already written my own function to estimate marginal effects on newdata. It generally works like this: I define a set of variables for which I want marginal effects over a defined variation of their values, while holding the other variables at their means. I copied the self-written function below; although it is tailored to my specific analyses, it could be defined more generally.

Regarding your question, the error has to do with the issues in my last message. mlogit uses the long format even for cross-sectional data, and this has implications for both estimation and prediction. The long format results from the individuals n = 1, ..., N and the choice set k = 1, ..., K, so the data matrix typically contains RN = N*K rows.

There is a slight difference between the data handling for estimation and prediction: as long as every individual has a chosen alternative, mlogit can estimate models on data matrices with fewer than RN rows. But this does not work for the predict function, which strictly requires the full RN rows.

On the one hand this seems strange, because for an individual the probability of alternative k is the share of exp(X_k beta) in the sum of exp(X_j beta) over all available alternatives, and it could be calculated for choice sets smaller than K. From this point of view, a self-written function could easily compute, in a first step, the linear predictions for all individuals even with missing alternatives, and, in a second step, the probabilities.
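
To make this two-step idea concrete, a rough base R sketch of the within-individual softmax (my own illustration of the formula above, not a proposal for either package); it assumes a long-format data.frame with an individual id column and a precomputed linear-predictor column:

# dat_long: one row per available alternative, with columns `id` and `Xb` (= x'beta)
softmax_by_id <- function(dat_long) {
  expXb <- exp(dat_long$Xb)
  denom <- ave(expXb, dat_long$id, FUN = sum)  # sum over each individual's own choice set
  expXb / denom                                # probabilities, even with fewer than K alternatives
}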

On the other hand, this could introduce substantial bias, especially in small data sets. Imagine a scenario in which someone wants to predict probabilities where one half of the sample has the full choice set (e.g. 10 alternatives) while the other half has only two alternatives. Those two alternatives may have small probabilities in the full choice sets, but logically not in the part of the data with missing alternatives. Moreover, how would one calculate average marginal effects? Assign zeros for the missing alternatives and thereby shrink the average relative to the full choice sets? I assume this is the reason for the behavior of the predict function.

Kind regards,

Norbert

# Self-written helper. It depends on dplyr (which also provides the %>% pipe)
# and insight. For each "distance" variable in vars[1:4] and each party, the
# function replaces that party's value of the distance variable with every
# value in `variation` (holding all other variables at party-specific means),
# predicts choice probabilities, and tabulates the probability ("WSK") and the
# product probability * coefficient, reported as the marginal effect ("ME").
prob_mefx_mlogit <- function(model, partei, vars, variation) {

  ## Setup: distance variables, full variable vector, model data, and an
  ## alternative:variation index used to label the prediction grid
  ## (this assumes one party per alternative)
  var.dis <- vars[1:4]
  var.vec <- vars
  dat <- insight::get_data(model)
  idx <- paste0(rep(levels(dat$idx$id2), length(variation)), ":",
                sort(rep(variation, length(partei))))
  partei.vec <- rep(partei, length = dim(dat)[1])
  dat <- cbind(dat, partei.vec)

  ## Party-specific means of all model variables
  partei.mw <- dat %>%
    group_by(partei.vec) %>%
    summarize_at(vars(var.vec), mean) %>%
    select(!partei.vec)

  ## Loop over the distance variables ...
  for (dis in var.dis) {
    col.num.var <- which(colnames(partei.mw) == dis)
    assign(paste0("partei.mw.prep.", dis), NULL)
    assign(paste0("Tabelle.", dis), NULL)

    ## ... and over the parties: build a prediction grid in which only the
    ## current party's value of `dis` is replaced by each value in `variation`
    for (part in partei) {
      row.num.part <- which(partei == part)
      assign(paste0("partei.mw.prep.", dis, ".", part), NULL)

      for (val in variation) {
        temp <- partei.mw
        temp[row.num.part, col.num.var] <- val
        assign(paste0("partei.mw.prep.", dis, ".", part),
               rbind(get(paste0("partei.mw.prep.", dis, ".", part)), temp))
      }

      ## Attach the index, predict, and store the probability together with
      ## the marginal effect (probability times the coefficient of `dis`)
      assign(paste0("partei.mw.prep.", dis, ".", part),
             cbind(get(paste0("partei.mw.prep.", dis, ".", part)), idx))

      assign(paste0("prdat.", dis, ".", part),
             predict(model, newdata = get(paste0("partei.mw.prep.", dis, ".", part))))

      assign(paste0("pr.me.val.", dis, ".", part),
             round(cbind(get(paste0("prdat.", dis, ".", part))[, row.num.part],
                         get(paste0("prdat.", dis, ".", part))[, row.num.part] * coef(model)[dis]),
                   digits = 3))

      assign(paste0("Tabelle.", dis),
             cbind(get(paste0("Tabelle.", dis)),
                   get(paste0("pr.me.val.", dis, ".", part))))
    }

    ## Prepend the variation values as the first column of each table
    assign(paste0("Tabelle.", dis), cbind(variation, get(paste0("Tabelle.", dis))))
  }

  ## Label the columns: "WSK" = probability, "ME" = marginal effect, per party
  nam.wsk <- paste0("WSK ", partei)
  nam.me <- paste0("ME ", partei)
  comb.vec <- character(2 * length(nam.wsk))
  comb.vec1 <- 2L * seq_along(nam.wsk)
  comb.vec[comb.vec1] <- nam.me
  comb.vec[comb.vec1 - 1L] <- nam.wsk
  colname.nam <- c("Distanz", comb.vec)

  ## Note: the hard-coded names below assume
  ## var.dis == c("disLR", "disSOE", "disLA", "disKE")
  colnames(Tabelle.disLR) <- colname.nam
  colnames(Tabelle.disSOE) <- colname.nam
  colnames(Tabelle.disLA) <- colname.nam
  colnames(Tabelle.disKE) <- colname.nam

  rownames(Tabelle.disLR) <- paste0("LR: ", variation)
  rownames(Tabelle.disSOE) <- paste0("SÖ: ", variation)
  rownames(Tabelle.disLA) <- paste0("LA: ", variation)
  rownames(Tabelle.disKE) <- paste0("KE: ", variation)

  fit <- list(LR = Tabelle.disLR, SOE = Tabelle.disSOE,
              LA = Tabelle.disLA, KE = Tabelle.disKE)

  ## Return the list of tables (the posted version returned it only invisibly)
  return(fit)
}
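
A hypothetical call, for orientation only: the hard-coded names near the end of the function imply that vars[1:4] must be c("disLR", "disSOE", "disLA", "disKE") and partei the matching party labels; these names come from the author's own dataset and are placeholders here.

# tabs <- prob_mefx_mlogit(mod,
#                          partei = c("LR", "SOE", "LA", "KE"),
#                          vars = c("disLR", "disSOE", "disLA", "disKE"),  # plus other covariates to hold at their means
#                          variation = seq(0, 10, by = 1))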
vincentarelbundock commented 2 years ago

Hi, I tried to use the prediction function in ggeffects:::get_predictions_mlogit but even that doesn't seem to work on the models shown above.

If you can find or write a predict() function that accepts (1) a mlogit model and (2) a custom data.frame over which we want to make predictions, I can try to make this work in marginaleffects. However, I don't know this specific procedure well enough to design a predict function myself (and I don't have the time to learn it).

But again, if you can supply a predict function in which you are very confident, I'll be happy to work with you to see if I can make it work in marginaleffects.

The other option would be to contact the developers and show that the current predict function is broken, and ask if they might be willing to release a fix.

NorbertSch101 commented 2 years ago

Hey there.

Nice, let us try to figure this out! I will open a repository where we can work on that. Just give me two or three days to organize the repo.

Best,

Norbert

vincentarelbundock commented 2 years ago

@NorbertSch101 FYI, I just emailed the mlogit maintainer. It would be much better if this could be fixed upstream rather than here. And he might show us that we made an obvious mistake...

vincentarelbundock commented 2 years ago

@NorbertSch101 FYI, I have been in touch with the mlogit developer and he seems interested in looking for an upstream fix. I think we should wait a bit before diving too deep into this.

vincentarelbundock commented 2 years ago

I looked into this a bit more and have identified other challenges to implementation:

  1. All the models currently supported are estimated on the same kind of input data: one row per observation. mlogit models have multiple potential input data types, including a special mlogit.data class, and a "long" format with one row per choice per observation (i.e., multiple rows per observation). I would have to
vincentarelbundock commented 2 years ago

@NorbertSch101

FYI, the development version of marginaleffects now includes support for mlogit models. There are a couple of limitations that cannot be worked around without changes upstream:

  1. The newdata argument is not supported
  2. In a model like choice ~ wait + gcost | income + size, the marginal effects will always come up as 0 for wait and gcost, because the predict.mlogit method in the mlogit package does not produce different predictions when we use newdata with different values of the predictors before the vertical bar in the formula (a quick way to check this is sketched after the example below).

Aside from that, things seem to work quite well:

library("AER")
library("mlogit")
library("marginaleffects")
data("TravelMode", package = "AER")

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
mfx <- marginaleffects(mod, variables = c("income", "size"))
summary(mfx)
#> Average marginal effects 
#>   Group   Term    Effect Std. Error z value  Pr(>|z|)      2.5 %    97.5 %
#> 1   air income  0.002786   0.001218  2.2878 0.0221493  0.0003992  0.005172
#> 2   bus income -0.000372   0.001103 -0.3373 0.7358957 -0.0025337  0.001790
#> 3   car income  0.003373   0.001373  2.4561 0.0140448  0.0006814  0.006065
#> 4 train income -0.005787   0.001319 -4.3859 1.155e-05 -0.0083727 -0.003201
#> 5   air   size -0.126475   0.028922 -4.3729 1.226e-05 -0.1831612 -0.069788
#> 6   bus   size  0.011348   0.025867  0.4387 0.6608765 -0.0393502  0.062046
#> 7   car   size  0.045886   0.024754  1.8536 0.0637901 -0.0026319  0.094404
#> 8 train   size  0.069241   0.024784  2.7938 0.0052088  0.0206661  0.117816
#> 
#> Model type:  mlogit 
#> Prediction type:  response
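
A quick way to check limitation 2 on your own installation (a sketch only; the output is not asserted here): change an alternative-specific covariate in newdata and compare the two prediction calls, which should be identical if the limitation applies.

nd_a <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
nd_b <- transform(nd_a, gcost = gcost + 50)  # shift a pre-bar (alternative-specific) covariate
predict(mod, newdata = nd_a)
predict(mod, newdata = nd_b)  # identical to the call above if predict.mlogit ignores the change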
NorbertSch101 commented 2 years ago

Hey Vince,

I worked on that too, but have not finished yet. I tried to implement most of the functions with tidyr.

Due to my lack of experience in building such general functions, I had to experiment a lot. The crucial thing is simply to expand the datagrid function to the long format and manipulate the data in a custom way...

I will write to you again asap...

N

vincentarelbundock commented 2 years ago

Maybe it would be a good idea to let me know a bit more details on what you are working on. For instance, I will not merge code that requires more dependencies (e.g., tidyr), and I think we want to be quite careful about adding stuff that should be in upstream packages. I just wouldn't want you to put in a ton of work on something unless it has a good chance of being accepted and merged eventually.

The problem with newdata is actually not that we need a balanced outcome (that is one problem, but a trivial one). The problem with newdata is that the user may not always supply an individual-choice index that allows us to align the data internally. So we can't know for sure which prediction goes with which row of the original newdata.

NorbertSch101 commented 2 years ago

For sure. Then it is possible to expand the idx variable completely.

vincentarelbundock commented 2 years ago

I'm not sure how that’s possible. For example, this newdata does not have any information about the index, yet it still produces predictions, and we have no way of knowing which individual they refer to:

library("AER")
library("mlogit")
data("TravelMode", package = "AER")

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)

nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]

predict(mod, newdata = nd)
#>       air     train       bus       car 
#> 0.1693768 0.2727905 0.1783145 0.3795182

Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no index in newdata:

nd <- nd[c(2, 1, 3, 4),]
predict(mod, newdata = nd)
#>         air       train         bus         car 
#> 0.911253079 0.001247517 0.027969698 0.059529706

Of course, there is an index stored inside the model object:

mod$model$idx
#> ~~~ indexes ~~~~
#>    individual  mode
#> 1           1   air
#> 2           1 train
#> 3           1   bus
#> 4           1   car
#> 5           2   air
#> 6           2 train
#> 7           2   bus
#> 8           2   car
#> 9           3   air
#> 10          3 train
#> indexes:  1, 2

But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object.
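
One hedged illustration of the alignment discipline this implies for users (an assumption about sensible usage, not a general fix): keep the alternative column alongside newdata and order each individual's rows explicitly to match the air/train/bus/car blocks shown in the index above, before dropping that column for predict().

nd_ok <- TravelMode[1:4, c("mode", "wait", "gcost", "income", "size")]
nd_ok <- nd_ok[match(c("air", "train", "bus", "car"), nd_ok$mode), ]  # enforce within-individual order
predict(mod, newdata = nd_ok[, c("wait", "gcost", "income", "size")])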

NorbertSch101 commented 2 years ago

Hey Vince,

here is a longer reply to your emails.

############################# ####### Your email yesterday #############################

"But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object. Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no index in newdata: nd <- nd[c(2, 1, 3, 4),]"

For sure, you are completely right! It is only allowed to rotate the order of the N's, not the k's within an N. But this is the same for all structured long data, like panel data etc. Users of mlogit should know this: the idx variable is constructed by dfidx or pivot_longer, where you have to assign the columns that identify the observations plus the choice / outcome variable in order to get a long data frame with N*k rows (k = number of alternatives).

Generally it would be possible to decompose the idx of the original data. Perhaps I am too optimistic, but I believe that most applications of marginaleffects involve a model together with its data (via insight::), so the intended calculations are built on the correct data structure, especially when the datagrid function consistently expands the data while strictly respecting the alternative vector, as I have tried to do with the procedures below, which work with altset and altset.vec:

dat <- insight::get_data(model) %>%
  mutate(altset.vec = rep(levels(dat$idx$id2), length = dim(dat)[1]))
altset <- levels(dat$idx$id2)  ### just the categories of the dependent variable y

Below I wrote some lines that make it possible to check this (control for the exact mapping of rows to the order of the choice set). Maybe this could also be done via a customized sanitize function for mlogit (though I believe that won't be necessary).

I really would like to give you a more solid statement, and yesterday I studied your code in detail (predictions, datagrid, marginaleffects). I mostly understood the procedure, but two pieces are missing:

1. I was not able to find the code for prep_datagrid, even though I googled it. Somehow this function constructs a variable 'at', which is used extensively afterwards. Here are the lines I am talking about:

tmp <- prep_datagrid(..., model = model, newdata = newdata)
at <- tmp$at
dat <- tmp$newdata
variables_all <- tmp$all
variables_manual <- tmp$variables_manual
variables_automatic <- tmp$variables_automatic

at -> data.frame
at <- lapply(at, unique)
at <- expand.grid(at, stringsAsFactors = FALSE)

2. I was not able to update the package, nor did a copy of the new functions bring the ability to handle mlogit objects. I copied the sanitize model routine, too. Is there a link to the beta version?

I hope I am not asking silly questions, but the truth is that I am by far not as advanced at coding efficiently as you. I am using GitHub for the first time, and hopefully this evening I will find some time to open a repo and upload some data, so I can give you the data and model of my application.

CU,

Norbert

############################################ ############# Get the formula object to get the independent vars ############################################

Useful for checkmate, sanitize, and expanding datagrid for representative values and counterfactuals. There has to be a check and exclusion of "0", "1", or "-1", ...; I could write that when necessary.

var.exkl.form <- c("-1", "0", "1")

form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]"))
asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "")
is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "")
asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "")
as.vars <- c(asc.vars, asv.vars)
var.vec <- c(asc.vars, is.vars, asv.vars)

############################################ ############# Case 0: evaluation of the dataset / AAP & AME ############################################

Adjusted prediction for every row of the original dataset; the calculation is passed to mlogit.

############################################ ############# Case 1: datagrid() / APM & MEM ############################################

Some comments: calculating APM & MEM only makes sense when the means are calculated conditional on each alternative; otherwise you just study the effect of the intercepts. Generally this is also relevant for case 2, see below.

Case 1.1: Global mean (dat.glob.mean)

dat.glob.mean <- dat %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  slice(rep(1:n(), each = length(altset)))

temp.newdata <- dat.glob.mean

Case 1.2: Mean conditioned on choice

dat.choice.mean <- dat %>%
  group_by(altset.vec) %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  select(!altset.vec)

temp.newdata <- dat.choice.mean

############################################ ############# Case 2: datagrid(vars.control = number.value) ############################################

Some comments: the routine I propose here should be sensitive to the number of alternatives (as an option) and thus replace the value of the regressor variable for each alternative in the following fashion.

Replace wait with 20:

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]

   wait gcost income size
1    20    70     35    1
2    34    71     35    1
3    35    70     35    1
4     0    30     35    1
5    69    70     35    1
6    20    71     35    1
7    35    70     35    1
8     0    30     35    1
9    69    70     35    1
10   34    71     35    1
11   20    70     35    1
12    0    30     35    1
....

Thus calculating the adjusted probabilities or marginal effects would deliver four rows: the probability if wait = 20 for air, if wait = 20 for train, ..., while holding the rest of the data fixed.

Case 2.1.1: Global mean

dat.choice.rep.val.glob.mean <- dat %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  slice(rep(1:n(), each = length(altset))) %>%
  mutate(var.rep.val = rep(number.value, length(altset))) %>%
  sapply(., rep.int, times = length(altset)) %>%
  data.frame(.) %>%
  mutate(altset.vec = as.character(rep(altset, length = dim(.)[1]))) %>%
  group_by(altset.vec) %>%
  nest()

temp.newdata <- NULL
for (i in 1:length(altset)) {
  dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <-
    dat.choice.rep.val.glob.mean[[2]][[i]][i, "var.rep.val"]
  temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]])
}

Case 2.1.2: Mean conditioned on choice

Easily adaptable.

Case 2.1.3: Tukey's five-number summary

Easily adaptable.

Case 2.2: Group of vars to be controlled, combinations of factors, contrasts

Probably adaptable.

Case 2.3: datagrid(..., type = counterfactual)

Probably adaptable by replacing values as in Case 2.1.

############################################ ############# Control for exact mapping of rows to order of choice set ############################################

A check of altset.vec against idx to secure the correct order of newdata would not be very complicated:

idx <- paste0(rep(altset, length = dim(temp.newdata)[1]), ":",
              sort(rep(seq(dim(temp.newdata)[1] / length(altset)),
                       length = dim(temp.newdata)[1])))

############################################ ############# Calculation ############################################

Some comments: in your update I have seen that you pass the computation to the mlogit functions.

newdata <- cbind(idx, temp.altset.vec, temp.newdata)

pr.pred <- predict(model, newdata)
me.pred <- effects(model, newdata, type = 'aa')

############################################ ############# Passing results to other marginaleffects functions ############################################

Plot, contrasts, etc. It must be possible to format the results in the marginaleffects fashion.

vincentarelbundock commented 2 years ago

Are you able to write on Github directly? The formatting via email mixes text and code and makes things very difficult to read.

I have a feeling that you are making things way more complicated than they need to be.

  1. All the functions already work for the original data used to fit the model.
  2. All we need to make the newdata argument work is a function that returns TRUE if the newdata that the user supplies strictly conforms to expectations, and FALSE otherwise. I don't know how to write that function, and I don't think we can assume that every user of mlogit uses insight::get_data.

To use the currently working beta version, install from Github:

library(remotes)
install_github("vincentarelbundock/marginaleffects")

RESTART R completely, and then try this:

library("AER")
library("mlogit")
data("TravelMode", package = "AER")

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
marginaleffects(mod, variables = "income") |> summary()
#> Average marginal effects 
#>   Group   Term    Effect Std. Error z value  Pr(>|z|)      2.5 %    97.5 %
#> 1   air income  0.002786   0.001218  2.2878  0.022149  0.0003992  0.005172
#> 2   bus income -0.000372   0.001103 -0.3373  0.735896 -0.0025337  0.001790
#> 3   car income  0.003373   0.001373  2.4561  0.014045  0.0006814  0.006065
#> 4 train income -0.005787   0.001319 -4.3859 1.155e-05 -0.0083727 -0.003201
#> 
#> Model type:  mlogit 
#> Prediction type:  response

I temporarily removed the block on newdata to illustrate that things almost work already. Here are the marginal effects and adjusted predictions at the mean:

nd <- datagrid(mode = TravelMode$mode, newdata = TravelMode)
marginaleffects(mod, newdata = nd, variables = "income")
#>   rowid     type group   term          dydx    std.error  statistic     p.value      conf.low     conf.high individual choice     wait    vcost   travel    gcost
#> 1     1 response   air income  6.656530e-03 2.426455e-03  2.7433152 0.006082228  1.900766e-03  1.141229e-02          1     no 34.58929 47.76071 486.1655 110.8798
#> 2     2 response train income -5.521732e-03 1.910701e-03 -2.8898981 0.003853667 -9.266638e-03 -1.776826e-03          1     no 34.58929 47.76071 486.1655 110.8798
#> 3     3 response   bus income -1.141280e-03 9.454854e-04 -1.2070837 0.227399906 -2.994397e-03  7.118373e-04          1     no 34.58929 47.76071 486.1655 110.8798
#> 4     4 response   car income  6.482222e-06 2.031464e-05  0.3190912 0.749657342 -3.333374e-05  4.629818e-05          1     no 34.58929 47.76071 486.1655 110.8798
#>     income     size  mode
#> 1 34.54762 1.742857   air
#> 2 34.54762 1.742857 train
#> 3 34.54762 1.742857   bus
#> 4 34.54762 1.742857   car

predictions(mod, newdata = nd)
#>   rowid     type group   predicted   std.error individual choice     wait    vcost   travel    gcost   income     size  mode
#> 1     1 response   air 0.831762528 0.053985678          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   air
#> 2     2 response train 0.107734987 0.037727504          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 3     3 response   bus 0.058853854 0.021442837          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   bus
#> 4     4 response   car 0.001648631 0.001116992          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   car

Again, all we would need is a single function that can be 100% sure that an arbitrary data frame supplied to the newdata argument conforms to our specifications.
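
For what it is worth, a rough sketch of what such a conformance check could look like (an illustration of the requirement stated above, not code from marginaleffects): it only accepts a newdata whose row count is a multiple of the number of alternatives and whose alternative column cycles through the levels in a fixed order, so predictions can be mapped back to rows.

check_mlogit_newdata <- function(newdata, alt_col, alt_levels) {
  K <- length(alt_levels)
  if (!alt_col %in% names(newdata)) return(FALSE)
  if (nrow(newdata) == 0 || nrow(newdata) %% K != 0) return(FALSE)
  # every consecutive block of K rows must list the alternatives in the same order
  blocks <- matrix(as.character(newdata[[alt_col]]), nrow = K)
  all(apply(blocks, 2, identical, y = as.character(alt_levels)))
}

# hypothetical usage:
# check_mlogit_newdata(nd, alt_col = "mode", alt_levels = c("air", "train", "bus", "car"))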

NorbertSch101 commented 2 years ago

Hey Vince, here is my reply to your post from yesterday. At the beginning, a short comment on your question. Below, I used the dataset and model to illustrate the preliminary code from yesterday. Please also comment on the rest of the application. Sorry for the different type style; why does this happen, even though I did not mark anything as bold or highlight certain statements...

Best,

Norbert

library("AER") library("mlogit") library(marginaleffects) library(stringi) library(tidyverse) library(texreg) library(ggeffects)

data("TravelMode", package = "AER") mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) summary(mod) marginaleffects(mod, variables = "income") |> summary()

nd <- datagrid(mode = TravelMode$mode, newdata = TravelMode) marginaleffects(mod, newdata = nd, variables = "income") predictions(mod, newdata = nd)

Again, all we would need is a single function that can be 100% sure that an arbitrary data frame supplied to the newdata argument conforms to our specifications.

If someone uses the alternative variable (here in the data: "mode"), the expansion function automatically does this. Either a warning via an if condition ("marginaleffects for mlogit requires the dependent variable", when the condition is FALSE) or an automatic inclusion of the altset (as I inserted in the code from yesterday) will definitely solve this...

########

dat <- insight::get_data(mod)
dat <- dat %>% mutate(altset.vec = rep(levels(dat$idx$mode), length = dim(dat)[1]))
altset <- levels(dat$idx$mode)  ### just the categories of the dependent variable y
### identifier missing for id2

form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]"))
asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "")
is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "")
asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "")
asv.vars <- NULL
as.vars <- c(asc.vars, asv.vars)
var.vec <- c(asc.vars, is.vars, asv.vars)

Case 1.1: Global mean (dat.glob.mean is your nd)

dat.glob.mean <- dat %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  slice(rep(1:n(), each = length(altset)))

marginaleffects(mod, newdata = nd, variables = "income")
marginaleffects(mod, newdata = dat.glob.mean, variables = "income")

BUT: as we see, cost and wait are averaged over all alternatives. A high-cost alternative like air would affect the costs of the others. This could be critical / implausible in applications...

Case 1.2: Mean conditioned on choice

This also works:

dat.choice.mean <- dat %>%
  group_by(altset.vec) %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  select(!altset.vec)
dat.choice.mean

Since there is no variation in the individual-specific variables, their means remain the same, but the cost and wait values associated with each alternative are averaged separately.

marginaleffects(mod, newdata = dat.choice.mean, variables = "income")

############################################ ############# Case 2: datagrid(vars.control = number.value) ############################################

How does datagrid behave if I set the individual-specific variable income to 50?

number.value <- 50
nd1 <- datagrid(mode = TravelMode$mode, income = number.value, newdata = TravelMode)
marginaleffects(mod, newdata = nd1, variables = "income")

Works fine: since income is an individual-specific variable, there should be no variation within the alternative.

How does datagrid behave if I want to set the alternative-specific variable cost to 30 for train?

number.value <- 30
nd2 <- datagrid(mode = TravelMode$mode, gcost = c(30), newdata = TravelMode)
marginaleffects(mod, newdata = nd2, variables = "income")

The margins are computed with 30 for all alternatives.

However, it could be interesting when only one alternative is affected by a cost reduction. In my application, for instance, it is also interesting to see what happens if the regressor takes a defined value alternative-wise, while the values for the other alternatives are held at their mean:

vars.control <- c("gcost")

dat.choice.rep.val.glob.mean <- dat %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  slice(rep(1:n(), each = length(altset))) %>%
  mutate(var.rep.val = rep(number.value, length(altset))) %>%
  sapply(., rep.int, times = length(altset)) %>%
  data.frame(.) %>%
  mutate(altset.vec = as.character(rep(altset, length = dim(.)[1]))) %>%
  group_by(altset.vec) %>%
  nest()

temp.newdata <- NULL
for (i in 1:length(altset)) {
  dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <-
    dat.choice.rep.val.glob.mean[[2]][[i]][i, "var.rep.val"]
  temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]])
}

Final preparation with idx and a vector for the altset, to see whether the data manipulation works in the expected way:

temp.newdata <- temp.newdata %>%
  mutate(altset.vec.nd = rep(altset, length = dim(temp.newdata)[1]))

idx.nd <- paste0(rep(altset, length = dim(temp.newdata)[1]), ":",
                 sort(rep(seq(dim(temp.newdata)[1] / length(altset)),
                          length = dim(temp.newdata)[1])))

nd3 <- cbind(idx.nd, temp.newdata)

gcost now varies for each alternative, while the others are fixed at the global mean:

marginaleffects(mod, newdata = nd3, variables = vars.control)

Due to the nonlinearity the margins are also affected (plus the IIA restriction).

Adapting the mean procedure to the computation with alternative-wise means:

dat.choice.rep.val.choice.mean <- dat %>%
  group_by(altset.vec) %>%
  summarise(across(c(paste(var.vec)), mean)) %>%
  select(!altset.vec) %>%
  mutate(var.rep.val = rep(number.value, length(altset))) %>%
  sapply(., rep.int, times = length(altset)) %>%
  data.frame(.) %>%
  mutate(altset.vec = as.character(rep(altset, length = dim(.)[1]))) %>%
  group_by(altset.vec) %>%
  nest()

temp.newdata <- NULL
for (i in 1:length(altset)) {
  dat.choice.rep.val.choice.mean[[2]][[i]][i, paste(vars.control)] <-
    dat.choice.rep.val.choice.mean[[2]][[i]][i, "var.rep.val"]
  temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.choice.mean[[2]][[i]])
}

temp.newdata <- temp.newdata %>%
  mutate(altset.vec.nd = rep(altset, length = dim(temp.newdata)[1]))

idx.nd <- paste0(rep(altset, length = dim(temp.newdata)[1]), ":",
                 sort(rep(seq(dim(temp.newdata)[1] / length(altset)),
                          length = dim(temp.newdata)[1])))

nd4 <- cbind(idx.nd, temp.newdata)

gcost now varies for each alternative, while the others are fixed at the alternative-wise mean:

marginaleffects(mod, newdata = nd4, variables = vars.control)

Due to the nonlinearity the margins are also affected (plus the IIA restriction).

vincentarelbundock commented 2 years ago

For formatting, please see this link:

https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code

In particular, note that (1) code blocks must be preceded AND followed by lines with three back-ticks, (2) hashtags denote sections in the markdown syntax.

At a high level, this is what I got from your last post:

  1. The newdata argument works exactly as intended, and allows you to compute the quantities you need.
  2. The datagrid() function does not produce the types of data frames that you think would be useful for mlogit models.

I don’t want to be too glib, but maybe you should just avoid datagrid… That function was conceived as a shortcut to construct a specific type of dataset that is useful for many models: 60 of the 61 model types currently supported by marginaleffects, with the possible exception of mlogit.

But the whole point of the interface design is that newdata can accept arbitrary data frames constructed by users. There is nothing forcing you to use datagrid() to build those data frames.

In fact, all the examples you give in your post seem trivial to build. And since they are so trivial, maybe the best course of action is not to add a ton of code complexity to marginaleffects, but instead to write an example vignette, showing super simple code that replicates some of your examples:

library("AER")
library("mlogit")
library("tidyverse")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)

# Case 1.1: Global mean
TravelMode %>%
    summarize(across(c("wait", "gcost", "income", "size"),
              function(x) rep(mean(x), 4)))
#>       wait    gcost   income     size
#> 1 34.58929 110.8798 34.54762 1.742857
#> 2 34.58929 110.8798 34.54762 1.742857
#> 3 34.58929 110.8798 34.54762 1.742857
#> 4 34.58929 110.8798 34.54762 1.742857

# Case 1.2: Mean conditioned on choice
TravelMode %>%
    group_by(mode) %>%
    summarize(across(c("wait", "gcost", "income", "size"), mean))
#> # A tibble: 4 × 5
#>   mode   wait gcost income  size
#>   <fct> <dbl> <dbl>  <dbl> <dbl>
#> 1 air    61.0 103.    34.5  1.74
#> 2 train  35.7 130.    34.5  1.74
#> 3 bus    41.7 115.    34.5  1.74
#> 4 car     0    95.4   34.5  1.74

# Only one alternative is affected by cost reduction
nd3 <- datagrid(mode = TravelMode$mode, newdata = TravelMode)
nd3 <- lapply(1:4, function(i) mutate(nd3, gcost = ifelse(1:4 == i, 30, gcost)))
bind_rows(nd3)
#>    individual choice     wait    vcost   travel    gcost   income     size  mode
#> 1           1     no 34.58929 47.76071 486.1655  30.0000 34.54762 1.742857   air
#> 2           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 3           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   bus
#> 4           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   car
#> 5           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   air
#> 6           1     no 34.58929 47.76071 486.1655  30.0000 34.54762 1.742857 train
#> 7           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   bus
#> 8           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   car
#> 9           1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   air
#> 10          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 11          1     no 34.58929 47.76071 486.1655  30.0000 34.54762 1.742857   bus
#> 12          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   car
#> 13          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   air
#> 14          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 15          1     no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857   bus
#> 16          1     no 34.58929 47.76071 486.1655  30.0000 34.54762 1.742857   car
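
These hand-built grids can then be passed straight to the prediction functions (assuming the development version above, where the newdata block is temporarily lifted), for example:

# predictions(mod, newdata = bind_rows(nd3))
# marginaleffects(mod, newdata = bind_rows(nd3), variables = "income")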
NorbertSch101 commented 2 years ago

Hey Vince,

thanks for your pointer on formatting; I will keep that in mind. Secondly, it was helpful to see how you code the setup of the newdata. I will use functionals more often than before, and I am updating my R programming skills with Advanced R (H. Wickham). I learned R autodidactically and took a break over the last two years, in which a lot of progress has happened (tidyverse, purrr...). Worth exploiting...

You propose to bypass the datagrid issue with an example vignette. Surely, this might be an option, even though my impression is that, with small refinements, datagrid would also work for mlogit objects if the expansion strictly relies on the idx of the mlogit object. As I wrote, I was not able to understand every step of the datagrid function, because I could not find prep_datagrid on your GitHub, in R, or on the web. So I only had the impression that fully incorporating mlogit into the marginaleffects workflow would not go substantially beyond the existing code design. In my opinion, the more models share the same procedure and workflow, the better. Maybe there can be a mixed strategy: vignette in the short run, full implementation in the long run.

In the end it is up to you. N

vincentarelbundock commented 2 years ago

Yeah, learning R feels like a life long endeavor. The good news is that it's fun, isn't it?!

I added a short vignette with some of the examples you supplied: https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html

Will keep thinking about this. Feel free to open an issue on Github if you have specific feature requests for datagrid().

NorbertSch101 commented 2 years ago

Yup, d'accord. I love doing data analysis with R very much! Moreover, the whole project is a wonderful thing: open source, powerful and stable, developed by many people who take on the effort without monetary reward. Isn't it amazing? We are talking about collective action on a public good. What a pity that Mancur Olson missed this...

Specific feature requests for datagrid... I am not sure what you mean exactly: my question regarding prep_datagrid? I will do so, but before I open an issue, I had better check whether I understood you correctly. Bye

vincentarelbundock commented 2 years ago

I guess I'd like to know exactly what is the structure of all the datasets you would like to create with datagrid(). Can't guarantee I'll be able to make them all, but we can try...

NorbertSch101 commented 2 years ago

Hey,

You mean Yves Croissant? Nice! It will be interesting to see how he implements a way of dealing with row-wise deletion within the choice sets.

However, my major problem in calculating marginal effects has been linked to newdata. Especially the data preparation, built from means for some independent variables while varying other independent variables of interest, and mapping the rows of the prepared data to the correct idx, needed a custom-made solution. My impression was that this goes beyond the specific case of my research questions.

I will check whether the routine in your package works in the necessary way for mlogit, but as I remember, I tried code snippets from marginaleffects too. I think I can figure something out on Tuesday and will let you know.

Have a nice weekend,

Norbert

NorbertSch101 commented 2 years ago

Ah, I just saw that you are in political science. My application basically deals with voting behavior data, spatial modeling, and so on.



From: Norbert Schoening | Sent: Wednesday, May 4, 2022 5:59 PM | Subject: Re: [vincentarelbundock/marginaleffects] Support: mlogit (Issue #159)

Hey Vince,

here is a longer reply to Your emails.

############# Your Email yesterday: #############

But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user.

And I was unable to find where the column names for the original index are stored in the model object.

Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no index in newdata:

nd <- nd[c(2, 1, 3, 4),]

For sure, You are completely right! It is only allowed to rotate the order of the N's, but not the k's within N. But this is the same for all structured long data, like panel data etc. Users of mlogit should know that.

The idx variable is constructed by dfidx or pivot_longer: there you have to assign the columns for identification of the observations plus the choice / outcome variable, in order to yield a long data frame with N*k rows (k = number of alternatives).
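For example, with the TravelMode data (a minimal sketch, using the dfidx package that mlogit relies on for its index):

library(dfidx)
data("TravelMode", package = "AER")

# assign the id columns (individual, mode) plus the outcome column when
# reshaping, so that every individual contributes k rows, one per alternative
TM <- dfidx(TravelMode, idx = c("individual", "mode"), choice = "choice")
head(idx(TM))   # the individual x alternative index that mlogit carries along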

Generally it would be possible to decompose the idx of the original data. Perhaps I am too optimistic, but I believe that most applications of marginaleffects involve a model together with its data (via insight::get_data), so the intended calculations are built on the correct data structure, especially when the datagrid function consistently expands the data by strictly respecting the alternative vector, as I have tried to do with the procedures below, which work with altset and altset.vec.

dat <- insight::get_data(model) %>%
    mutate(altset.vec = rep(levels(dat$idx$id2), length = dim(dat)[1]))
altset <- levels(dat$idx$id2)   ### Just categories of depvar y

Below I wrote some lines that make it possible to check this (control for the exact mapping of rows to the order of the choice set). Maybe this could also be done via a customized sanitize function for mlogit (though I believe that won't be necessary).
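Such a check could be as simple as something like this (just a sketch; check_mlogit_newdata is a made-up name, and altset.vec refers to the column constructed in the code further down, nothing official):

# hypothetical helper, not part of mlogit or marginaleffects
check_mlogit_newdata <- function(newdata, alts) {
    if (nrow(newdata) %% length(alts) != 0) {
        stop("newdata must contain a multiple of ", length(alts), " rows.")
    }
    expected <- rep(alts, times = nrow(newdata) / length(alts))
    if (!identical(as.character(newdata$altset.vec), as.character(expected))) {
        stop("rows of newdata are not ordered by the choice set within each observation.")
    }
    invisible(TRUE)
}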

I really would like to give You a more solid statement, and yesterday I studied Your code in detail (predictions, datagrid, marginaleffects). I mostly understood the procedure, but there are two pieces missing:

1. I was not able to find the code for prep_datagrid, even though I googled it. Somehow this function constructs a variable 'at', which is used extensively in what follows. Here are the lines I am talking about:

tmp <- prep_datagrid(..., model = model, newdata = newdata)
at <- tmp$at
dat <- tmp$newdata
variables_all <- tmp$all
variables_manual <- tmp$variables_manual
variables_automatic <- tmp$variables_automatic

# at -> data.frame
at <- lapply(at, unique)
at <- expand.grid(at, stringsAsFactors = FALSE)

2. I was not able to update the package, and a copy of the new functions did not give me the ability to handle mlogit objects either. I copied the sanitize model routine too. Is there a link to the beta version?

I hope I am not asking silly questions, but the truth is that I am by far not as advanced at coding efficiently as You.

I am using GitHub for the first time, and hopefully this evening I will find some time to open a repository and upload some data, so that I can give You the data and the model of my application.

CU,

Norbert

############# Get the formula object to get independent vars #############

Useful for checkmate, sanitize, and for expanding datagrid to representative values and counterfactuals.

There has to be a check and exclusion of "0", "1", or "-1"; I could write that when necessary.

var.exkl.form <- c("-1", "0", "1")

# str_replace_all() below comes from the stringr package
form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]"))
asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "")
is.vars  <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "")
asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "")
as.vars  <- c(asc.vars, asv.vars)
var.vec  <- c(asc.vars, is.vars, asv.vars)

############# Case 0: evaluation of dataset / AAP & AME #############

Adjusted predictions for every row of the original dataset; the calculation is passed to mlogit.
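Roughly like this, if I remember the fitted() method of mlogit correctly (sketch only):

library(mlogit)
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)

pr <- fitted(mod, outcome = FALSE)   # one probability per observation per alternative
head(pr)
colMeans(pr)                         # average adjusted prediction per alternative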

############# Case 1: datagrid() / APM & MEM #############

###############
Some comments:

Actually, calculating APM & MEM only makes sense when the means are computed conditional on each alternative. Otherwise You just study the effect of the intercepts. Generally this is also relevant for Case 2, see below.
###############

Case 1.1: Global mean

dat.glob.mean <- dat %>%
    summarise(across(c(paste(var.vec)), mean)) %>%
    slice(rep(1:n(), each = length(altset)))

temp.newdata <- dat.glob.mean

Case 1.2: Mean conditioned on choice

dat.choice.mean <- dat %>%
    group_by(altset.vec) %>%
    summarise(across(c(paste(var.vec)), mean)) %>%
    select(!altset.vec)

temp.newdata <- dat.choice.mean

############# Case 2: datagrid(vars.control = number.value) #############

###############
Some comments:

The routine I propose here should be sensitive to the number of alternatives (as an option) and thus replace the value of the regressor variable for each alternative in turn, in the following fashion:
###############

# replace wait with 20

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]

   wait gcost income size
1    20    70     35    1
2    34    71     35    1
3    35    70     35    1
4     0    30     35    1
5    69    70     35    1
6    20    71     35    1
7    35    70     35    1
8     0    30     35    1
9    69    70     35    1
10   34    71     35    1
11   20    70     35    1
12    0    30     35    1
....

Thus, calculating the adjusted probabilities or marginal effects would deliver four rows: the probability if wait for air = 20, if wait for train = 20, and so on, while holding the rest of the data fixed.
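In plain code, the expansion above could be built along these lines (a base R sketch; the value 20 and the wait variable are purely illustrative):

data("TravelMode", package = "AER")
alts <- levels(TravelMode$mode)                                      # air, train, bus, car
base_block <- TravelMode[1:4, c("wait", "gcost", "income", "size")]  # one full choice set

# one copy of the choice set per alternative, with wait set to 20
# for a different alternative in each copy
blocks <- lapply(seq_along(alts), function(i) {
    b <- base_block
    b$wait[i] <- 20
    b
})
nd_expanded <- do.call(rbind, blocks)   # 4 blocks x 4 rows = 16 rows
nd_expanded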

Case 2.1.1: Global mean

dat.choice.rep.val.glob.mean <- dat %>%
    summarise(across(c(paste(var.vec)), mean)) %>%
    slice(rep(1:n(), each = length(altset))) %>%
    mutate(var.rep.val = rep(number.value, length(altset))) %>%
    sapply(., rep.int, times = length(altset)) %>%
    data.frame(.) %>%
    mutate(altset.vec = as.character(rep(altset, length = dim(.)[1]))) %>%
    group_by(altset.vec) %>%
    nest()

temp.newdata <- NULL
for (i in 1:length(altset)) {
    dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <-
        dat.choice.rep.val.glob.mean[[2]][[i]][i, "var.rep.val"]
    temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]])
}

Case 2.1.2: Mean conditioned on choice

easily adaptable

Case 2.1.3: Tukey's five-number summary

easily adaptable

Case 2.2: Group of vars to be controlled, combination of factors, contrasts

probably adaptable

Case 2.3: datagrid(..., type=counterfactual)

probably adaptable by replacing values in the same way as in Case 2.1

############# Control for exact mapping of rows to order of choice set #############

A check of altset.vec against idx to ensure the correct order of newdata would not be very complicated:

idx <- paste0(
    rep(altset, length = dim(temp.newdata)[1]), ":",
    sort(rep(seq(dim(temp.newdata)[1] / length(altset)),
             length = dim(temp.newdata)[1])))
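The actual comparison is then just one line (sketch; it assumes temp.newdata still carries the altset.vec column built above):

# the alternative label implied by idx must match altset.vec row by row
stopifnot(identical(sub(":.*$", "", idx),
                    as.character(temp.newdata$altset.vec)))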

############# Calculation #############

###############
Some comments:

In Your update I have seen that You pass the computation to the mlogit functions.
###############

newdata <- cbind(idx, temp.altset.vec, temp.newdata)

pr.pred <- predict(model, newdata)
me.pred <- effects(model, newdata, type = 'aa')

############# Passing results to other marginaleffects functions #############

Plot, contrasts, etc.: it must be possible to format the results in the marginaleffects fashion.
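For instance, the probability matrix from predict() could be reshaped into one row per observation per alternative along these lines (a sketch; the rowid / group / estimate names are only placeholders, not necessarily what marginaleffects uses internally):

# pr.pred: N x J matrix from predict(model, newdata), columns = alternatives
pr.long <- data.frame(
    rowid    = rep(seq_len(nrow(pr.pred)), times = ncol(pr.pred)),
    group    = rep(colnames(pr.pred), each = nrow(pr.pred)),
    estimate = as.vector(pr.pred))
head(pr.long)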

On Tue., May 3, 2022, at 21:46, Vincent Arel-Bundock wrote:

I'm not sure how that’s possible. For example, this newdata does not have any information about the index, yet it still produces predictions, and we have no way of knowing which individual they refer to:

library("AER")

library("mlogit")

data("TravelMode", package = "AER")

mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)

nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]

predict(mod, newdata = nd)

> air train bus car

> 0.1693768 0.2727905 0.1783145 0.3795182

Moreover, if we just change the row ordering, we get completely different results, and we have no way to verify the row order, since there is no index in newdata:

nd <- nd[c(2, 1, 3, 4),]
predict(mod, newdata = nd)
>         air       train         bus         car
> 0.911253079 0.001247517 0.027969698 0.059529706

Of course, there is an index stored inside the model object:

mod$model$idx

> ~ indexes ~~
>    individual  mode
> 1           1   air
> 2           1 train
> 3           1   bus
> 4           1   car
> 5           2   air
> 6           2 train
> 7           2   bus
> 8           2   car
> 9           3   air
> 10          3 train
> indexes: 1, 2

But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object.


WITT101 commented 1 year ago

Hi guys, did anything come of this? I.e., is there a way to predict long-formatted data using the predict() function?

Thanks

vincentarelbundock commented 1 year ago

@WITT101 did you read this vignette?

https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html#mlogit-package

If that did not answer your question, can you show me a VERY SIMPLE model using a SMALL public dataset in long format on which you would like predictions?

vincentarelbundock commented 1 year ago

What is the code you used to estimate a model on these data?

vincentarelbundock commented 1 year ago

Your sample data is not "balanced" (see the vignette). On my computer it looks like the predictions() function works fine with balanced data:

library(mlogit)
library(marginaleffects)

# "example_predict.csv" is the sample file shared earlier in this thread
ex <- read.csv("example_predict.csv")

# build a balanced dataset: every chid appears once for every alternative
dat <- expand.grid(
    chid.var = unique(ex$chid.var),
    alt.var = unique(ex$alt.var))
dat$choice.id <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE)
dat$IV_1 <- sample(ex$IV_1, nrow(dat), replace = TRUE)
dat$IV_2 <- sample(ex$IV_2, nrow(dat), replace = TRUE)
dat$IV_3 <- sample(ex$IV_3, nrow(dat), replace = TRUE)

# reshape to the long format that mlogit expects
new.data <- mlogit.data(
    data = dat,
    choice = "choice.id",
    chid.var = "chid.var",
    alt.var = "alt.var",
    shape = "long")

mod <- mlogit(
    choice.id ~ IV_1 + IV_2 + IV_3 | 0 | 0,
    data = new.data)

predictions(mod)
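From there you can summarize the output like any data frame, for example (the estimate and group column names below refer to the current predictions() output format; older versions may name these columns differently):

# average predicted probability for each alternative
p <- predictions(mod)
aggregate(estimate ~ group, data = p, FUN = mean)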
DrJerryTAO commented 1 year ago

Hi @vincentarelbundock, glad to know that mlogit is also supported. I used it for commute mode choice modeling. I noticed that you composed a brilliant tutorial at https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html, but it is not yet indexed in the Articles page https://vincentarelbundock.github.io/marginaleffects/articles/index.html

vincentarelbundock commented 1 year ago

It's indexed as "Categorical outcomes"