Closed vincentarelbundock closed 2 years ago
Hi,
I just wanted to open an issue for asking You if the marginaleffects functions could be adpated to the mlogit model objects (MNL, Nisted-, Heteroscedastic, RE-.MNL, Probit). However, I saw this post and now I guess that that has been already approached.
Obviously I had luck and the predict function is doing what it is expected to do. If You estimate and predict the model with the full Travel-Mode Data, everything is fine. But due to your (alternativewise) deletion the choice set is unbalanced for the first three individuals. So the predict function seemingly does not work, when the choice set is unbalanced.
Maybe the fitted function is useful and allows an adaption for marginaleffects?
I work on Missing-Data currently and just tried predict and fitted by using my own datasets, deletion procedures and Discrete-Choice Models. Here it is the same: While predict does not work on my alternativewise deleted choice data, fitted does.
Best regard,
N
Thanks for the comment and investigation. Would you be able to show me a replicable example of this? Everything I tried fails:
library(mlogit)
library(AER)
data("TravelMode", package = "AER")
dat <- TravelMode[-c(2, 7, 9),]
mod <- mlogit(choice ~ wait + gcost | income + size, dat)
dat2 <- dat
dat2$gcost <- dat2$gcost + 10
predict(mod)
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0
predict(mod, newdata = dat)
#> Error in predict.mlogit(mod, newdata = dat): the number of rows of the data.frame should be a multiple of the number of alternatives
predict(mod, newdata = dat2)
#> Error in predict.mlogit(mod, newdata = dat2): the number of rows of the data.frame should be a multiple of the number of alternatives
Hey!
I do not know it precisely but I think that mlogit is widely used and it might be a good move if Your package is able to deal with mlogit objects. Thus if You are interested to integrate mlogit into marginaleffects I would appreciate it if we can work together on that. Although I am new on github, I work for three years with mlogit so I have some experience on how the package works and the objects are organized.
The package ggeffect has already a predict routine for the mlogit package. This routine relies on the _datagrid function which organizes a data matrix to yield statistics for some variables to hold constant (at the mean for example) and include variation for those variables one is interested to investigate the marginal effects. Yet the ggeffect predict function does not work, because the required data_grid function fails to manage the data matrix in a proper way for mlogit. If You are interested I can upload a workspace and the script for demonstration purposes. I have already written a function on my own to estimate marginal effects on newdata. It works generally in that way that I define a set of vars, for which I am interested in obtaining marginal effects for a defined variation of their scale values while holding other vars on their mean. I copied the self written function. Though suited for my specific analyses it could be more generally defined.
Regarding Your question the error has to do with the issues of my last message. mlogit is using even for cross section data the long format and this has implications for both estimation and prediction. The long format results from the individuals n=1,..,N and the Choice set k=1,...,K and typically the data matrix contains the number of rows RN=N*K.
There is slightly a difference between the data handling for estimation and prediction: If and only if for every individual there exists a chosen alternative, mlogit is able to estimate models on data matrices smaller than RN. But this does not work for the predict function. Here the full number of rows RN is strictly demanded.
On the one hand this seems to be strange, because for an individual the probabilitiy of alternative k represent the relative share of the exponentiated value X_beta to the sum of the exponentiated X_beta for all alternatives and could be calculated for Choice Sets smaller than K. From this point of view easily a self written function could be installed which calculate in the first the linear predictions for all individuals even with missing alternatives and in the second steps the probabilities.
On the other hand this could substantially introduce bias, especially in small data sets: If You imagine a scenario in which someone wants to predict probs where one half of the has the full choice set (e.g. 10 alternatives) while the other half has only two alternatives. These two alternatives may have in full choice sets small probs but logically in the data part with missing alternatives not. Moreover, how would one calculate Average Marginal Effects? Assigning zeros for the missing alternatives and reduce the average for the values of the full Choice Sets? I assume that this is the reason for the behavior of the predict function.
Kind regards,
Norbert
prob_mefx_mlogit <- function(model, partei, vars, variation) {
##########################################
##########################################
var.dis <- vars[1:4]
var.vec <- vars
dat <- insight::get_data(model)
idx <- paste0(rep(levels(dat$idx$id2), length(variation)), ":",
sort(rep(variation, length(partei))) )
partei.vec <- rep(partei, length=dim(dat)[1])
dat <- cbind(dat, partei.vec)
##########################################
##########################################
partei.mw <- dat %>%
group_by(partei.vec) %>%
summarize_at(vars(var.vec),mean) %>%
select(!partei.vec)
##########################################
##########################################
##########################################
##########################################
for (dis in var.dis){
col.num.var <- which(colnames(partei.mw)==dis)
assign(paste0("partei.mw.prep.", dis) , NULL )
assign(paste0("Tabelle.", dis) , NULL )
##########################################
##########################################
for (part in partei){
row.num.part <- which(partei==part)
assign(paste0(paste0("partei.mw.prep.", dis), ".", part) , NULL )
##########################################
##########################################
for (val in variation){
temp <- partei.mw
temp[row.num.part, col.num.var] <- val
assign(paste0(paste0("partei.mw.prep.", dis), ".", part) ,
rbind(get(paste0(paste0("partei.mw.prep.", dis), ".", part)),temp ) )
}
##########################################
##########################################
assign(paste0(paste0("partei.mw.prep.", dis), ".", part),
cbind(get(paste0(paste0("partei.mw.prep.", dis), ".", part)), idx ) )
assign(paste0(paste0("prdat.", dis), ".", part),
predict(model, newdata= get(paste0(paste0("partei.mw.prep.", dis), ".", part)) ) )
assign(paste0(paste0("pr.me.val.", dis), ".", part),
round(cbind( get(paste0(paste0("prdat.", dis), ".", part))[,row.num.part],
get(paste0(paste0("prdat.", dis), ".", part))[,row.num.part]*coef(model)[dis])
, digits=3) )
assign(paste0("Tabelle.", dis),
cbind(get(paste0("Tabelle.", dis)),
get(paste0(paste0("pr.me.val.", dis), ".", part))) )
}
##########################################
##########################################
assign(paste0("Tabelle.", dis), cbind(variation, get(paste0("Tabelle.", dis)) ) )
##########################################
##########################################
}
##########################################
##########################################
##########################################
##########################################
nam.wsk = paste0("WSK ", partei)
nam.me =paste0("ME ", partei)
comb.vec <- integer(2*length(nam.wsk))
comb.vec1 <- 2L * seq_along(nam.wsk)
comb.vec[comb.vec1] <- nam.me
comb.vec[comb.vec1 - 1L] <- nam.wsk
colname.nam <- c("Distanz", comb.vec)
colnames(Tabelle.disLR) <- colname.nam
colnames(Tabelle.disSOE) <- colname.nam
colnames(Tabelle.disLA) <- colname.nam
colnames(Tabelle.disKE) <- colname.nam
rowname.lr <- paste0("LR: ", variation)
rowname.soe <- paste0("SÖ: ", variation)
rowname.la <- paste0("LA: ", variation)
rowname.ke <- paste0("KE: ", variation)
rownames(Tabelle.disLR) <- rowname.lr
rownames(Tabelle.disSOE) <- rowname.soe
rownames(Tabelle.disLA) <- rowname.la
rownames(Tabelle.disKE) <- rowname.ke
fit <- list(LR = Tabelle.disLR, SOE = Tabelle.disSOE,
LA = Tabelle.disLA, KE = Tabelle.disKE)
##############
}
Hi, I tried to use the prediction function in ggeffects:::get_predictions_mlogit
but even that doesn't seem to work on the models shown above.
If you can find or write a predict()
function that accepts (1) a mlogit
model and (2) a custom data.frame over which we want to make predictions, I can try to make this work in marginaleffects
. However, I don't know this specific procedure well enough to design a predict
function myself (and I don't have the time to learn it).
But again, if you can supply a predict function in which you are very confident, I'll be happy to work with you to see if I can make it work in marginaleffects
.
The other option would be to contact the developers and show that the current predict
function is broken, and ask if they might be willing to release a fix.
Hey there.
Nice, let us try to figure this out! I will open a repository where we can work on that. Just give me two or three days to organize the repos
Best,
Norbert
Am Di., 19. Apr. 2022 um 22:34 Uhr schrieb Vincent Arel-Bundock < @.***>:
Hi, I tried to use the prediction function in ggeffects:::get_predictions_mlogit but even that doesn't seem to work on the models shown above.
If you can find or write a predict() function that accepts (1) a mlogit model and (2) a custom data.frame over which we want to make predictions, I can try to make this work in marginaleffects. However, I don't know this specific procedure well enough to design a predict function myself (and I don't have the time to learn it).
But again, if you can supply a predict function in which you are very confident, I'll be happy to work with you to see if I can make it work in marginaleffects.
The other option would be to contact the developers and show that the current predict function is broken, and ask if they might be willing to release a fix.
— Reply to this email directly, view it on GitHub https://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1103121024, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYXRKOOJTWWWRPSNNTX2LCTVF4KFTANCNFSM5KNZIWNQ . You are receiving this because you commented.Message ID: @.***>
@NorbertSch101 FYI, I just emailed the mlogit
maintainer. It would be much better if this could be fixed upstream than here. And he might show us that we made an obvious mistake...
@NorbertSch101 FYI, I have been in touch with the mlogit
developer and he seems interested in looking for an upstream fix. I think we should wait a bit before diving too deep into this.
I looked into this a bit more and have identified other challenges to implementation:
mlogit
models have multiple potential input data types, including a special mlogit.data
class, and a "long" format with one row per choice/observation (i.e., multiple rows per observation). I would have to @NorbertSch101
FYI, the development version of marginaleffects
now includes support for mlogit
models. There are a couple limitations that cannot be worked around without changes upstream:
newdata
argument is not supportedchoice ~ wait + gcost | income + size
, the marginal effects will always come up as 0 for wait
and gcost
, because the predict.mlogit
method in the mlogit
package does not produce different predictions when we use newdata
with different values of the predictors before the vertical bar in the formula.Aside from that, things seem to work quite well:
library("AER")
library("mlogit")
library("marginaleffects")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
mfx <- marginaleffects(mod, variables = c("income", "size"))
summary(mfx)
#> Average marginal effects
#> Group Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 air income 0.002786 0.001218 2.2878 0.0221493 0.0003992 0.005172
#> 2 bus income -0.000372 0.001103 -0.3373 0.7358957 -0.0025337 0.001790
#> 3 car income 0.003373 0.001373 2.4561 0.0140448 0.0006814 0.006065
#> 4 train income -0.005787 0.001319 -4.3859 1.155e-05 -0.0083727 -0.003201
#> 5 air size -0.126475 0.028922 -4.3729 1.226e-05 -0.1831612 -0.069788
#> 6 bus size 0.011348 0.025867 0.4387 0.6608765 -0.0393502 0.062046
#> 7 car size 0.045886 0.024754 1.8536 0.0637901 -0.0026319 0.094404
#> 8 train size 0.069241 0.024784 2.7938 0.0052088 0.0206661 0.117816
#>
#> Model type: mlogit
#> Prediction type: response
Hey Vince,
I worked on that too, but didn't finished that yet. I tried to implement with tidyr most of the functions.
Due to my lack of experience in building such general functions I had to trie around. The crucial thing is simply to expand the datagrid function to the long format and manipulate data in a custom fashioned way...
I write to you again asap...
N
Holen Sie sich Outlook für Androidhttps://aka.ms/AAb9ysg
From: Vincent Arel-Bundock @.>
Sent: Tuesday, May 3, 2022 9:03:25 PM
To: vincentarelbundock/marginaleffects @.>
Cc: NorbertSch101 @.>; Mention @.>
Subject: Re: [vincentarelbundock/marginaleffects] Support: mlogit
(Issue #159)
@NorbertSch101https://github.com/NorbertSch101
FYI, the development version of marginaleffects now includes support for mlogit models. There are a couple limitations that cannot be worked around without changes upstream:
Aside from that, things seem to work quite well:
library("AER") library("mlogit") library("marginaleffects") data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) mfx <- marginaleffects(mod, variables = c("income", "size")) summary(mfx)
— Reply to this email directly, view it on GitHubhttps://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1116458130, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AYXRKOLTFYLCWQVWCV2POU3VIFZ73ANCNFSM5KNZIWNQ. You are receiving this because you were mentioned.Message ID: @.***>
Maybe it would be a good idea to let me know a bit more details on what you are working on. For instance, I will not merge code that requires more dependencies (e.g., tidyr
), and I think we want to be quite careful about adding stuff that should be in upstream packages. I just wouldn't want you to put in a ton of work on something unless it has a good chance of being accepted and merged eventually.
The problem with newdata
is actually not that we need balanced outcome (that is one problem, but a trivial one). The problem with newdata
is that the user may not always supply an individual-choice index that allows us to align the data internally. So we can't know for sure what prediction goes with what row of the original newdata
.
For sure. The it is possible to expand the idx var complete
Holen Sie sich Outlook für Androidhttps://aka.ms/AAb9ysg
From: Vincent Arel-Bundock @.>
Sent: Tuesday, May 3, 2022 9:31:35 PM
To: vincentarelbundock/marginaleffects @.>
Cc: NorbertSch101 @.>; Mention @.>
Subject: Re: [vincentarelbundock/marginaleffects] Support: mlogit
(Issue #159)
Maybe it would be a good idea to let me know a bit more details on what you are working on. For instance, I will not merge code that requires more dependencies (e.g., tidyr), and I think we want to be quite careful about adding stuff that should be in upstream packages. I just wouldn't want you to put in a ton of work on something unless it has a good chance of being accepted and merged eventually.
The problem with newdata is actually not that we need balanced outcome (that is one problem, but a trivial one). The problem with newdata is that the user may not always supply an individual-choice index that allows us to align the data internally. So we can't know for sure what prediction goes with what row of the original newdata.
— Reply to this email directly, view it on GitHubhttps://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1116483413, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AYXRKOOR7WZJNKKG2BXTD23VIF5JPANCNFSM5KNZIWNQ. You are receiving this because you were mentioned.Message ID: @.***>
I'm not sure how that’s possible. For example, this newdata
does not have any information about the index, yet it still produces predictions, and we have no way of knowing which individual they refer to:
library("AER")
library("mlogit")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
predict(mod, newdata = nd)
#> air train bus car
#> 0.1693768 0.2727905 0.1783145 0.3795182
Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no no index in newdata
:
nd <- nd[c(2, 1, 3, 4),]
predict(mod, newdata = nd)
#> air train bus car
#> 0.911253079 0.001247517 0.027969698 0.059529706
Of course, there is an index stored inside the model object:
mod$model$idx
#> ~~~ indexes ~~~~
#> individual mode
#> 1 1 air
#> 2 1 train
#> 3 1 bus
#> 4 1 car
#> 5 2 air
#> 6 2 train
#> 7 2 bus
#> 8 2 car
#> 9 3 air
#> 10 3 train
#> indexes: 1, 2
But my point is that we have no way of matching that index to an arbitrary newdata
supplied by the user. And I was unable to find where the column names for the original index are stored in the model object.
############################# ####### Your Email yesterday: #############################
newdata supplied by the user.
stored in the model object.
results,
newdata:
order of the N´s, but not the k´s within N.
Users of mlogit shoulda know
assign the columns
in order to yield a long dataframe with N*k rows (k=number of alternatives)
of marginaleffects include a model and with its data (insight::)
especially when the datagrid function consequently expands the data by strictly
procedures in the below, which work with altset and altset.vec
dat <- insight::get_data(model) %>% mutate(altset.vec = rep(levels(dat$idx$id2), length=dim(dat)[1])) altset <- levels(dat$idx$id2) ### Just categories of depvar y
(Control for exact mapping of rows to order of choice set)
mlogit. (Though I believe that won´t be necessary)
Your code in detail (predictions, datagrid, marginaleffects).
googled it.
extensively in the following. Here the lines, I am talking about:
tmp <- prep_datagrid(..., model = model, newdata = newdata) at <- tmp$at dat <- tmp$newdata variables_all <- tmp$all variables_manual <- tmp$variables_manual variables_automatic <- tmp$variables_automatic
at
-> data.frame
at <- lapply(at, unique) at <- expand.grid(at, stringsAsFactors = FALSE)
brought the ability to handle mlogit objects.
version???
advanced in coding efficiently than You.
some time to open a repos and load some data, where I can
############################################ ############# Get the formula object to get independent vars ############################################
values and counterfactuals
that, when neccessary
form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]")) asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "") is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "") asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "") as.vars <- c(asc.vars, asv.vars) var.vec <- c(asc.vars, is.vars, asv.vars)
############################################ ############# Case 0: evaluation of dataset / AAP & AME ############################################
############################################ ############# Case 1: datagrid=() / APM & MEM ############################################
###############
conditioned on each alternative.
also relevant for case 2, see below ###############
dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset)))
temp.newdata <- dat.glob.mean
dat.choice.mean <- dat %>% group_by(altset.vec) %>% summarise(across(c(paste(var.vec)), mean) )%>% select(!altset.vec)
temp.newdata <- dat.choice.mean
############################################ ############# Case 2: datagrid=(vars.control=number.value) ############################################
###############
alternatives (as an option)
in the following fashion:
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
rows:
the rest of the data fixed.
dat.choice.rep.val.glob.mean <- dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset))) %>% mutate(var.rep.val = rep(number.value, length(altset))) %>% sapply(., rep.int, times=length(altset)) %>% data.frame(.) %>% mutate(altset.vec = as.character(rep(altset, length=dim(.)[1])) ) %>% group_by(altset.vec) %>% nest()
temp.newdata <- NULL for (i in 1:length(altset)){ dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <- dat.choice.rep.val.glob.mean[[2]][[i]][i," var.rep.val "] temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]] ) }
contrasts
############################################ ############# Control for exact mapping of rows to order of choice set ############################################
not be very complicated idx <- paste0(rep(altset, length=dim(temp.newdata)[1]), ":", sort(rep(seq(dim(temp.newdata)[1]/length(altset)), length = dim(temp.newdata)[1] ) ) )
############################################ ############# Calculation ############################################
###############
newdata <- cbind(idx, temp.altset.vec, temp.newdata)
pr.pred <- predict(model, newdata) me.pred <- effects(model, newdata, type='aa')
############################################ ############# Passing results to other marginaleffects functions ############################################
Am Di., 3. Mai 2022 um 21:46 Uhr schrieb Vincent Arel-Bundock < @.***>:
I'm not sure how that’s possible. For example, this newdata does not have any information about the index, yet it still produces predictions, and we have no way of knowing which individual they refer to:
library("AER")
library("mlogit")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
predict(mod, newdata = nd)
> air train bus car
> 0.1693768 0.2727905 0.1783145 0.3795182
Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no no index in newdata:
nd <- nd[c(2, 1, 3, 4),]
predict(mod, newdata = nd)
> air train bus car
> 0.911253079 0.001247517 0.027969698 0.059529706
Of course, there is an index stored inside the model object:
mod$model$idx
>
~ indexes ~~> individual mode
> 1 1 air
> 2 1 train
> 3 1 bus
> 4 1 car
> 5 2 air
> 6 2 train
> 7 2 bus
> 8 2 car
> 9 3 air
> 10 3 train
> indexes: 1, 2
But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object.
— Reply to this email directly, view it on GitHub https://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1116497005, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYXRKOMLKUCO55CHA37KJMLVIF7CXANCNFSM5KNZIWNQ . You are receiving this because you were mentioned.Message ID: @.***>
Are you able to write on Github directly? The formatting via email mixes text and code and makes things very difficult to read.
I have a feeling that you are making things way more complicated than they need to be.
newdata
argument work is a function that returns TRUE
if the newdata
that user supplies strictly conforms to expectations, and FALSE
otherwise. I don’t know how to write that function, and I don’t think we can assume that every user of mlogit
uses insight::get_data
.To use the currently working beta version, install from Github:
library(remotes)
install_github("vincentarelbundock/marginaleffects")
RESTART R
completely, and then try this:
library("AER")
library("mlogit")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
marginaleffects(mod, variables = "income") |> summary()
#> Average marginal effects
#> Group Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 air income 0.002786 0.001218 2.2878 0.022149 0.0003992 0.005172
#> 2 bus income -0.000372 0.001103 -0.3373 0.735896 -0.0025337 0.001790
#> 3 car income 0.003373 0.001373 2.4561 0.014045 0.0006814 0.006065
#> 4 train income -0.005787 0.001319 -4.3859 1.155e-05 -0.0083727 -0.003201
#>
#> Model type: mlogit
#> Prediction type: response
I temporarily removed the block on newdata
to illustrate that things almost work already. Here are the marginal effects and adjusted predictions at the mean:
nd <- datagrid(mode = TravelMode$mode, newdata = TravelMode)
marginaleffects(mod, newdata = nd, variables = "income")
#> rowid type group term dydx std.error statistic p.value conf.low conf.high individual choice wait vcost travel gcost
#> 1 1 response air income 6.656530e-03 2.426455e-03 2.7433152 0.006082228 1.900766e-03 1.141229e-02 1 no 34.58929 47.76071 486.1655 110.8798
#> 2 2 response train income -5.521732e-03 1.910701e-03 -2.8898981 0.003853667 -9.266638e-03 -1.776826e-03 1 no 34.58929 47.76071 486.1655 110.8798
#> 3 3 response bus income -1.141280e-03 9.454854e-04 -1.2070837 0.227399906 -2.994397e-03 7.118373e-04 1 no 34.58929 47.76071 486.1655 110.8798
#> 4 4 response car income 6.482222e-06 2.031464e-05 0.3190912 0.749657342 -3.333374e-05 4.629818e-05 1 no 34.58929 47.76071 486.1655 110.8798
#> income size mode
#> 1 34.54762 1.742857 air
#> 2 34.54762 1.742857 train
#> 3 34.54762 1.742857 bus
#> 4 34.54762 1.742857 car
predictions(mod, newdata = nd)
#> rowid type group predicted std.error individual choice wait vcost travel gcost income size mode
#> 1 1 response air 0.831762528 0.053985678 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 air
#> 2 2 response train 0.107734987 0.037727504 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 3 3 response bus 0.058853854 0.021442837 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 bus
#> 4 4 response car 0.001648631 0.001116992 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 car
Again, all we would need is a single function that can be 100% sure that an arbitrary data frame supplied to the newdata
argument conforms to our specifications.
Hey Vince,
here a longer reply to Your Emails:
####### Your Email yesterday: But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object. Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no no index in newdata: nd <- nd[c(2, 1, 3, 4),]
For sure, You are completely right! It is just allowed to rotate the order of the N´s, but not the k´s within N. But this is the same for all structured long data like panel data etc... Users of mlogit shoulda know: The idx var is construced by dfidx or pivot_longer: There you have to assign the columns for identification of the observations plus the choice / outcome variable in order to yield a long dataframe with N*k rows (k=number of alternatives).
Actually it is possible to decompose the idx of the original data, yet mostly not relevant: Perhaps I am too optimistic, but I believe, that most of the applications of marginaleffects include a model and with its data (insight::). Thus the intended calculations are build on the correct data structure, especially when the datagrid function consequently expands the data by strictly considering the alternative vector as I have tried to fulfill with the procedures in the below, which work with altset and altset.vec
dat <- insight::get_data(model) %>% mutate(altset.vec = rep(levels(dat$idx$id2), length=dim(dat)[1])) altset <- levels(dat$idx$id2) ### Just categories of depvar y
Below I wrote some lines, which enables the possibility to check this. (Control for exact mapping of rows to order of choice set) Maybe this could also be done via a customized sanitize function for mlogit. (Though I believe that won´t be necessary)
I really would like to give You a more solid statement and yday I studied Your code in detail (predictions, datagrid, marginaleffects). I mostly understood the procedure, but there are two pieces missing:
1) I was not able to find the code for the prep_datagrid, even though I googled it. Somehow with this function a variable 'at' is constructed, which is used extensively in the following. Here the lines, I am talking about:
tmp <- prep_datagrid(..., model = model, newdata = newdata) at <- tmp$at dat <- tmp$newdata variables_all <- tmp$all variables_manual <- tmp$variables_manual variables_automatic <- tmp$variables_automatic
at
-> data.frame
at <- lapply(at, unique)
at <- expand.grid(at, stringsAsFactors = FALSE)
2) I was not able to update the package nor a copy of the new functions brought the ability to handle mlogit objects. I copied the sanitize model routine, too. Is there a link to the beta version???
I hope, I don´t ask silly questions, but the truth is, I am by far not so advanced in coding efficiently than You. I use github for the first time and hopefully in the evening I will find some time to open a repos and load some data, where I can give You the data and model of my application.
CU, Norbert
############################################ ############# Get the formula object to get independent vars ############################################
form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]")) asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "") is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "") asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "") as.vars <- c(asc.vars, asv.vars) var.vec <- c(asc.vars, is.vars, asv.vars)
############################################ ############# Case 0: evaluation of dataset / AAP & AME ############################################
############################################ ############# Case 1: datagrid=() / APM & MEM ############################################
###############
###############
dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset)))
temp.newdata <- dat.glob.mean
dat.choice.mean <- dat %>% group_by(altset.vec) %>% summarise(across(c(paste(var.vec)), mean) )%>% select(!altset.vec)
temp.newdata <- dat.choice.mean
############################################ ############# Case 2: datagrid=(vars.control=number.value) ############################################
###############
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
dat.choice.rep.val.glob.mean <- dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset))) %>% mutate(var.rep.val = rep(number.value, length(altset))) %>% sapply(., rep.int, times=length(altset)) %>% data.frame(.) %>% mutate(altset.vec = as.character(rep(altset, length=dim(.)[1])) ) %>% group_by(altset.vec) %>% nest()
temp.newdata <- NULL for (i in 1:length(altset)){ dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <- dat.choice.rep.val.glob.mean[[2]][[i]][i," var.rep.val "] temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]] ) }
############################################ ############# Control for exact mapping of rows to order of choice set ############################################
idx <- paste0(rep(altset, length=dim(temp.newdata)[1]), ":", sort(rep(seq(dim(temp.newdata)[1]/length(altset)), length = dim(temp.newdata)[1] ) ) )
############################################ ############# Calculation ############################################
###############
newdata <- cbind(idx, temp.altset.vec, temp.newdata)
pr.pred <- predict(model, newdata) me.pred <- effects(model, newdata, type='aa')
############################################ ############# Passing results to other marginaleffects functions ############################################
Hey Vince, here my reply to ur post from yday. In the beginning a short comment to ur question. Below I used the dataset and model to illustrate the preliminary code from yesterday. See below Please comment also the rest of the application. Sorry for the different type style. Why does this happen, even if I did not mark sth to bold or highlighten certain statemens...
Best,
Norbert
library("AER") library("mlogit") library(marginaleffects) library(stringi) library(tidyverse) library(texreg) library(ggeffects)
data("TravelMode", package = "AER") mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) summary(mod) marginaleffects(mod, variables = "income") |> summary()
nd <- datagrid(mode = TravelMode$mode, newdata = TravelMode) marginaleffects(mod, newdata = nd, variables = "income") predictions(mod, newdata = nd)
#######
######## dat <- insight::get_data(mod) dat <-dat %>% mutate(altset.vec = rep(levels(dat$idx$mode), length=dim(dat)[1])) altset <- levels(dat$idx$mode) ### Just categories of depvar y
form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]")) asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "") is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "") asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "") asv.vars <- NULL as.vars <- c(asc.vars, asv.vars) var.vec <- c(asc.vars, is.vars, asv.vars)
dat.glob.mean <- dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset)))
marginaleffects(mod, newdata = nd, variables = "income") marginaleffects(mod, newdata = dat.glob.mean, variables = "income")
dat.choice.mean <- dat %>% group_by(altset.vec) %>% summarise(across(c(paste(var.vec)), mean) ) %>% select(!altset.vec) dat.choice.mean
marginaleffects(mod, newdata = dat.choice.mean, variables = "income")
############################################ ############# Case 2: datagrid=(vars.control=number.value) ############################################
number.value=50 nd1 <- datagrid(mode = TravelMode$mode, income=number.value, newdata = TravelMode) marginaleffects(mod, newdata = nd1, variables = "income")
number.value=30
nd2 <- datagrid(mode = TravelMode$mode, gcost=c(30), newdata = TravelMode) marginaleffects(mod, newdata = nd2, variables = "income")
vars.control <- c("gcost")
dat.choice.rep.val.glob.mean <- dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset))) %>% mutate(var.rep.val = rep(number.value, length(altset))) %>% sapply(., rep.int, times=length(altset)) %>% data.frame(.) %>% mutate(altset.vec = as.character(rep(altset, length=dim(.)[1])) ) %>% group_by(altset.vec) %>% nest()
temp.newdata <- NULL for (i in 1:length(altset)){ dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <- dat.choice.rep.val.glob.mean[[2]][[i]][i,"var.rep.val"] temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]] ) }
temp.newdata <- temp.newdata %>% mutate(altset.vec.nd = rep(altset, length=dim(temp.newdata)[1]))
idx.nd <- paste0(rep(altset, length=dim(temp.newdata)[1]), ":", sort(rep(seq(dim(temp.newdata)[1]/length(altset)), length = dim(temp.newdata)[1] ) ) )
nd3 <- cbind(idx.nd, temp.newdata)
marginaleffects(mod, newdata = nd3, variables = vars.control)
dat.choice.rep.val.choice.mean <- dat %>% group_by(altset.vec) %>% summarise(across(c(paste(var.vec)), mean) )%>% select(!altset.vec) %>% mutate(var.rep.val = rep(number.value, length(altset))) %>% sapply(., rep.int, times=length(altset)) %>% data.frame(.) %>% mutate(altset.vec = as.character(rep(altset, length=dim(.)[1])) ) %>% group_by(altset.vec) %>% nest()
temp.newdata <- NULL for (i in 1:length(altset)){ dat.choice.rep.val.choice.mean[[2]][[i]][i, paste(vars.control)] <- dat.choice.rep.val.choice.mean[[2]][[i]][i,"var.rep.val"] temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.choice.mean[[2]][[i]] ) }
temp.newdata <- temp.newdata %>% mutate(altset.vec.nd = rep(altset, length=dim(temp.newdata)[1]))
idx.nd <- paste0(rep(altset, length=dim(temp.newdata)[1]), ":", sort(rep(seq(dim(temp.newdata)[1]/length(altset)), length = dim(temp.newdata)[1] ) ) )
nd4 <- cbind(idx.nd, temp.newdata)
marginaleffects(mod, newdata = nd4, variables = vars.control)
For formatting, please see this link:
In particular, note that (1) code blocks must be preceded AND followed by lines with three back-ticks, (2) hashtags denote sections in the markdown syntax.
At a high level, this is what I got from your last post:
newdata
argument works exactly as intended, and allows you to compute the quantities you need.datagrid()
function does not produce the types of data frames that you think would be useful for mlogit
models.I don’t want to be too glib, but maybe you should just avoid datagrid
… That function was conceived as a shortcut to construct a specific type of dataset that is useful for many models: 60 of the 61 model types currently supported by marginaleffects
, with the possible exception of mlogit
.
But the whole point of the interface design is that newdata
can accept arbitrary data frames constructed by users. There is nothing forcing you to use datagrid()
to build those data frames.
In fact, all the examples you give in your post seem trivial to build. And since they are so trivial, maybe the best course of action is not to add a ton of code complexity to marginaleffects
, but instead to write an example vignette, showing super simple code that replicates some of your examples:
library("AER")
library("mlogit")
library("tidyverse")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
# Case 1.1: Global mean
TravelMode %>%
summarize(across(c("wait", "gcost", "income", "size"),
function(x) rep(mean(x), 4)))
#> wait gcost income size
#> 1 34.58929 110.8798 34.54762 1.742857
#> 2 34.58929 110.8798 34.54762 1.742857
#> 3 34.58929 110.8798 34.54762 1.742857
#> 4 34.58929 110.8798 34.54762 1.742857
# Case 1.2: Mean conditioned on choice
TravelMode %>%
group_by(mode) %>%
summarize(across(c("wait", "gcost", "income", "size"), mean))
#> # A tibble: 4 × 5
#> mode wait gcost income size
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 air 61.0 103. 34.5 1.74
#> 2 train 35.7 130. 34.5 1.74
#> 3 bus 41.7 115. 34.5 1.74
#> 4 car 0 95.4 34.5 1.74
# Only one alternative is affected by cost reduction
nd3 <- datagrid(mode = TravelMode$mode, newdata = TravelMode)
nd3 <- lapply(1:4, function(i) mutate(nd3, gcost = ifelse(1:4 == i, 30, gcost)))
bind_rows(nd3)
#> individual choice wait vcost travel gcost income size mode
#> 1 1 no 34.58929 47.76071 486.1655 30.0000 34.54762 1.742857 air
#> 2 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 3 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 bus
#> 4 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 car
#> 5 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 air
#> 6 1 no 34.58929 47.76071 486.1655 30.0000 34.54762 1.742857 train
#> 7 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 bus
#> 8 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 car
#> 9 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 air
#> 10 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 11 1 no 34.58929 47.76071 486.1655 30.0000 34.54762 1.742857 bus
#> 12 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 car
#> 13 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 air
#> 14 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 train
#> 15 1 no 34.58929 47.76071 486.1655 110.8798 34.54762 1.742857 bus
#> 16 1 no 34.58929 47.76071 486.1655 30.0000 34.54762 1.742857 car
Hey Vince,
thx for Your referral to formatting Will keep that in mind. Secondly it was helpful to see how You code the setup of the newdata. I will use functionals more often than before and I am updating my R programming skills with Advanced R (H. Wickham). I learned R autodidactical and had a break the last two years, in which a lot progress has happening (tidyverse, purr...). Worthful to exploit...
You propose to bypass the datagrid thing by an example vignette. Surely, this might be an option. Even though it is my impression that with small refinements datagrid would also work for mlogit objects, if the expansion strictly relies on the idx of the mlogit object. As I have written, I was not able to understand every step of the datagrid function because I neither find prep_datagrid on Your github nor in R or in the web. So I just had only an impression that a fully incorporation of mlogit in the marginaleffects workflow would not go substantially beyond the code design. To my opinion, the more models share the same procedure and workflow, the better it is. Maybe there can be a mixed strategy. Vignette in the short run, full implementation in the long run.
Eventually it is up to You. N
Yeah, learning R feels like a life long endeavor. The good news is that it's fun, isn't it?!
I added a short vignette with some of the examples you supplied: https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html
Will keep thinking about this. Feel free to open an issue on Github if you have specific feature requests for datagrid()
.
Jup, da core. I love the data analyses stuff with R very much! Moreover, the whole project is a wonderful thing. Open-Source, powerful and stable, developement by many people taking the efforts without monetary reward. Isn´t it amazing? We are talking about collective action on a public good. What a pity, that Mancur Olson has missed this...
Specific feature requests for datagrid... I am not sure, what do you mean exactly: My question regarding prep_datagrid? Will do so, but before I open an issue, I better check if I understood you right.. Bye
I guess I'd like to know exactly what is the structure of all the datasets you would like to create with datagrid()
. Can't guarantee I'll be able to make them all, but we can try...
Hey,
You mean Yves Croissant? Nice! It will be interesting how he will implement a way of dealing with row-wise deletion within the choicesets.
However, my major problem of calculating marginal effects has been linked to newdata. Esp the data preparation built by means for some independent vars while varying other independents vars of interest and mapping rows ot fhe prepared data to the correct idx needed a custom made solution. My impression was that this is more than just case specific due to my research questions.
I will check if the routine in ur package works in the necessary way for mlogit but as I remember I tried code snippets of marginaleffects too. I think on Tuesday I can figure something out and will let you know.
Have a nice weekend,
Norbert
Am Fr., 22. Apr. 2022 um 13:13 Uhr schrieb Vincent Arel-Bundock < @.***>:
@NorbertSch101 https://github.com/NorbertSch101 FYI, I have been in touch with the mlogit developer and he seems interested in looking for an upstream fix. I think we should wait a bit before diving too deep into this.
— Reply to this email directly, view it on GitHub https://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1106410619, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYXRKOOOR3FEBEAVVGYWJIDVGKCWPANCNFSM5KNZIWNQ . You are receiving this because you were mentioned.Message ID: @.***>
Ah, I just saw that u are in the political science. My application basically deals with voting behavior data and spatial modeling and so on
Holen Sie sich Outlook für Androidhttps://aka.ms/AAb9ysg
From: Norbert Schoening @.>
Sent: Wednesday, May 4, 2022 5:59:26 PM
To: vincentarelbundock/marginaleffects @.>
Subject: Re: [vincentarelbundock/marginaleffects] Support: mlogit
(Issue #159)
############################# ####### Your Email yesterday: #############################
dat <- insight::get_data(model) %>% mutate(altset.vec = rep(levels(dat$idx$id2), length=dim(dat)[1])) altset <- levels(dat$idx$id2) ### Just categories of depvar y
tmp <- prep_datagrid(..., model = model, newdata = newdata) at <- tmp$at dat <- tmp$newdata variables_all <- tmp$all variables_manual <- tmp$variables_manual variables_automatic <- tmp$variables_automatic
at
-> data.frame
at <- lapply(at, unique) at <- expand.grid(at, stringsAsFactors = FALSE)
############################################ ############# Get the formula object to get independent vars ############################################
form.chr <- unlist(strsplit(as.character(attributes(dat)$formula)[3], "[|]")) asc.vars <- str_replace_all(unlist(strsplit(form.chr[1], "[+]")), " ", "") is.vars <- str_replace_all(unlist(strsplit(form.chr[2], "[+]")), " ", "") asv.vars <- str_replace_all(unlist(strsplit(form.chr[3], "[+]")), " ", "") as.vars <- c(asc.vars, asv.vars) var.vec <- c(asc.vars, is.vars, asv.vars)
############################################ ############# Case 0: evaluation of dataset / AAP & AME ############################################
############################################ ############# Case 1: datagrid=() / APM & MEM ############################################
###############
###############
dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset)))
temp.newdata <- dat.glob.mean
dat.choice.mean <- dat %>% group_by(altset.vec) %>% summarise(across(c(paste(var.vec)), mean) )%>% select(!altset.vec)
temp.newdata <- dat.choice.mean
############################################ ############# Case 2: datagrid=(vars.control=number.value) ############################################
###############
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode) nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
dat.choice.rep.val.glob.mean <- dat %>% summarise(across(c(paste(var.vec)), mean) ) %>% slice(rep(1:n(), each = length(altset))) %>% mutate(var.rep.val = rep(number.value, length(altset))) %>% sapply(., rep.inthttp://rep.int, times=length(altset)) %>% data.frame(.) %>% mutate(altset.vec = as.character(rep(altset, length=dim(.)[1])) ) %>% group_by(altset.vec) %>% nest()
temp.newdata <- NULL for (i in 1:length(altset)){ dat.choice.rep.val.glob.mean[[2]][[i]][i, paste(vars.control)] <- dat.choice.rep.val.glob.mean[[2]][[i]][i," var.rep.val "] temp.newdata <- rbind(temp.newdata, dat.choice.rep.val.glob.mean[[2]][[i]] ) }
############################################ ############# Control for exact mapping of rows to order of choice set ############################################
idx <- paste0(rep(altset, length=dim(temp.newdata)[1]), ":", sort(rep(seq(dim(temp.newdata)[1]/length(altset)), length = dim(temp.newdata)[1] ) ) )
############################################ ############# Calculation ############################################
###############
newdata <- cbind(idx, temp.altset.vec, temp.newdata)
pr.pred <- predict(model, newdata) me.pred <- effects(model, newdata, type='aa')
############################################ ############# Passing results to other marginaleffects functions ############################################
Am Di., 3. Mai 2022 um 21:46 Uhr schrieb Vincent Arel-Bundock @.**@.>>:
I'm not sure how that’s possible. For example, this newdata does not have any information about the index, yet it still produces predictions, and we have no way of knowing which individual they refer to:
library("AER")
library("mlogit")
data("TravelMode", package = "AER")
mod <- mlogit(choice ~ wait + gcost | income + size, TravelMode)
nd <- TravelMode[1:4, c("wait", "gcost", "income", "size")]
predict(mod, newdata = nd)
Moreover, if we just change the row ordering we get completely different results, and we have no way to verify the row order, since there is no no index in newdata:
nd <- nd[c(2, 1, 3, 4),]
predict(mod, newdata = nd)
Of course, there is an index stored inside the model object:
mod$model$idx
But my point is that we have no way of matching that index to an arbitrary newdata supplied by the user. And I was unable to find where the column names for the original index are stored in the model object.
— Reply to this email directly, view it on GitHubhttps://github.com/vincentarelbundock/marginaleffects/issues/159#issuecomment-1116497005, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AYXRKOMLKUCO55CHA37KJMLVIF7CXANCNFSM5KNZIWNQ. You are receiving this because you were mentioned.Message ID: @.***>
Hi Guys - did anything come of this? I.e is there a way to predict long formatted data using the predict function?
Thanks
@WITT101 did you read this vignette?
https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html#mlogit-package
If that did not answer your question, can you show me a VERY SIMPLE model using a SMALL public dataset in long format on which you would like predictions?
What is the code you used to estimate a model on these data?
Your sample data is not "balanced" (see vignette). On my computer it looks like the predictions()
function works fine in balanced data:
library(mlogit)
library(marginaleffects)
ex <- read.csv("example_predict.csv")
dat <- expand.grid(
chid.var = unique(ex$chid.var),
alt.var = unique(ex$alt.var))
dat$choice.id <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE)
dat$IV_1 <- sample(ex$IV_1, nrow(dat), replace = TRUE)
dat$IV_2 <- sample(ex$IV_2, nrow(dat), replace = TRUE)
dat$IV_3 <- sample(ex$IV_3, nrow(dat), replace = TRUE)
new.data <- mlogit.data(
data = dat,
choice = "choice.id",
chid.var = "chid.var",
alt.var = "alt.var",
shape = "long")
mod <- mlogit(
choice.id ~ IV_1 + IV_2 + IV_3 | 0 | 0,
data = new.data)
predictions(mod)
Hi @vincentarelbundock, glad to know that mlogit is also supported. I used it for commute mode choice modeling. I noticed that you composed a brilliant tutorial at https://vincentarelbundock.github.io/marginaleffects/articles/mlogit.html, but it is not yet indexed in the Articles page https://vincentarelbundock.github.io/marginaleffects/articles/index.html
It's indexed as "Categorical outcomes"
The
predict()
method of themlogit
package seems to behave weirdly. We need one prediction per row per level of the outcome variable. How can we get those?