Open chevrotin opened 10 months ago
Thanks for checking out {logitr}!
Hmm, this will be hard to diagnose if I can reproduce the error.
Here's a similar example using the yogurt
data that comes with the package. If I set a high number of draws, it seems to work without error:
library(logitr)
model <- logitr(
data = yogurt,
outcome = 'choice',
obsID = 'obsID',
panelID = 'id',
pars = c('feat', 'brand'),
scalePar = 'price',
randPars = c(feat = 'n', brand = 'n'),
numDraws = 2000,
numCores = 1,
drawType = 'sobol',
numMultiStarts = 10
)
My only guess is that you may be running into a memory issue. You have 10 random parameters, so with 2000 draws you're looking at a 20 x 2000 matrix storing the draws (since you have a mean and sd parameter for each variable). So even without attempting to run the multi-start in parallel, memory might become an issue.
How large is the dataset you're working with?
Thanks for your quick reply.
It isn't small. The dataset has over 1000 individuals, 12 choice sets if I recall correctly. But it is not unusually big for what I have worked on. It seems the problem is likely when a larger number of draws (>1000) is specified. It would seem to me that the memory problem would manifest in many applications. I'm curious about your take on the number of draws need.
From a conference paper where John Rose is a coauthor, they say: "Multivariate normality. With 7 or more random parameters, the Henze-Zirkler p-value decreased. Keeping the p-value above .05 with 11 random parameters required 4,000 draws. . "
https://sswr.confex.com/sswr/2020/webprogram/Paper39930.html
Yeah I think you're right, and this is also quite related to #49. That issue is a very, very long issue thread where we explore a huge series of tests to assess whether or not we're getting a biased estimator from logitr, and the conclusion seems to be that we simply need more draws. That is, with too few draws, we can see some evidence of departure from normality with the random parameters, which is what the paper you linked to concludes as well.
I think this is a pretty big issue with mixed logit in general, and it's something that @crforsythe have also discussed (tagging him here because he may have other thoughts or references to suggest on this topic). The need for thousands of draws changes the computational challenge for mixed logit, and really the only useful package I've seen that can handle that number of draws is xlogit, a python package by @arteagac. It has a very similar UI structure with logitr, so you should be able to use it successfully. He (quite brilliantly) uses GPUs to accelerate the processing of very large numbers of draws. It's the only package that I've seen that can handle upwards of 10,000 draws without running into memory issues. I've thought about incorporating his innovations into {logitr} to gain the same levels of speed improvements, but I also weigh the time it takes to do that work versus just use xlogit. It might be actually easier to simply call xlogit under the hood via python from R, which is another idea I've though about as maybe just an option for {logitr}.
Anyway, let me know if you give xlogit a try and how it works out - I think it might solve your problem. I do think it's simply running out of memory. I just pushed up the draws to 4,000 on a recently experiment we did last year that had only 5 random parameters and I hit a memory limit (on a machine that has 32 GB of RAM mind you). {logitr} is faster than most similar R packages, but once we start getting over ~1,000 draws it gobbles up memory the way it's written.
Hi everybody, I just wanted to chip in to mention the study by Czajkowski, M., & Budziński (2019) that analyzes simulation errors in Mixed Logit based on the number of random draws. They concluded that 1) for 5-attribute designs, over 2,000 random draws were needed, and 2) in the case of 10 attributes, over 20,000 draws were needed.
xlogit supports estimation using hundreds of thousands (and even millions) of random draws. I used xlogit to estimate a model with half-million random draws and it took 5 minutes in my old desktop computer with 16GB RAM and a GPU with 6GB of VRAM. If you don't have a GPU, the same estimation will take about 30 minutes. The model included 6 random parameters, 4,308 choice situations, and 4 alternatives. xlogit easily escalates to such large number of draws because it uses batch processing, which is a concept borrowed from the deep learning literature.
I'd be happy to answer any questions you may have about xlogit.
Thank you both for your wonderful contributions.
On Fri, Dec 15, 2023 at 5:18 PM Cristian Arteaga @.***> wrote:
Hi everybody, I just wanted to chip in to mention the study by Czajkowski, M., & Budziński (2019) https://doi.org/10.1016/j.jocm.2019.04.003 that analyzes simulation errors in Mixed Logit based on the number of random draws. They concluded that 1) for 5-attribute designs, over 2,000 random draws were needed, and 2) in the case of 10 attributes, over 20,000 draws were needed.
xlogit https://github.com/arteagac/xlogit/ supports estimation using hundreds of thousands (and even millions) of random draws. I used xlogit to estimate a model with half-million random draws and it took 5 minutes in my old desktop computer with 16GB RAM and a GPU with 6GB of VRAM. If you don't have a GPU, the same estimation will take about 30 minutes. The model included 6 random parameters, 4,308 choice situations, and 4 alternatives. xlogit easily escalates to such large number of draws because it uses batch processing, which is a concept I borrowed from the deep learning literature.
— Reply to this email directly, view it on GitHub https://github.com/jhelvy/logitr/issues/53#issuecomment-1858558661, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZQNYBSGJKJCQPVQNFPCEPLYJTEDLAVCNFSM6AAAAABAW5HN4KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJYGU2TQNRWGE . You are receiving this because you authored the thread.Message ID: @.***>
Let me know if xlogit works out! Cristian's done a fantastic job with it, so I'm sure it will.
I do believe it's possible to simply call xlogit from R using the {reticulate} package, so that's something I'll think about as an option. I really don't see a lot of value in re-implementing what Cristian has already done in {logitr}, and I'm not even sure if R would support the same toolchain he's using in python. So if I can find a way to run xlogit under the hood, that might be a nice compromise. You'd have to have python installed along with xlogit, but you'd be able to use the resulting estimated model in R just like any other model estimated by {logitr}.
Hey John,
I'm diving into how WTP Space is implemented on xlogit now. It would be awesome if different starting values can be tried as in logitr now.
Would it be possible to run logitr on Google colab? It would potentially resolve the memory issue if Google is generous enough with the computing power. Some hints here: How to use R with Google Colaboratory? - GeeksforGeeks https://www.geeksforgeeks.org/how-to-use-r-with-google-colaboratory/#
I hope I'm not taking too much of your time. I always have STATA as backup; the days of waiting for results are relaxing sometimes.
Best, Kar
On Sat, Dec 16, 2023 at 10:01 AM John Helveston @.***> wrote:
Let me know if xlogit works out! Cristian's done a fantastic job with it, so I'm sure it will.
I do believe it's possible to simply call xlogit from R using the {reticulate} package, so that's something I'll think about as an option. I really don't see a lot of value in re-implementing what Cristian has already done in {logitr}, and I'm not even sure if R would support the same toolchain he's using in python. So if I can find a way to run xlogit under the hood, that might be a nice compromise. You'd have to have python installed along with xlogit, but you'd be able to use the resulting estimated model in R just like any other model estimated by {logitr}.
— Reply to this email directly, view it on GitHub https://github.com/jhelvy/logitr/issues/53#issuecomment-1858838327, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZQNYBSUUFT6DYEFJRNE7QTYJWZWLAVCNFSM6AAAAABAW5HN4KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJYHAZTQMZSG4 . You are receiving this because you authored the thread.Message ID: @.***>
Actually yeah I helped with getting WTP space models in xlogit, so I believe it's supported now. I think if you want to implement a multi-start you have to do it manually yourself, like writing a for loop and use different starting points each time.
You definitely can use logitr in colab. In fact, it's what I did to benchmark it against other R packages: https://colab.research.google.com/drive/1vYlBdJd4xCV43UwJ33XXpO3Ys8xWkuxx?usp=sharing
You can open a colab notebook then just change the engine to R.
Hi @chevrotin, Yes, xlogit supports WTP space models thanks to @jhelvy's support. Please see the code below as a reference for the Yogurt dataset. I also added this code to the Google Colab notebook below, in which you can run estimations using their free GPU's. Using a GPU in Colab, the estimation takes about 2 minutes for 10,000 Halton draws.
https://colab.research.google.com/drive/10hDEcobMCTtLwXT269saIIfa7NKVfPyd?usp=sharing
import pandas as pd
import numpy as np
from xlogit import MixedLogit, MultinomialLogit
# STEP 1. Read Data
df = pd.read_csv("https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/yogurt_long.csv")
# STEP 2. Convert brand to factor (dummy) representation
df["brand_yoplait"] = 1*(df["alt"] == "yoplait")
df["brand_hiland"] = 1*(df["alt"] == "hiland")
df["brand_weight"] = 1*(df["alt"] == "weight")
# STEP3. Run Mixed Logit in WTP space. Note that you need to provide a `scale_factor` (price)
varnames = ["feat", "brand_yoplait", "brand_hiland", "brand_weight"]
ml = MixedLogit()
ml.fit(X=df[varnames],
y=df["choice"],
varnames=varnames,
ids=df["chid"],
alts=df["alt"],
panels=df['id'],
randvars={"feat": "n", "brand_yoplait": "n", "brand_hiland": "n", "brand_weight": "n"},
n_draws=10000,
scale_factor=df["price"],
verbose=2)
ml.summary()
OUTPUT:
Optimization terminated successfully.
Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
Iterations: 68
Function evaluations: 83
Estimation time= 117.7 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
feat 2.1499924 0.6508688 3.3032654 0.00097 ***
brand_yoplait 1.3029015 0.7382473 1.7648578 0.0777 .
brand_hiland -11.8661964 1.4159056 -8.3806409 8.82e-17 ***
brand_weight -6.8515833 1.2991364 -5.2739523 1.45e-07 ***
sd.feat 2.6531758 0.7187740 3.6912516 0.000228 ***
sd.brand_yoplait 7.9981323 1.0908153 7.3322514 3.07e-13 ***
sd.brand_hiland 5.6861943 1.1612862 4.8964626 1.04e-06 ***
sd.brand_weight 8.9809715 1.2614834 7.1193734 1.42e-12 ***
_scale_factor 0.4610556 0.0417985 11.0304253 1.23e-27 ***
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -1245.232
AIC= 2508.464
BIC= 2560.558
@arteagac Seriously I don't think I've told you enough how amazing xlogit is. I don't think there's another package out there that could even estimate a mixed logit with 10k draws. I know logitr and apollo would break down, and I'm pretty sure Stata would crash too. Just truly wonderful work.
I'm serious about seeing if I can link things up to run xlogit from inside logitr. With the {reticulate} package, you can pretty smoothly run python scripts from inside R and then extract the python objects back into R. My only reason to do that is just to make xlogit accessible to R users who don't want to learn python. It'd just make it easier to keep the rest of their workflows the same, like creating summary figures and tables from the estimated models in R. I think that'd be the better (and faster) way to have a similar R package compared to trying to replicate what you've done as a separate R package. I could just make it an option, like xlogit = TRUE
to use the xlogit package to estimate the model. Never tried this before, but I might play with the idea.
Thanks a lot for your comments, John! I definitely see the value of integrating xlogit calls within logitr. Both packages follow a very similar function call structure, so I think this integration is feasible and not too complex. I would be happy to help you address any friction points and guide you through the extraction of the Python objects containing the estimation results. I agree that this would be very useful for R users who prefer to maintain their workflows within R and leverage the additional nice features offered by logitr, such as the integration with cbcTools, the more detailed post-estimation summaries, and the multi-start estimation.
Cool, I just made a new issue (#54) to move that conversation there. Don't know when I'll get to work on this, perhaps in the spring, but if not maybe in the summer.
Thanks again. Combining xlogit and logitr would get us lightning speed and useful features. It would be very exciting for choice modelers.
Keep up the good work.
On Mon, Dec 18, 2023 at 1:46 PM John Helveston @.***> wrote:
Cool, I just made a new issue (#54 https://github.com/jhelvy/logitr/issues/54) to move that conversation there. Don't know when I'll get to work on this, perhaps in the spring, but if not maybe in the summer.
— Reply to this email directly, view it on GitHub https://github.com/jhelvy/logitr/issues/53#issuecomment-1861315435, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZQNYBTOTAVKAQPBVWSGRFDYKCFP5AVCNFSM6AAAAABAW5HN4KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRRGMYTKNBTGU . You are receiving this because you were mentioned.Message ID: @.***>
Fantastic work, John. I greatly enjoy exploring the package thus far.
I've run into some problem in estimating a WTP-space model. The problem appears when I requested a larger number of draws. The code below would run fine regardless of the number of cores used when numDraws = 200. However, when I set numDraws = 2000, it would produce an error regardless of the number of cores specified. Any potential solution?
Running multistart... Random starting point iterations: 10 Number of cores: 1 Error in serialize(data, node$con) : error writing to connection
wtp <- logitr( data = dat, outcome = "choice", obsID = "cs", panelID = "id", pars = c("l", "l160", "l320", "prov", "can", "organic", "grass", "btest", "bfree", "cn"), scalePar = "price", randPars = c(l = 'n', l160 = 'n', l320 ='n', prov = 'n', can = 'n', organic = 'n', grass= 'n', btest = 'n', bfree = 'n', cn = 'n'), numMultiStarts = 10, numDraws = 2000, numCores = 1, drawType = 'sobol' )