jhelvy / logitr

Fast estimation of multinomial logit (MNL) and mixed logit (MXL) models in R with "Preference" space or "Willingness-to-pay" (WTP) space utility parameterizations
https://jhelvy.github.io/logitr/

Inconsistent Estimator? #49

Open tux25 opened 1 year ago

tux25 commented 1 year ago

Hi, I am using your package frequently to estimate WTP models and saw this post https://stats.stackexchange.com/questions/624305/mlogit-logitr-packages-fail-to-recover-true-estimates-of-mixed-logit-random-co today. Can you please comment on it?

jhelvy commented 1 year ago

Thanks for bringing this to my attention, I'll look into it.

jhelvy commented 1 year ago

I replied with this comment. I believe the issue is simply not using enough draws for simulating the normal distributions with Halton draws. As I wrote in the comment, the default of 50 may just not be enough. I used 200 and the bias disappeared. I'm not sure what the "best" threshold is. Models with more parameters probably need more draws, so there may be some rule of thumb that scales with the number of parameters. I could easily increase the default number to perhaps 200 or so just to make sure this issue isn't there for models with smaller numbers of parameters. But the package already posts a warning when there are more random parameters and suggests using Sobol draws.
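For reference, the number of draws can be set directly via logitr's numDraws argument (the same argument used in the simulation scripts later in this thread). A minimal sketch using the yogurt example data bundled with the package:

library(logitr)

set.seed(456)

# Same mixed logit example as in the documentation, but with 200 Halton draws
# instead of the default
mxl_200 <- logitr(
  data     = yogurt,
  outcome  = 'choice',
  obsID    = 'obsID',
  pars     = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numDraws = 200
)

summary(mxl_200)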

jhelvy commented 1 year ago

More updates on the stackexchange post. Even with higher numbers of draws there is still some small bias. I suspect the Halton draws may be the culprit. Still digging into the issue.

For sure though I am likely going to modify how the draws are done by default. I like how Stata’s cmmixlogit does it by using a flexible number of draws based on the model formulation, which makes sense (models with more random parameters really should use more draws).

JediKnight5 commented 1 year ago

Hi @jhelvy, as you preferred, I am continuing the discussion here instead of following up in the Stackexchange thread.

Regarding your last comments in the Stackexchange thread:

I have attached my R script and the resulting dataset comparing the performance of both logitr estimators (Halton and Sobol) to both mlogit estimators (pseudo-random numbers and Halton) in a setting with $NumDraws = 100$ and $N = 1000$. Don't be irritated by lines 28-90, which could easily be simplified; I just wanted to allow for different constants across the alternatives in my estimation function. Moreover, I have attached my Stata code, which reads in the data generated by the R code, and the corresponding results dataset as a csv file. material.zip

If you need any additional information, please let me know.

jhelvy commented 1 year ago

@JediKnight5 I appreciate you looking into this and adding your code here. I think this will be a lot easier to manage in this issue, and once we figure this out we can update the SO post and link back to this conversation.

Based on your Stata code, it looks like you exported the simulated data from the R code to csv files and then read those into Stata for each model estimation? I think that's fine, I just wanted to make sure I'm understanding how the Stata simulations are being done.

I'm actually thinking a slightly different way of organizing this is to write a separate R script to generate the data and then save all of the data as one large dataframe, keeping track of things like the seed and iteration. If you save it as a .parquet file (arrow::write_parquet()) it will be quite small in size. Then that never needs to be updated - it's generated only once, and we can then use it to stress test all sorts of things in all of these packages.
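For illustration, something like this (sim_data here is a hypothetical data frame holding all simulated datasets along with seed / iteration columns):

library(arrow)

# Write the full set of simulated datasets once
write_parquet(sim_data, "sim_data.parquet")

# Read it back in any estimation script
sim_data <- read_parquet("sim_data.parquet")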

I'm still thinking the consistent bias we're seeing is due to 1) the draws, 2) the optimizer, or 3) a combination of both. That you're getting the same results with a different algorithm doesn't necessarily rule it out because it could also be due to differences in convergence criteria. Basically, regardless of the algorithm used, it may still be stopping prematurely, and we may need to modify convergence criteria. I believe all of those controls can be passed in a call to logitr().

As for the draws, another idea is to export the draws themselves and then use those same draws in the other packages. I'm not sure if this can be done in the other packages, but in logitr you can pass a set of "standard draws" using the standardDraws argument. I haven't tested that argument too thoroughly, but it would let us generate draws in different ways and test strategies other than Halton and Sobol in logitr. The "standard draws" are supposed to be draws from a standard normal distribution, which are then scaled / shifted based on the model parameters. You can see how this is done in the code: the getStandardDraws function obtains the standard draws (using either Halton or Sobol), then the makeBetaDraws function shifts and scales those draws according to the parameters. I believe this is the way the other packages also work, but I'm not sure. As far as I can tell, I don't see what might lead to a slightly positive bias from this approach. Note also that I'm using a separate package for making the Halton or Sobol draws (randtoolbox::sobol and randtoolbox::halton). I figured that package must be using a well-vetted approach for making these draws, but perhaps I'm wrong.
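As a rough sketch of that idea (the number of columns here is an assumption; per the discussion below, the standardDraws matrix spans all model coefficients, with the columns for fixed coefficients zeroed out internally):

library(randtoolbox)

numDraws <- 200
nCoefs   <- 5  # hypothetical total number of model coefficients

# Standard normal draws from a Halton sequence (randtoolbox transforms the
# uniform sequence to normal via the inverse CDF when normal = TRUE)
draws <- halton(n = numDraws, dim = nCoefs, normal = TRUE)

# These could then be passed to logitr via the standardDraws argument, e.g.
# logitr(..., standardDraws = draws), and in principle the same matrix could
# be fed to any other package that accepts user-supplied draws.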

JediKnight5 commented 12 months ago

@jhelvy: Regarding your Stata question: Yes, your assumption is correct. I exported the data set I generated within my R code to a csv file (per iteration of the Monte-Carlo simulation).

Unfortunately, Stata cannot read in parquet files but I collected the data I generated for testing the different estimators in the following parquet file: LINK IS EXPIRED

I like your idea of keeping the draws from the standard normal distribution as a fixed input and comparing the estimators' performance accordingly. I think mlogit allows for this but, unfortunately, Stata's cmmixlogit does not. I won't have time in the next couple of days since this will probably require me to dive deeper into the code of logitr and mlogit, but I will have a look at it by the end of next week.

The following simple piece of analysis comparing halton draws from different packages to pseudo random number draws doesn’t suggest that there is an issue with the randtoolbox package used in logitr. halton_vs_pesudo_random_draws.zip

jhelvy commented 12 months ago

All great progress on this! I think doing as much as we can to hold inputs fixed (simulated data, draws, etc.) and only vary the model settings will lead us down the path that gets to the bottom of these differences.

One other thought I had was to also keep track of the log-likelihood achieved at the solution for each simulated dataset. Ideally these should all be converging to the same solution, but if they're not we should be able to see that by comparing the log-likelihoods. That is, if we find that the log-likelihood is generally better with Stata's cmmixlogit than with logitr, that means logitr is just stopping prematurely. That might be addressed by changing the optimizer convergence settings. If not, then I could consider using a different optimization routine. I chose NLopt because it is so well-established and well-tested in the open source world. If, on the other hand, the log-likelihood values look about the same, then we can narrow in on the draws (or perhaps something else entirely).

JediKnight5 commented 11 months ago

I had a look at the code of logitr and mlogit to better understand how to specify the respective standardDraws and random.nb options.

My understanding is that logitr is expecting a matrix with dimensions $NumDraws \times$ (number of explanatory variables in the model) as input for the standardDraws option. @jhelvy: Can you confirm this? In this case, I have two related questions: 1) What is the purpose of drawing standard normals for fixed-coefficient variables? Practicality reasons? I saw that you zero them out in line 27 of https://github.com/jhelvy/logitr/blob/master/R/draws.R 2) My understanding is that we would need in total $K \times N \times R$ standard normal draws, where $K$ is the number of random-coefficient variables, $N$ the number of individuals in a balanced panel dataset, and $R$ the number of repetitions / draws. In other words, we would need a matrix of $R \times K$ standard normals for every individual-time combination.

Regarding mlogit's random.nb option, I think this option erroneously has no impact at all. It is not called in any line of https://github.com/cran/mlogit/blob/0ebddceda198eb567975e8f7da586d29cd88fda3/R/mlogit.R and it seems that random.nb is always generated by the make.random.nb function of https://github.com/cran/mlogit/blob/0ebddceda198eb567975e8f7da586d29cd88fda3/R/lnl.rlogit.R. Moreover, passing a meaningless argument to this option, such as a string, doesn't cause an error or even a warning.

Disclaimer: I have spent a considerable amount of time looking at both code bases, but please excuse me if I have overlooked something given the number of lines and the nesting of the code.

jhelvy commented 11 months ago

I really appreciate the time you've spent on this. I know this is super tedious and time-consuming, but it's really helpful. I'm quite swamped at the moment with a lot of other responsibilities, so this would take me a lot longer to get to without your help.

Yes, you are correctly understanding how I have set up the creation of the standardDraws matrix. In reviewing my code, it does seem redundant to be taking standard draws for the fixed (non-random) coefficients. They get zeroed out in line 39 of the draws.R file. We still need them because computing the utilities and log-likelihoods for a model with any random parameters is handled by a different function, and it's much faster to do everything in matrices. So eventually those 0s get replaced with the fixed coefficient repeated across draws.

But taking draws for them may be an issue. It may be better / more appropriate to only take normal draws for the random coefficients and just bind on a matrix of zeros for the fixed parameters. In fact, this could very well be part of the problem. Just changing how this is done in the getStandardDraws function may fix things.

As for the dimensions, I am only taking $R \times K$ standard normals in total, not for every individual-time combination. This has to do with how I set up the calculations for computing the utilities and ultimately the log-likelihoods and gradients. I am fairly sure this is an appropriate approach.

JediKnight5 commented 11 months ago

I actually cannot see how the procedure you described (drawing standard normals for the fixed coefficients and then replacing them with the repeated fixed values) should adversely affect the estimation.

After having another look at your code, I realized that I struggle to understand the purpose of the makePartials function and why it takes a matrix of $R \times K$ standard normals as an argument. You seem to be quite convinced of this approach, so might I ask you to share a reference or some additional explanation? My approach is the one briefly described on slide 24 of this slide deck, but there might exist better / more efficient ways.

Given the ineffective random.nb option of the mlogit function, I have reached out to the mlogit author. Unless a change to the code is made, either by him or by us creating a modified version, our idea of comparing mlogit and logitr estimates using the same set of draws is impeded. While this seems to be just a minor modification, I would prefer to rely on the author and his knowledge of his code rather than rewriting it myself.

jhelvy commented 11 months ago

Okay, I think we may be getting down to something here. First, the makePartials function is essentially storing a pre-computed part of the gradients that doesn't change across iterations of the search algorithm. You can see on page 9 of the JSS article on logitr what I mean by this:

[screenshot: the partial derivative equations from page 9 of the JSS article]

That page only shows the general idea for simple logit, but the idea should hopefully be clear. For mixed logit, part of the partial derivatives involves multiplying by the standard normal draws. So effectively you end up with a big matrix of constants that you then use when computing the gradients inside the optimization search. This is one of the big ways to speed up computations.
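To make the idea concrete (notation is mine, not the article's): with linear-in-parameters utility $V_{jr} = \sum_k x_{jk} \beta_k^{(r)}$ and a normally distributed random coefficient $\beta_k^{(r)} = \mu_k + \sigma_k z_k^{(r)}$, the partials are $\partial V_{jr} / \partial \mu_k = x_{jk}$ and $\partial V_{jr} / \partial \sigma_k = x_{jk} z_k^{(r)}$. Both depend only on the data and the standard draws $z$, so they can be computed once before the optimization starts and reused at every iteration.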

However, this may be where there is an issue. I'll have to go back to how I computed the gradients, but I believe I set this up where I don't have different draws across individuals. This may not matter in a case without a panel structure, but in your case where you do have the panel structure (denoted with the panelID input) this may cause issues.

One quick check on this would be to simulate a dataset without the panel structure to see if the bias goes away. If it does, then that very well may be where the issue lies.

jhelvy commented 11 months ago

Btw, I built this all based on Train's book. You can see from section 6.6 of chapter 6 where he describes the estimation process. In 6.7 on panel data he writes:

The simplest specification treats the coefficients that enter utility as varying over people but being constant over choice situations for each person.

And under equation 6.3 he writes:

The only difference between a mixed logit with repeated choices and one with only one choice per decision maker is that the integrand involves a product of logit formulas, one for each time period, rather than just one logit formula. The probability is simulated similarly to the probability with one choice period. A draw of β is taken from its distribution. The logit formula is calculated for each period, and the product of these logits is taken. This process is repeated for many draws, and the results are averaged.

jhelvy commented 11 months ago

Alright, I just ran another simulation where I removed the panel structure from the data. It was 10,000 people, each with only one choice observation. The same result shows up: a slight bias on the SD term. Still rather puzzled about why this outcome remains so consistent, but I don't think it's the panel structure.

JediKnight5 commented 11 months ago

Okay, I think we may be getting down to something here. First, the makePartials function is essentially storing a pre-computed part of the gradients that doesn't change across iterations of the search algorithm. You can see on page 9 of the JSS article on logitr what I mean by this: ....

Thanks for this reference, this was very helpful and made your estimation strategy clear for the simple case. Moreover, I got a better understanding of your code.

I also ran a simulation with $N = 20000$ and $T = 1$ to match the total sample size $N \times T$ which I had before in my simulation, in order not to alter the ratio $\frac{\sqrt{N \times T}}{R}$. I got the following results in terms of absolute bias:

| | mlogit pseudo-random draws | mlogit Halton draws | logitr Halton draws | logitr Sobol draws |
| --- | --- | --- | --- | --- |
| $\beta_{x1}$ | 1% | 0% | 0% | 0% |
| $\mu_{\beta_{x2}}$ | 2% | 0% | 1% | 0% |
| $\sigma_{\beta_{x2}}$ | 4% | 0% | 2% | 4% |

I will double-check this by running another simulation, but I think there are two takeaways:

JediKnight5 commented 11 months ago

Btw, I built this all based on Train's book. You can see from section 6.6 of chapter 6 where he describes the estimation process. In 6.7 on panel data he writes:

The simplest specification treats the coefficients that enter utility as varying over people but being constant over choice situations for each person.

And under equation 6.3 he writes:

The only difference between a mixed logit with repeated choices and one with only one choice per decision maker is that the integrand involves a product of logit formulas, one for each time period, rather than just one logit formula. The probability is simulated similarly to the probability with one choice period. A draw of β is taken from its distribution. The logit formula is calculated for each period, and the product of these logits is taken. This process is repeated for many draws, and the results are averaged.

Thanks for this further explanation and reference. My simulation actually uses this simplest specification, where the random coefficient $\beta_{x2_n}$ varies only across people.

jhelvy commented 11 months ago

Interesting. How many draws were you using in the above simulation?

So the fact that you're getting similar levels of bias in mlogit (upwards of 4%) using pseudo-random numbers, which then goes to 0 with Halton draws, suggests that it very well could be the draws that are leading to a slight bias. Like, just by changing how the draws are made (without changing anything else), you can get a 4% bias, which is what we're getting with logitr. This unfortunately makes it all the harder to get down to what the real problem is. I'm using Halton and Sobol draws from a well-vetted external package, and I'm pretty sure I've correctly implemented those draws.

The only thing left I can think to do is re-run your simulations with the three different packages and keep the log-likelihood values. We should be able to see whether we are getting the same solutions across the runs for each package. If logitr is not converging to the same solution then the problem may be in the optimization part.

JediKnight5 commented 11 months ago

First of all, sorry for the delayed response. Unfortunately, I have been super busy in recent days.

I used 100 draws in the simulation above.

To double-check the finding about the estimators' behavior on a cross-sectional vs. a panel dataset, I did the following (sketched below): I used the dataset posted in the parquet file above and replaced the id column with the values of the id_t column. Moreover, I changed mlogit's panel option to FALSE, set logitr's panelID option to NULL, and used the id column as the argument for the obsID option. This allows comparing the estimators' performance while varying only the data structure and holding everything else fixed, such as $N \times T = 20000$.
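In code, the re-classification looks roughly like this (a sketch; df is the data frame read from the parquet file above, with the column names used in the scripts later in this thread):

# Panel structure: individuals identified by 'id', choice situations by 'id_t'.
# Cross-sectional re-classification: treat every choice situation as its own individual.
df_cs <- df
df_cs$id <- df_cs$id_t

# mlogit: set panel = FALSE
# logitr: drop the panel structure and use 'id' as the observation ID, e.g.
#   logitr(data = df_cs, outcome = 'choice', obsID = 'id', panelID = NULL, ...)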

Results using panel data structure: (Number of draws for every estimator: 100)

 bias_panel
      estimator beta_x1 mean_x2  sd_x2
1    mlogit_psn  0.0001 -0.0045 0.0143
2     mlogit_ha  0.0010 -0.0040 0.0110
3     logitr_ha  0.0008  0.0206 0.0262
4     logitr_so  0.0007  0.0026 0.0417
5 cmmixlogit_hm  0.0010  0.0005 0.0061
6 cmmixlogit_ha  0.0010  0.0000 0.0042

psn: pseudo-random numbers; ha: Halton sequence; so: Sobol sequence; hm: Hammersley sequence

[three density plots of the parameter estimates across estimators]

This displays which estimator achieves the maximum likelihood in the 100 iterations:

> table(res_panel$max_ll)

logitr_ha_ll logitr_so_ll mlogit_ha_ll 
          41           19           40 

Results using cross-sectional data structure: (Number of draws for every estimator: 100)

> bias_cs
      estimator beta_x1 mean_x2   sd_x2
1    mlogit_psn -0.0091 -0.0201 -0.0391
2     mlogit_ha  0.0011  0.0001  0.0044
3     logitr_ha  0.0005  0.0065  0.0252
4     logitr_so  0.0002  0.0023  0.0387
5 cmmixlogit_hm  0.0010  0.0005  0.0061
6 cmmixlogit_ha  0.0010  0.0000  0.0042

[three density plots of the parameter estimates across estimators]

This displays which estimator achieves the maximum likelihood in the 100 iterations:

> table(res_cs$max_ll)

cmmixlogit_ha_ll cmmixlogit_hm_ll     logitr_ha_ll     logitr_so_ll     mlogit_ha_ll 
              21               46                6                7               20 

Conclusion:

Something else I noticed when looking at the density plots is that logitr with Halton draws has a much larger spread than the other estimators when estimating the mean of x2.

This is a file with all results: res.zip

jhelvy commented 11 months ago

This is fantastic stuff - thanks so much for keeping up the progress on this! This is becoming quite the effort to figure out where the bias is coming from.

Based on these latest results, I'm thinking we may have a convergence issue. When you say "This displays which estimator achieves the maximum likelihood in the 100 iterations", what specifically do you mean? For example, on the cross-sectional data you get these results:

> table(res_cs$max_ll)

cmmixlogit_ha_ll cmmixlogit_hm_ll     logitr_ha_ll     logitr_so_ll     mlogit_ha_ll 
              21               46                6                7               20 

Does that mean that for the logitr with Halton draws case, only 7 out of the 100 iterations reached the maximum log-likelihood value? How do you determine what the maximum log-likelihood at the global solution is? Is it based on exit code statuses?

Also you say:

Surprisingly, the likelihood by cmmixlogit is always more negative than the one of mlogit and logitr. The absolute difference between the likelihood values of mlogit with Halton draws and logitr with Halton draws is on average absolutely modest: 1.33.

So it sounds like mlogit and logitr are reaching rather similar solutions, but how does that compare to the solutions reached by cmmixlogit?

If you're finding that the likelihood by cmmixlogit is always more negative than the one of mlogit and logitr, then that suggests neither mlogit nor logitr are reaching the global solution - they're reaching local solutions and stopping. It very well may be that Stata just has tighter tolerances on its stopping criteria. And if logitr is stopping too early, then that could explain why it consistently results in biased estimates for some of the parameters.

You can modify the tolerances for logitr using the options argument:

The options argument is used to control the detailed behavior of the optimization and must be passed as a list, e.g. options = list(...). Below is a list of the default options, but other options can be included. Run nloptr::nloptr.print.options() for more details.

| Argument | Description | Default |
| --- | --- | --- |
| xtol_rel | The relative x tolerance for the nloptr optimization loop. | 1.0e-6 |
| xtol_abs | The absolute x tolerance for the nloptr optimization loop. | 1.0e-6 |
| ftol_rel | The relative f tolerance for the nloptr optimization loop. | 1.0e-6 |
| ftol_abs | The absolute f tolerance for the nloptr optimization loop. | 1.0e-6 |
| maxeval | The maximum number of function evaluations for the nloptr optimization loop. | 1000 |
| algorithm | The optimization algorithm that nloptr uses. | "NLOPT_LD_LBFGS" |
| print_level | The print level of the nloptr optimization loop. | 0 |

What if you set all the tolerances tighter to like 1.0e-8 or 1.0e-10? What are the Stata tolerance levels?
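For example, a sketch using the yogurt data bundled with logitr (only the options list is the new part here; the tolerance values are just illustrative):

library(logitr)

set.seed(456)

# Same model as the yogurt example below, but with tighter nloptr tolerances
mxl_tight <- logitr(
  data     = yogurt,
  outcome  = 'choice',
  obsID    = 'obsID',
  panelID  = 'id',
  pars     = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numMultiStarts = 10,
  options  = list(
    xtol_rel = 1.0e-10,
    xtol_abs = 1.0e-10,
    ftol_rel = 1.0e-10,
    ftol_abs = 1.0e-10
  )
)

summary(mxl_tight)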

Maybe the draws aren't the problem. If we're pooling together a handful of results that didn't fully converge with ones that did, then it's kind of an apples and oranges thing. I feel like this very well may be happening, especially because the variance is so much wider in some plots, like the mean_x2 one for logitr. Maybe that wider variance is just from a greater number of the models stopping the gradient descent too early?

JediKnight5 commented 11 months ago

Based on these latest results, I'm thinking we may have a convergence issue. When you say "This displays which estimator achieves the maximum likelihood in the 100 iterations", what specifically do you mean? For example, on the cross-sectional data you get these results:

> table(res_cs$max_ll)

cmmixlogit_ha_ll cmmixlogit_hm_ll     logitr_ha_ll     logitr_so_ll     mlogit_ha_ll 
              21               46                6                7               20 

Does that mean that for the logitr with Halton draws case, only 7 out of the 100 iterations reached the maximum log-likelihood value? How do you determine what the maximum log-likelihood at the global solution is? Is it based on exit code statuses?

Apologies for any confusion! What I meant is that if you compare the log-likelihood of the six different estimators over all 100 cross-sectional data iterations, in 21 out of 100 iterations the cmmixlogit estimator with Halton draws achieves the highest likelihood, in 46 out of 100 iterations cmmixlogit with Hammersley draws achieves the highest likelihood, etc. This is how I created this indicator:

ll_cols <- c('mlogit_psn_ll', 'mlogit_ha_ll', 'logitr_ha_ll', 'logitr_so_ll', 'cmmixlogit_hm_ll', 'cmmixlogit_ha_ll')
res_cs$max_ll <- ll_cols[apply(res_cs[, ll_cols], MARGIN = 1, FUN = which.max)]

which can be reproduced given the results RData file in my last post.

Also you say:

Surprisingly, the likelihood by cmmixlogit is always more negative than the one of mlogit and logitr. The absolute difference between the likelihood values of mlogit with Halton draws and logitr with Halton draws is on average absolutely modest: 1.33.

So it sounds like mlogit and logitr are reaching rather similar solutions, but how does that compare to the solutions reached by cmmixlogit?

I agree with your statement regarding mlogit and logitr. In the case of the cross-sectional dataset structure, the cmmixlogit estimators typically achieve a slightly better solution in terms of a higher likelihood. However, the differences in the likelihood are very small across the estimators (with the exception of the mlogit estimator with pseudo-random numbers). In fact, the average absolute difference between the log-likelihood of the mlogit and logitr estimators and the log-likelihood of the cmmixlogit estimator with Halton draws is 1.09 and 1.22, respectively, for the cross-sectional dataset. This can also be seen in the following density plot, which shows a very large overlap of the densities across the different estimators.

[density plot of the log-likelihood values across estimators, cross-sectional dataset]

In other words, this suggests that there is no convergence issue for the cross-sectional dataset. All estimators, with the exception of mlogit PSN, converge to the same solution. Thus, a local optimum doesn't seem to be the explanation for the biased logitr SD parameter estimate of x2.

The opposite is clearly the case for the panel dataset structure. As mentioned above, the performance, and therefore the likelihood distribution, of cmmixlogit is exactly the same as for the cross-sectional dataset. However, mlogit and logitr converge to a different solution. The average absolute difference between the log-likelihood of the mlogit and logitr estimators and the log-likelihood of the cmmixlogit estimator with Halton draws increases to 1480 for both estimators. This is again a density plot of the log-likelihood of the different estimators:

[density plot of the log-likelihood values across estimators, panel dataset]

If you're finding that the likelihood by cmmixlogit is always more negative than the one of mlogit and logitr, then that suggests neither mlogit nor logitr are reaching the global solution - they're reaching local solutions and stopping. It very well may be that Stata just has tighter tolerances on its stopping criteria. And if logitr is stopping too early, then that could explain why it consistently results in biased estimates for some of the parameters.

What if you set all the tolerances tighter to like 1.0e-8 or 1.0e-10? What are the Stata tolerance levels?

cmmixlogit uses Stata's ml command. You can find the documentation of the tolerance levels here: https://www.stata.com/manuals/rmaximize.pdf. According to my understanding, this is the mapping of the tolerance parameters across the different estimators, with their default values in parentheses (if available):

| cmmixlogit | mlogit | logitr |
| --- | --- | --- |
| tolerance (1e-4) | steptol | xtol_abs (1e-6) |
| ftolerance (0) | ftol | ftol_abs (1e-6) |
| nrtolerance (1e-5) | tol | ? |

Do you agree with this mapping, and is there an equivalent in logitr of cmmixlogit's nrtolerance parameter / mlogit's tol parameter? I am happy to run new simulations for mlogit and logitr using the tolerance values of cmmixlogit.

Maybe the draws aren't the problem. If we're pooling together a handful of results that didn't fully converge with ones that did, then it's kind of an apples and oranges thing. I feel like this very well may be happening, especially because the variance is so much wider in some plots, like the mean_x2 one for logitr. Maybe that wider variance is just from a greater number of the models stopping the gradient descent too early?

Given the mlogit PSN estimator's performance on the cross-sectional dataset, I think we can clearly see that the draws can have an impact unless you rely on Halton, Hammersley, or Sobol draws. While in principle I would agree with your statement above (apples vs. oranges comparison), I don't think it can explain the spread of mean_x2 here. As you can see from the log-likelihood density plots above, the estimators all converge to the same solution on the cross-sectional dataset (besides mlogit PSN). Thus, at least for the cross-sectional dataset, I would argue we are comparing apples to apples.

My preliminary conclusion would be:

BTW thanks for ongoing support! Much appreciated!

jhelvy commented 11 months ago

Okay, I was totally misunderstanding your LL results. That makes much more sense. So for sure cmmixlogit appears to converge much more often to the "best" LL, especially in the panel structure case, though differences are rather small in the cross-sectional case.

No doubt there's definitely something different going on between the cross-sectional vs. panel structure. It's probably okay to use the same data but classify them differently for this experiment. The panel structure requires a different log-likelihood calculation, so by adding in the panelID in logitr you should be getting a different result.

With these results now, I have some ideas for where things may be doing wrong:

  1. I'm still not sure if the way I'm setting up the halton draws is correct (i.e. by defining a bunch of them but then zeroing out the fixed pars). I'm going to play around with a different version where I only make draws for the random pars. I doubt this is a problem, but I'm not 100% certain, and it very well may be accidentally creating some strange structure in the draws.
  2. The draws for the panel structure log-likelihood may be wrong. As we discussed earlier, you noted we should have in total $K \times N \times R$ standard normal draws, where $K$ is the number of random-coefficient variables, $N$ the number of individuals in a balanced panel dataset, and $R$ the number of repetitions / draws (see the dimensionality sketch below). But I've only been taking $R \times K$ standard normals in total. So perhaps this is creating the bias, since we're not using different draws across different individuals. This is my prime suspect for the problem.
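Just to pin down the dimensionality being discussed (pseudo-random draws are used here purely for illustration; this is not how logitr currently generates its draws):

# One set of R x K standard normal draws per individual (illustration only)
N <- 1000  # individuals
R <- 200   # draws
K <- 2     # random coefficients

draws_by_person <- lapply(seq_len(N), function(n) {
  matrix(rnorm(R * K), nrow = R, ncol = K)
})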

One reason I think the bias is so much worse with panel structure data is that logitr appears to struggle to converge with panel data. Take a look at how inconsistently I get convergence with a panel structure in logitr. If I use my simple yogurt example from my documentation, I very consistently achieve the same solution across a 10-start multistart:

library(logitr)

set.seed(456)

mxl_pref <- logitr(
  data     = yogurt,
  outcome  = 'choice',
  obsID    = 'obsID',
#  panelID  = 'id',
  pars     = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numMultiStarts = 10
)

summary(mxl_pref)

The multistart summary looks like this:

Summary Of Multistart Runs:
   Log Likelihood Iterations Exit Status
1       -2641.243         38           3
2       -2641.504         29           3
3       -2641.246         26           3
4       -2641.502         31           3
5       -2642.052         39           3
6       -2641.371         28           3
7       -2642.046         49           3
8       -2641.504         28           3
9       -2641.243         29           3
10      -2641.505         28           3

But if I put the panel structure back in:

library(logitr)

set.seed(456)

mxl_pref <- logitr(
  data     = yogurt,
  outcome  = 'choice',
  obsID    = 'obsID',
  panelID  = 'id',
  pars     = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numMultiStarts = 10
)

summary(mxl_pref)

Now the multistart summary looks like this:

Summary Of Multistart Runs:
   Log Likelihood Iterations Exit Status
1       -1266.550         34           3
2       -1300.751         64           3
3       -1260.216         35           3
4       -1261.216         43           3
5       -1269.066         40           3
6       -1239.294         56           3
7       -1343.221         59           3
8       -1260.006         55           3
9       -1273.143         52           3
10      -1304.384         59           3

It's bouncing all over the place. So one reason we're seeing logitr bias show up a lot more in the panel case may be from the fact that we're just not reaching the global min in a lot of the runs. You're using a 30-start multistart, but it could very well still be converging to a local min that is still not as good as the cmmixlogit min. This is for sure a big problem. Users shouldn't need to run a huge multistart to find a better solution, and it should be converging more consistently. This may all be due to the issue with the draws, but it may also be a bug in how I compute the LL.

jhelvy commented 11 months ago

Okay I'm a bit out of ideas. I spent the day implementing a version (this branch) where I take different sets of $R$ normal draws for each of $N$ individuals. That version is incomplete in that other features don't fully work yet (e.g. predict, summary, etc.), but it does compute the LL and the gradients correctly. I do believe this is actually the "more" correct approach according to the slides you shared earlier. On slide 21, it is emphasized that each decision maker should have their own set of draws. I say "more" correct because it really is about convergence. I think it is technically fine either way, but keeping the same standard draws for each respondent should help with convergence. But after implementing this in logitr, I still get the same outcome where the LL bounces around quite a lot across different multi-start runs. Decreasing the tolerances also doesn't change this outcome. I'm really running out of ideas (and time) on this. My only thought is that if you're using panel data, just make sure you use a really large number of multi-starts (~100 or so) to make sure you explore the solution space really well. If you do that in your simulations, it ought to reduce the bias at least some since more of the trials should converge to a global rather than local solution. But there's no guarantee. I'm not sure what the cmmixlogit algorithm does to achieve the consistent convergence.

jhelvy commented 11 months ago

Okay, I've gone back and attempted to simulate my own experiment from scratch just to make sure that I'm not missing something. I believe everything you had before is correct, but perhaps we've done something differently because in my simulations I'm no longer getting really a bias at all. In this experiment, I have two random parameters (x2 and x3) and I'm using 200 Halton draws. Across 100 runs, here's what I get for mean estimates:

        cross-sectional  panel
beta_x1 1.003579         1.000353
mean_x2 0.9819966        1.002053
mean_x3 0.9709964        0.9959077
sd_x2   1.028509         1.015882
sd_x3   1.005578         0.9973298

The two columns are cross sectional vs. panel. Here I use logitr version 1.0.1 and the only difference is that in the "panel" one I include panelID = 'id'. Including the panel structure does make a difference as the means are slightly biased when you ignore it. As for the panel model, it looks like only the SD term of x2 has a very slight bias of 1.6%. That's not zero, but it's not that high either, and with more draws it may go away completely. The other terms have essentially zero bias.

I'm not really sure what to make of this. Maybe we're just simulating the data differently? I wrote my code to simulate the data efficiently, so it gets generated within each run of the simulation, and I simulated the data in a way such that there is a panel structure (the coefficients are held constant within each individual but vary across individuals). Here is my code if you want to run it and check the results:

library(logitr)
library(tidyverse)

options(dplyr.width = Inf)

set.seed(4321)

# Define functions

sim_choice_data <- function(n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3) {

  # Individual-specific betas for x2 and x3 (constant across each person's choice tasks)

  beta_x2 <- rnorm(n, mean_x2, sd_x2)
  beta_x3 <- rnorm(n, mean_x3, sd_x3)

  # Generate the X data matrix

  X <- as.data.frame(matrix(rnorm(n*t*j*3), ncol = 3))
  names(X) <- c('x1', 'x2', 'x3')
  X$id <- rep(seq(n), each = t*j)
  X$alt_id <- rep(seq(j), t*n)
  X$q_id <- rep(rep(seq(t), each = j), n)
  X$obs_id <- rep(seq(n*t), each = j)

  # Compute utility by respondent 

  X_resp <- split(X[c('x1', 'x2', 'x3')], X$id)
  utility <- list()
  for (i in 1:n) {
    beta <- c(beta_x1, beta_x2[i], beta_x3[i])
    utility[[i]] <- as.matrix(X_resp[[i]]) %*% beta
  }
  V <- unlist(utility)

  # Compute probabilities of choosing each alternative

  expV <- exp(V)
  sumExpV <- rowsum(expV, group = X$obs_id, reorder = FALSE)
  reps <- table(X$obs_id)
  p <- expV / sumExpV[rep(seq_along(reps), reps),]

  # Simulate choices according to probabilities

  p_obs <- split(p, X$obs_id)
  choices <- list()
  for (i in seq_len(length(p_obs))) {
    choice <- rep(0, reps[i])
    choice[sample(seq(reps[i]), 1, prob = p_obs[[i]])] <- 1
    choices[[i]] <- choice
  }
  X$choice <- unlist(choices)

  return(as.data.frame(X))
}

estimate_mixed_logit <- function(
  n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3, numDraws
) {

  # Simulate choice data
  data <- sim_choice_data(n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3)

  # Cross sectional - just ignoring panel structure
  suppressMessages(
    m1 <- logitr(
      data     = data,
      outcome  = 'choice',
      obsID    = 'obs_id',
      # panelID  = 'id',
      pars     = c('x1', 'x2', 'x3'),
      randPars = c(x2 = 'n', x3 = 'n'),
      numDraws = numDraws,
      numMultiStarts = 1
  ))

  # Now include panelID
  suppressMessages(
    m2 <- logitr(
      data     = data,
      outcome  = 'choice',
      obsID    = 'obs_id',
      panelID  = 'id',
      pars     = c('x1', 'x2', 'x3'),
      randPars = c(x2 = 'n', x3 = 'n'),
      numDraws = numDraws,
      numMultiStarts = 10
  ))

  return(
    list(
      coef = cbind(coef(m1), coef(m2)), 
      ll = cbind(logLik(m1), logLik(m2))
    )
  )

}

R <- 100
results <- list()
numDraws <- 200
for (r in 1:R) {
  results[[r]] <- estimate_mixed_logit(
    n = 1000,
    t = 10,
    j = 3,
    beta_x1 = 1,
    mean_x2 = 1,
    sd_x2 = 1,
    mean_x3 = 1,
    sd_x3 = 1,
    numDraws = numDraws
  )
}

To view the means across all runs I used this:

cat('beta_x1', apply(do.call(rbind, map(results, \(x) x$coef[1,])), 2, mean))
cat('mean_x2', apply(do.call(rbind, map(results, \(x) x$coef[2,])), 2, mean))
cat('mean_x3', apply(do.call(rbind, map(results, \(x) x$coef[3,])), 2, mean))
cat('sd_x2', apply(do.call(rbind, map(results, \(x) abs(x$coef[4,]))), 2, mean))
cat('sd_x3', apply(do.call(rbind, map(results, \(x) abs(x$coef[5,]))), 2, mean))

And I also made a plot of all the parameters across the 100 runs:

coefs <- map_df(
    results, \(x) {
        temp <- data.frame(x$coef)
        temp$var <- row.names(temp)
        row.names(temp) <- NULL
        temp
    }
)
coefs$run <- rep(seq(R), each = 5)
names(coefs)[1:2] <- c('cross_section', 'panel')
coefs |> 
    pivot_longer(
        names_to = 'method', 
        values_to = 'estimate',
        cols = cross_section:panel
    ) |> 
    mutate(estimate = abs(estimate)) |> 
    ggplot() +
    geom_density(aes(x = estimate, color = method)) +
    facet_wrap(vars(var), nrow = 1) +
    theme_bw()

[plot: densities of the estimated parameters across the 100 runs, cross-sectional vs. panel]

You can see how accounting for the panel structure matters here, as the cross-sectional model that ignores the structure has negative biases. And you can also see the slight positive bias in the sd_x2 term. That bulge off to the right is similar to what you were seeing in your simulations. So perhaps this is all a fluke of the specific Halton sequence? Like, given the Halton sequence used, maybe the draws for sd_x2 end up being a poor approximation of a normal, but the draws for the other terms are better? Why would sd_x2 be biased but all the other terms be unbiased? If you add a second random parameter (x3) to your simulations and you get the same results, then I think we may have found the issue. I'm starting to think it's just coming down to how the specific standard normal draws are taken in these sequences (Halton, Sobol, etc.). It just comes out in a way where the sd_x2 term ends up slightly biased, but all other terms are fine. If that is the case, then all I can do is try to find an alternative to Halton draws or find a way to modify how the draws are taken to eliminate this bias (though it is rather small).

jhelvy commented 11 months ago

Okay I just found an error in my simulation. I'll fix it and will edit the above comment.

I just fixed the error and updated my comment above.

jhelvy commented 10 months ago

Hi @JediKnight5, I've been super busy and haven't had any time to look back at this. Just curious if you could re-run your experiments with two random variables instead of just one as I did here in my latest post. I just wanted to see if you get a bias on the mean and / or sd parameters of both or only one of the random variables. If it's only there on one but not both, then I'd be fairly confident this whole issue might just come down to the specifics of how the standard normals are drawn. I sure hope that's the issue as that's a lot easier to fix.

JediKnight5 commented 10 months ago

Hi @jhelvy , sorry, I have been very busy in the last two weeks. I'll definitely reply to your most recent post, hopefully in the next couple of days.

JediKnight5 commented 10 months ago

Hi @jhelvy, my apologies again, I have been very busy in the last two weeks.

Regarding this post:

Okay, I've gone back and attempted to simulate my own experiment from scratch just to make sure that I'm not missing something. I believe everything you had before is correct, but perhaps we've done something differently because in my simulations I'm no longer getting really a bias at all. In this experiment, I have two random parameters (x2 and x3) and I'm using 200 Halton draws. Across 100 runs, here's what I get for mean estimates: ...

I think your simulation code is fine; I couldn't spot anything that doesn't seem right to me. However, I think your bias estimates are somewhat less comparable to mine above since you were using 200 draws while I was using just 100, and as we know from this analysis taken from the SO post, a larger number of draws decreases the bias.

If you add a second random parameter (x3) to your simulations and you get the same results, then I think we may have found the issue.

These are the bias results of my simulation, which I ran for both 100 and 200 draws to make it most comparable:

100 draws, cross-sectional

| | mlogit PSN | mlogit HA | logitr HA | logitr SO |
| --- | --- | --- | --- | --- |
| beta_x1 | -0.0254 | 0.0028 | 0.0022 | 0.0016 |
| mean_x2 | -0.0408 | -6e-04 | -0.0024 | 3e-04 |
| sd_x2 | -0.063 | 0.0099 | 0.0331 | 0.041 |
| mean_x3 | -0.0366 | 0.0028 | 0.0062 | 0.0037 |
| sd_x3 | -0.0766 | -0.0074 | 0.0068 | 0.0262 |

100 draws, panel

| | mlogit PSN | mlogit HA | logitr HA | logitr SO |
| --- | --- | --- | --- | --- |
| beta_x1 | -0.004 | 5e-04 | 7e-04 | -1e-04 |
| mean_x2 | -0.0263 | -0.0179 | -0.0278 | -0.0099 |
| sd_x2 | 0.0189 | 0.0211 | 0.0434 | 0.0534 |
| mean_x3 | -0.0332 | -0.0151 | -0.008 | -0.0142 |
| sd_x3 | 0.0167 | 0.0104 | 0.0164 | 0.0416 |

200 draws, cross-sectional

| | mlogit PSN | mlogit HA | logitr HA | logitr SO |
| --- | --- | --- | --- | --- |
| beta_x1 | -0.0112 | 0.0037 | 0.0026 | 0.0027 |
| mean_x2 | -0.0206 | 7e-04 | -0.0038 | -5e-04 |
| sd_x2 | -0.0265 | 0.0119 | 0.0255 | 0.0303 |
| mean_x3 | -0.0178 | 0.0043 | 0.004 | 0.0054 |
| sd_x3 | -0.0409 | -0.0052 | -0.0024 | 0.0047 |

200 draws, panel

| | mlogit PSN | mlogit HA | logitr HA | logitr SO |
| --- | --- | --- | --- | --- |
| beta_x1 | -7e-04 | 0.002 | 0.0019 | 0.0017 |
| mean_x2 | -0.0173 | -0.0073 | -0.0074 | -0.0061 |
| sd_x2 | 0.0211 | 0.0154 | 0.0289 | 0.0346 |
| mean_x3 | -0.0144 | -0.0049 | 0.004 | -1e-04 |
| sd_x3 | 0.0049 | 0.0034 | 0.0052 | 0.0127 |

I think we can again see the beneficial impact of increasing the number of draws. However, the bias in the logitr SD estimates seems to be more substantial, particularly for x2. I honestly have no clue why it is so different from the one for x3.

This is the code I used to generate a parquet file with all data:

# Load necessary packages
library(tidyverse)
library(mlogit)
library(logitr)
library(EnvStats)
library(parallel)
library(arrow)

# Define functions
data_generator <- function(seed, N, T, J, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3){

  # Setting parameters
  set.seed(seed)

  n <- seq(1:N)
  t <- seq(1:T)
  j <- seq(1:J)

  # Creation of the data frame
  df <- expand.grid(n, t, j)
  names(df) <- c('id', 't', 'alt')

  df <- df %>%
    group_by(id) %>%
    mutate(ran_coef_x2 = rnorm(1,mean_x2,sd_x2),
           ran_coef_x3 = rnorm(1,mean_x3,sd_x3) ) %>%
    group_by(id,t,alt) %>%
    mutate(x1 = rnorm(1),
           x2 = rnorm(1),
           x3 = rnorm(1),
           e = rgevd(1),
           y_star = case_when(alt == 1 ~ 0 + beta_x1 * x1 + ran_coef_x2 * x2 + ran_coef_x3 * x3 + e, 
                              alt == 2 ~ 0 + beta_x1 * x1 + ran_coef_x2 * x2 + ran_coef_x3 * x3 + e,
                              alt == 3 ~ 0 + beta_x1 * x1 + ran_coef_x2 * x2 + ran_coef_x3 * x3 + e)
    ) %>%
    group_by(id,t) %>%
    mutate(max_y_star = max(y_star),
           choice = ifelse(max_y_star == y_star,1,0),
           id_t = cur_group_id()) %>%
    arrange(id,t,alt) %>% 
    ungroup() %>%
    mutate(seed = seed) %>%
    select(seed,id_t, id, t, alt, ran_coef_x2, ran_coef_x3, x1, x2, x3, e, y_star, max_y_star, choice)

    return(df)

}

# Set global parameters

N <- 1000
T <- 20
J <- 3

beta_x1 <- 1
mean_x2 <- 1
sd_x2 <- 1
mean_x3 <- 1
sd_x3 <- 1

# Get seeds from the previous analysis
df_twovariables <- read_parquet('n1000_t20_e3_twovariables_panel.parquet')

# Generate the data
dd <- list()

j <- 1
for(r in unique(df_twovariables$seed)){
    message(j)
    dd[[j]] <- data_generator(seed = r, 
                         N = N, 
                         T = T, 
                         J = J, 
                         beta_x1 = beta_x1, 
                         mean_x2 = mean_x2, 
                         sd_x2 = sd_x2,
                         mean_x3 = mean_x3, 
                         sd_x3 = sd_x3)
    j <- j + 1
}

data <- as.data.frame(do.call(rbind, dd))

# Export data
write_parquet(data, 'n1000_t20_e3_threevariables_panel.parquet')

This is the code I used for estimation in the panel case with 200 draws (the code for the cross-sectional dataset varies slightly by using different options):

# Load necessary packages
library(tidyverse)
library(mlogit)
library(logitr)
library(arrow) 

# Load parquet file
df_all <- read_parquet('n1000_t20_e3_threevariables_panel.parquet')
df_all$choice <- as.logical(df_all$choice)

# Define functions
estimate_mixed_logit <- function(s, df_all, nummultistarts, numdraws){

  # Subset dataframe
  df <- df_all %>% filter(seed == s) 

  # Mlogit estimation

  ## Reshaping the data in mlogit format
  df_cm <- dfidx(df, 
                 idx = list(c('id_t','id')),
                 chid.var = 'id',
                 alt.var = 'alt', 
                 group.var = 't',
                 choice = 'choice',
                 drop.index = TRUE)

  ## Estimation
  rpar <- c('n', 'n')
  names(rpar) <- c('x2', 'x3')

  ### Pseudo random numbers
  mlm_psn <- mlogit(choice ~ 1 + x1 + x2 + x3, data = df_cm, panel = TRUE, correlation = FALSE, rpar = rpar, R = numdraws, halton = NULL)

  ### Halton draws
  mlm_ha <- mlogit(choice ~ 1 + x1 + x2 + x3, data = df_cm, panel = TRUE, correlation = FALSE, rpar = rpar, R = numdraws, halton = NA)

  # Logitr estimation

  ## Halton draws
  suppressMessages(
  logitr_ha <- logitr(
    data     = df,
    outcome  = 'choice',
    obsID    = 'id_t',
    panelID  = 'id',
    pars     = c('x1', 'x2', 'x3'),
    randPars = c(x2 = 'n', x3 = 'n'),
    numMultiStarts = nummultistarts,
    numDraws = numdraws,
    drawType = "halton"
  )
  )  
  ## Sobol draws
  suppressMessages(
  logitr_so <- logitr(
    data     = df,
    outcome  = 'choice',
    obsID    = 'id_t',
    panelID  = 'id',
    pars     = c('x1', 'x2', 'x3'),
    randPars = c(x2 = 'n', x3 = 'n'),
    numMultiStarts = nummultistarts,
    numDraws = numdraws,
    drawType = "sobol"
  )
  ) 

  return(c(s, mlm_psn$coefficients[['x1']], mlm_psn$coefficients[['x2']], abs(mlm_psn$coefficients[['sd.x2']]), mlm_psn$coefficients[['x3']], abs(mlm_psn$coefficients[['sd.x3']]), mlm_psn$est.stat$nb.iter, as.numeric(mlm_psn$logLik),
           mlm_ha$coefficients[['x1']], mlm_ha$coefficients[['x2']], abs(mlm_ha$coefficients[['sd.x2']]),  mlm_ha$coefficients[['x3']], abs(mlm_ha$coefficients[['sd.x3']]), mlm_ha$est.stat$nb.iter, as.numeric(mlm_ha$logLik),
           logitr_ha$coefficients[['x1']], logitr_ha$coefficients[['x2']], abs(logitr_ha$coefficients[['sd_x2']]), logitr_ha$coefficients[['x3']], abs(logitr_ha$coefficients[['sd_x3']]), logitr_ha$status, logitr_ha$logLik,
           logitr_so$coefficients[['x1']], logitr_so$coefficients[['x2']], abs(logitr_so$coefficients[['sd_x2']]), logitr_so$coefficients[['x3']], abs(logitr_so$coefficients[['sd_x3']]), logitr_so$status, logitr_so$logLik
          ))

}

# Set global parameters
nummultistarts <- 30
numdraws <- 200

# Run Monte Carlo simulations
res <- list()

j <- 1
for(r in unique(df_all$seed)){
    message(j)
    message(r)
    message(Sys.time())
    res[[j]] <- estimate_mixed_logit(s = r,
                                     df_all = df_all,
                                     nummultistarts = nummultistarts,
                                     numdraws = numdraws
                                   )
    j <- j + 1
}

# Summarize the results
res <- as.data.frame(do.call(rbind, res))

# Name the columns to match the return order of estimate_mixed_logit()
# (the names of the iteration / status columns are assumed here; the others
# are the ones referenced in the summary below)
names(res) <- c(
  'seed',
  'mlogit_psn_beta_x1', 'mlogit_psn_mean_x2', 'mlogit_psn_sd_x2', 'mlogit_psn_mean_x3', 'mlogit_psn_sd_x3', 'mlogit_psn_iter', 'mlogit_psn_ll',
  'mlogit_ha_beta_x1', 'mlogit_ha_mean_x2', 'mlogit_ha_sd_x2', 'mlogit_ha_mean_x3', 'mlogit_ha_sd_x3', 'mlogit_ha_iter', 'mlogit_ha_ll',
  'logitr_ha_beta_x1', 'logitr_ha_mean_x2', 'logitr_ha_sd_x2', 'logitr_ha_mean_x3', 'logitr_ha_sd_x3', 'logitr_ha_status', 'logitr_ha_ll',
  'logitr_so_beta_x1', 'logitr_so_mean_x2', 'logitr_so_sd_x2', 'logitr_so_mean_x3', 'logitr_so_sd_x3', 'logitr_so_status', 'logitr_so_ll'
)

cbind(
    c('', 'beta_x1', 'mean_x2', 'sd_x2', 'mean_x3', 'sd_x3'),
    c('mlogit PSN', as.character(round(mean(res$mlogit_psn_beta_x1 -1),4)), as.character(round(mean(res$mlogit_psn_mean_x2 -1),4)), as.character(round(mean(res$mlogit_psn_sd_x2 -1),4)), as.character(round(mean(res$mlogit_psn_mean_x3 -1),4)), as.character(round(mean(res$mlogit_psn_sd_x3 -1),4))),
    c('mlogit HA', as.character(round(mean(res$mlogit_ha_beta_x1 -1),4)), as.character(round(mean(res$mlogit_ha_mean_x2 -1),4)), as.character(round(mean(res$mlogit_ha_sd_x2 -1),4)), as.character(round(mean(res$mlogit_ha_mean_x3 -1),4)), as.character(round(mean(res$mlogit_ha_sd_x3 -1),4))),
    c('logitr HA', as.character(round(mean(res$logitr_ha_beta_x1 -1),4)), as.character(round(mean(res$logitr_ha_mean_x2 -1),4)), as.character(round(mean(res$logitr_ha_sd_x2 -1),4)), as.character(round(mean(res$logitr_ha_mean_x3 -1),4)), as.character(round(mean(res$logitr_ha_sd_x3 -1),4))),
    c('logitr SO', as.character(round(mean(res$logitr_so_beta_x1 -1),4)), as.character(round(mean(res$logitr_so_mean_x2 -1),4)), as.character(round(mean(res$logitr_so_sd_x2 -1),4)), as.character(round(mean(res$logitr_so_mean_x3 -1),4)), as.character(round(mean(res$logitr_so_sd_x3 -1),4)))
)

I used version 1.1.0 of logitr in all of my estimations.

So perhaps this is all a fluke of the specific Halton sequence? Like, given the Halton sequence used, maybe the draws for sd_x2 end up being a poor approximation of a normal, but the draws for the other terms are better? Why would sd_x2 be biased but all the other terms be unbiased?

I had another look at the generation of the Halton sequence in the mlogit package and compared it to the randtoolbox package. I noticed one subtle difference: while mlogit uses 3 as the base, the randtoolbox package uses 2, which leads to different draws. I don't have any theoretical justification for changing the base, but maybe it is worth a try to pass a set of Halton draws with base 3, for instance by using mnorm::halton, to logitr via the standardDraws option? (randtoolbox doesn't seem to allow changing the base.)
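If it helps, here is a minimal hand-rolled sketch of a Halton sequence with an arbitrary base, transformed to standard normal draws, which could then be passed to logitr via standardDraws (this is my own implementation of the textbook radical-inverse recursion, not code taken from either package):

# Radical-inverse (van der Corput) sequence for a given base;
# the first few points are dropped, as is common practice
halton_seq <- function(n, base = 3, drop = 10) {
  out <- numeric(n + drop)
  for (k in seq_len(n + drop)) {
    i <- k
    f <- 1
    while (i > 0) {
      f <- f / base
      out[k] <- out[k] + f * (i %% base)
      i <- i %/% base
    }
  }
  out[-seq_len(drop)]
}

# Transform to standard normal draws via the inverse CDF
draws_base3 <- qnorm(halton_seq(200, base = 3))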

On a different note, I am still wondering why Stata's cmmixlogit estimator converges to the same likelihood independent of the nature of the data, i.e. whether the exact same data is treated as a cross-sectional or a panel dataset, and achieves unbiased estimates even if the nature of the data is misspecified. After looking into the slides and textbooks again, my understanding is that the log-likelihood for a cross-sectional dataset is different from the one for a panel dataset.

Do you agree that, from a theoretical point of view, we would expect the log-likelihood to change if we use exactly the same data but classify it once as a panel and once as a cross-sectional dataset, as I did above?

Thus, I take this statement of mine from above back and agree with your statement

The panel structure requires a different log-likelihood calculation, so by adding in the panelID in logitr you should be getting a different result.

Just for clarification on this statement:

So for sure cmmixlogit appears to converge much more often to the "best" LL, especially in the panel structure case, though differences are rather small in the cross-sectional case.

This is not true. In the panel case the cmmixlogit log-likelihood is always smaller than that of mlogit or logitr. I think this follows by definition from the different log-likelihoods. (I can write up the different log-likelihood functions for the cross-sectional and panel case, if necessary.)
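For reference, this is how I read the difference, following Train's description quoted earlier (notation is mine): with $L_{nt}(\beta)$ the logit probability of person $n$'s chosen alternative in choice situation $t$ and $\beta^{(r)}$ the $r$-th draw, the simulated log-likelihood is $\sum_n \sum_t \ln \left( \frac{1}{R} \sum_{r=1}^{R} L_{nt}(\beta^{(r)}) \right)$ in the cross-sectional case and $\sum_n \ln \left( \frac{1}{R} \sum_{r=1}^{R} \prod_t L_{nt}(\beta^{(r)}) \right)$ in the panel case. In the panel version the product over choice situations is taken before averaging over the draws, which is why the two log-likelihood values are not directly comparable.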

Sorry for the long reply. I hope I have addressed all open questions.

jhelvy commented 10 months ago

Thanks again for a detailed look into all of this. Based on everything we've now looked into, it seems there are a few points that I think we agree on:

  1. The log-likelihood in case of a cross-section dataset is different from the one in case of a panel dataset.
  2. The bias we've observed has to do with the draws used in simulation and probably not something else, like the convergence criteria. We've explored that rather extensively, and the optimizer appears to be working just fine.
  3. Increasing the draws sufficiently high does indeed remove the bias we've seen in logitr, or at least makes it so small that I would argue it's negligible, except for one of the parameters (x2_sd).
  4. For some unknown reason, the x2_sd parameter still has some slight bias, even with 200 draws, while all other parameters have no bias with a sufficient number of draws. This seems to hold for both logitr and mlogit as well as halton and sobol draws.

On point 1, I know logitr handles this properly, but I'm not sure if mlogit or cmmixlogit does. With regards to the main point of this issue, this is only really relevant for the purposes of comparing log-likelihoods across the different packages. But in any case, it's an important point and may be a reason why there are different log-likelihood results across packages.

On points 2 & 3, I think a general conclusion from this investigation is that we probably should be using more draws, and right now the default number used in logitr of 50 is insufficient. I will probably make an update that uses the same approach as cmmixlogit which calculates a number of draws based on the number of random parameters in the model, etc. I believe making this change will be sufficient to address the core issue of any potential bias in logitr for a mixed logit model, so that is my proposed solution to close out this issue.

On point 4, I remain puzzled as to why the x2_sd parameter is so stubbornly biased (albeit a rather small bias). In the original simulations when we only had one random parameter, this was the term that got me concerned about the potential for some systematic bias, but then when we add a second random term we see that the parameters on the second random parameter are unbiased. That's the real puzzle - if there's something systematically wrong with the algorithm, then why isn't the x3_sd term also biased? There is truly no justification for this other than it being some fluke of our simulation experiments. The fact that it appears for both Halton and Sobol draws is also notable, as it suggests it may not have to do with the draws themselves but perhaps could be some strange structure in the simulated data. All that said, if we crank up the number of draws to a rather large number we can probably get it to go away completely, and there is work that suggests we should be using far more draws than we are (on the order of thousands, not hundreds). So perhaps it's just the outcome of using too few draws. It does raise the question, though, of why cmmixlogit does not have this bias even with fewer draws. In any case, I don't think the conclusion of this whole exercise is that logitr or mlogit are biased, but rather that users should use more draws than we've been recommending to ensure that the model is obtaining the best estimates.

With all of that said, I think my proposal to base the default number of draws on the number of random parameters is an appropriate change to address this issue, and once I've done that I think I can close out this issue (and reply to the original SO post). Do you agree? Anything else you'd recommend?

JediKnight5 commented 10 months ago

> Thanks again for a detailed look into all of this. Based on everything we've now looked into, it seems there are a few points that I think we agree on:
>
> 1. The log-likelihood in case of a cross-section dataset is different from the one in case of a panel dataset.
>
> 2. The bias we've observed has to do with the draws used in simulation and probably not something else, like the convergence criteria. We've explored that rather extensively, and the optimizer appears to be working just fine.
>
> 3. Increasing the draws sufficiently high does indeed remove the bias we've seen in `logitr`, or at least makes it so small that I would argue it's negligible, except for one of the parameters (`x2_sd`).
>
> 4. For some unknown reason, the `x2_sd` parameter still has some slight bias, even with 200 draws, while all other parameters have no bias with a sufficient number of draws. This seems to hold for both `logitr` and `mlogit`, as well as for both Halton and Sobol draws.

I agree with all points.

> On point 1, I know logitr handles this properly, but I'm not sure if mlogit or cmmixlogit does. With regards to the main point of this issue, this is only really relevant for the purposes of comparing log-likelihoods across the different packages. But in any case, it's an important point and may be a reason why there are different log-likelihood results across packages.

I agree. Actually, this is rather a selling point for logitr and mlogit. Maybe I will start a discussion at statalist.org once I find time to dig into the cmmixlogit code and understand why Stata ignores the panel structure when maximizing the likelihood.

> On points 2 & 3, I think a general conclusion from this investigation is that we probably should be using more draws, and the current logitr default of 50 is insufficient. I will probably make an update that follows the same approach as cmmixlogit, which calculates the number of draws based on the number of random parameters in the model. I believe making this change will be sufficient to address the core issue of any potential bias in logitr for a mixed logit model, so that is my proposed solution for closing out this issue.

I agree. I think the more draws we take, the safer we are. I like your idea of making the number of draws a function of the number of random parameters to be estimated.

> On point 4, I remain puzzled as to why the x2_sd parameter is so stubbornly biased (albeit with a rather small bias). In the original simulations, when we only had one random parameter, this was the term that got me concerned about the potential for some systematic bias, but when we add a second random term we see that the parameters on that second random parameter are unbiased. That's the real puzzle: if there were something systematically wrong with the algorithm, why isn't the x3_sd term also biased? I see no explanation for this other than it being some fluke of our simulation experiments. The fact that it appears for both Halton and Sobol draws is also notable, as it suggests the cause may not be the draws themselves but perhaps some strange structure in the simulated data. All that said, if we crank up the number of draws to a rather large number we can probably make it go away completely, and there is work suggesting we should be using far more draws than we are (on the order of thousands, not hundreds). So perhaps it's just the outcome of using too few draws.

I agree. Did you have a chance to test my suggestion of switching the base of the Halton draws from 2 to 3? I am happy to test it myself, but I am not sure which logitr version takes which kind of standardDraws argument.

> It does raise the question, though, of why cmmixlogit does not show this bias even with fewer draws.

Indeed, I think a deep dive into the cmmixlogit code would solve this mystery.

> In any case, I don't think the conclusion of this whole exercise is that logitr or mlogit are biased, but rather that users should use more draws than we've been recommending to ensure that the model obtains the best estimates.

I think my perspective is just slightly different in the sense that I would argue the number of draws should be sufficiently high to obtain reliable estimates. In other words, the difference between the estimates and their true values decreases as the number of draws increases, yielding a consistent estimator, but a certain threshold number of draws must be exceeded to ensure that this difference is negligibly small for every random parameter. I think this view is in line with Cameron & Trivedi (Chapter 15.7.1) and Train (Chapter 10.5.1) in terms of asymptotics. In finite samples, with a finite number of draws, this minimum threshold matters for preventing large biases.
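For completeness, my reading of the asymptotic result in Train (Ch. 10.5.1) is roughly:

$$
\hat{\theta}_{\text{MSL}} \xrightarrow{\;p\;} \theta_0 \quad \text{if } R \to \infty \text{ as } N \to \infty,
\qquad
\sqrt{N}\,\bigl(\hat{\theta}_{\text{MSL}} - \theta_0\bigr) \xrightarrow{\;d\;} \mathcal{N}(0,\,\Omega) \quad \text{if } R \text{ rises faster than } \sqrt{N},
$$

with a simulation bias of order $1/R$ when $R$ is held fixed. That is why, in finite-sample experiments like ours, a minimum number of draws is needed before the remaining bias becomes negligible.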

> With all of that said, I think my proposal to base the default number of draws on the number of random parameters is an appropriate change to address this issue, and once I've done that I think I can close out this issue (and reply to the original SO post). Do you agree? Anything else you'd recommend?

I agree. Thanks for your continued support and interest in this issue!

Finally, I think it might be worthwhile to restrict the distribution of the lambda parameter in WTP space models to be log-normal, as discussed in the SO post.

jhelvy commented 10 months ago

> I agree. Did you have a chance to test my suggestion of switching the base of the Halton draws from 2 to 3? I am happy to test it myself, but I am not sure which logitr version takes which kind of standardDraws argument.

Yes, I actually wrote my own version of the Halton sequence to test this, and no matter what I used as the first prime I got similar results, so I don't think the prime is the issue. I'm going to keep the dependency on the randtoolbox package, as I suspect it is much more robustly tested.
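For transparency, the quick check I ran looked roughly like this (a minimal radical-inverse generator written only for testing; it is not the randtoolbox implementation that logitr actually uses):

```r
# Minimal one-dimensional Halton sequence (radical inverse) with a chosen prime base
halton1d <- function(n, base = 2) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    k <- i
    f <- 1 / base
    h <- 0
    while (k > 0) {
      h <- h + f * (k %% base)  # add the next digit of i in the given base
      k <- k %/% base
      f <- f / base
    }
    out[i] <- h
  }
  out
}

# Transform the uniform sequences to standard normal draws and compare moments
draws_base2 <- qnorm(halton1d(200, base = 2))
draws_base3 <- qnorm(halton1d(200, base = 3))
round(c(mean(draws_base2), sd(draws_base2), mean(draws_base3), sd(draws_base3)), 3)
```

Comparisons along these lines came out similar regardless of the base, which is why I don't think the prime is the culprit.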

> Finally, I think it might be worthwhile to restrict the distribution of the lambda parameter in WTP space models to be log-normal, as discussed in the SO post.

Good point, I had forgotten about this. But I might instead just restrict it so it can't be normal, while still allowing log-normal or censored normal, both of which force positivity. In practice, anything other than keeping the scale parameter fixed tends to have a much harder time converging, and that's actually the original motivation for including a multi-start option in the package: WTP space models tend to diverge more often. I'll make an updated version that includes this change as well as the change to how many draws are used, then I'll close this out and update the original SO post.
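For reference, the kind of WTP space specification this would apply to looks roughly like the following (argument names are indicative only and may differ across logitr versions; the data and variable names are placeholders from the simulation setup):

```r
library(logitr)

# WTP space model where the scale (price) parameter is log-normal,
# which forces it to be positive across the population.
# sim_data, choice, obsID, id, price, and x1-x3 are placeholder names.
mxl_wtp <- logitr(
  data           = sim_data,
  outcome        = "choice",
  obsID          = "obsID",
  panelID        = "id",
  pars           = c("x1", "x2", "x3"),
  scalePar       = "price",
  randPars       = c(x2 = "n", x3 = "n"),
  randScale      = "ln",  # log-normal scale parameter instead of normal
  numMultiStarts = 10,    # multi-start helps WTP space models converge
  numDraws       = 2000
)
summary(mxl_wtp)
```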