statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

ENH: MNLogit for counts or frequencies, similar to Binomial #8380

Open brett1479 opened 1 year ago

brett1479 commented 1 year ago

Maybe I have missed it in the library, but is there a way to use frequency weights for multinomial logistic regression? If not, I would like to request it.

Edit: this question is about fractional or count endog in MNLogit, not about freq_weights; see the discussion below.

brett1479 commented 1 year ago

Looking more, is this request as simple as documenting that you can modify model.wendog with rows that sum to 1?

josef-pkt commented 1 year ago

No, that won't work. Rows of the (multivariate version of) endog have to add to 1, i.e. they are either dummy variables (with or without the reference choice) or the index of the choice. The predicted endog will be choice probabilities, which also have to add to 1 when the reference choice is included.

Adding freq_weights is quite a bit of work; only GLM has it so far. The main part is that the loglike, score, and hessian computations need to use weighted sums. For inference we need the weighted/total nobs, and not nobs defined as the number of rows.

freq_weights are conceptually relatively simple, but they will have to be added at many different places so that most methods take them correctly into account.

One complication is that large parts of classes use inherited, more generic methods inside the discrete classes, so changes will not be restricted to the MNLogit model and result classes.

brett1479 commented 1 year ago

Just to make sure I understand, if I do something like

model = MNLogit(Y,X)
model.wendog = <n x j array whose rows contain non-negative potentially fractional elements that **do** sum to 1>
model.fit().summary()

you are saying this will not work? I briefly looked at the code and thought your computations of the log-likelihood, score, and Hessian would all still be correct, but maybe I misread. Assuming these three functions work correctly, will the base class functionality not continue to work based on this, or am I missing some assumptions made by those methods?

josef-pkt commented 1 year ago

Maybe I misunderstood what you want.

Do you want fractional choices, i.e. fractions for each level instead of a discrete 0-1 choice, so that wendog is continuous?

freq_weights just assumes that each row represents more than one observation, e.g. if we have only categorical regressors and we want to combine observations with identical regressors and identical endog.

To the first, I don't know the answer; I would have to check the details of the code. For example, Probit and Logit use some shortcuts in the computation that only work with 0-1 endog and not with fractions. GLM does not have that problem. See #7210. This depends on implementation details, and we never checked this for MNLogit. (In OrderedModel, I also use a computational shortcut that assumes a discrete 0-1 choice.)

The only other related issue that I find is #3537 which ended up discussing compositional data. (I had thought about fractional extension to OrderedModel, but decided to stick with not supporting fractional data because of much larger computational and memory requirements.)

brett1479 commented 1 year ago

Yes, fractional choices. I realized by calling it "freq_weights" I may have derailed the conversation.

josef-pkt commented 1 year ago

Another thought:

GLM binomial is a model for both binary choices and counts. Internally it works with choice fractions (events) and the number of observations for each (number of trials).

The same idea would apply to multinomial logit models. In this case we would need either count frequencies for each choice, or the fraction for each choice plus the number of trials in that row of data. (We might not need the number of trials if it is the same for all observations/rows.) This would be a proper MLE model and not a Quasi-MLE model for fractional or compositional data. (I was previously thinking only of QMLE for multivariate fractional data.)

It's a good enhancement request, but I don't know how much support for it there is already in the current MNLogit. (We would have to add at least something equivalent to n_trials.)

josef-pkt commented 1 year ago

I edited my previous reply on freq_weights to clarify:

freq_weights can be used to combine observations with identical endog and identical exog. var_weights and number of trials/exposure can be used to include observations that have identical exog but not identical endog, but where endog is the observed mean (or sum in case of Poisson) of several observations.
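The freq_weights idea can be illustrated with a minimal sketch in plain Python. It does not use the statsmodels API; the function `logit_loglike`, the toy data, and the single-regressor binary logit are all assumptions made for illustration. It shows that collapsing identical (endog, exog) rows and weighting by their multiplicity leaves the log-likelihood unchanged, while the number of rows no longer equals the number of observations.

```python
from collections import Counter
import math

def logit_loglike(rows, beta, weights=None):
    """Binary logit log-likelihood; each row is (y, x), scalar x, no intercept."""
    if weights is None:
        weights = [1] * len(rows)
    total = 0.0
    for (y, x), w in zip(rows, weights):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        # Each row contributes w times, as if it were repeated w times.
        total += w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Individual observations with many repeated (endog, exog) pairs (toy data).
rows = [(1, 0.5), (1, 0.5), (0, 0.5), (1, -1.0), (0, -1.0), (0, -1.0), (0, -1.0)]

# Collapse identical rows; the multiplicity becomes the frequency weight.
counts = Counter(rows)
unique_rows = list(counts)
freq_weights = [counts[r] for r in unique_rows]

full = logit_loglike(rows, beta=0.3)
collapsed = logit_loglike(unique_rows, beta=0.3, weights=freq_weights)

# For inference, nobs should be the total weight, not the number of rows.
total_nobs = sum(freq_weights)
```

Here `full` and `collapsed` agree up to floating point, and `total_nobs` recovers the original sample size even though `unique_rows` is shorter.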

josef-pkt commented 1 year ago

AFAICS, MNLogit loglike, score and hessian use the full endog version, e.g. compute all logprob and not just the one corresponding to the selected discrete choice (as I do in OrderedModel).

So it should work with fractional wendog. This assumes the underlying n_trials is the same across rows for the estimation. Most likely inference will not have the correct number of total underlying observations if each data row represents cases of several discrete choices. I think in terms of var_weights we are missing something like n_trials. Maybe scale = 1 / n_trials would work if n_trials array is constant. (thinking about the analogy to Binomial)
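A minimal sketch of why a "full endog" log-likelihood accepts fractional rows, written in plain Python rather than with the actual MNLogit code. The helper names (`probs`, `loglike`, `score`), the single regressor, and the toy data are assumptions for illustration. The log-likelihood sums over all choices, so nothing requires the rows to be 0-1 dummies; the analytic score is checked against a finite-difference gradient.

```python
import math

def probs(params, x):
    """Choice probabilities for one row; the first choice is the reference."""
    eta = [0.0] + [b * x for b in params]
    m = max(eta)
    e = [math.exp(v - m) for v in eta]
    s = sum(e)
    return [v / s for v in e]

def loglike(params, endog, exog):
    # Sums over every choice, so rows may be 0-1 dummies or fractions.
    return sum(
        y_j * math.log(p_j)
        for y, x in zip(endog, exog)
        for y_j, p_j in zip(y, probs(params, x))
    )

def score(params, endog, exog):
    # With rows of endog summing to 1, the gradient is sum_i (y_ij - p_ij) * x_i.
    g = [0.0] * len(params)
    for y, x in zip(endog, exog):
        p = probs(params, x)
        for j in range(len(params)):
            g[j] += (y[j + 1] - p[j + 1]) * x
    return g

# Fractional endog: each row is non-negative and sums to 1 (toy data).
endog = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.0, 0.25, 0.75]]
exog = [1.0, -0.5, 2.0]
params = [0.4, -0.2]

analytic = score(params, endog, exog)

# Central finite differences as an independent check of the score.
eps = 1e-6
numeric = []
for j in range(len(params)):
    up = list(params)
    dn = list(params)
    up[j] += eps
    dn[j] -= eps
    numeric.append((loglike(up, endog, exog) - loglike(dn, endog, exog)) / (2 * eps))
```

If every row stood for the same number of trials, the count log-likelihood would be that constant times this one (plus a term free of params), so the maximizer would be unchanged; the standard errors would not be, which is the inference caveat raised above.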

josef-pkt commented 1 year ago

loglike doesn't include the ratio of factorial terms (for permutations). That ratio is constant at 1 for a single choice. However, AFAICS the term does not depend on params, so it drops out of score and hessian and does not affect estimation.
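The point about the factorial term can be checked numerically with a small sketch in plain Python. The function names and the example counts are assumptions for illustration, not statsmodels code. The log multinomial coefficient is computed with `math.lgamma`, and the difference in log-likelihood between two probability vectors is the same with or without it.

```python
import math

def log_multinomial_coef(counts):
    """log( n! / prod_j y_j! ) for one row of choice counts."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def kernel(counts, p):
    """Part of the multinomial log-likelihood that depends on the probabilities."""
    return sum(c * math.log(pj) for c, pj in zip(counts, p))

counts = [3, 1, 2]          # one row with n_trials = 6 (toy data)
p_a = [0.5, 0.2, 0.3]       # probabilities under one parameter value
p_b = [0.3, 0.3, 0.4]       # probabilities under another parameter value

const = log_multinomial_coef(counts)  # log(6! / (3! 1! 2!)) = log(60)

# The constant cancels in any comparison across parameter values.
diff_kernel = kernel(counts, p_a) - kernel(counts, p_b)
diff_full = (const + kernel(counts, p_a)) - (const + kernel(counts, p_b))

# For a single discrete choice the coefficient is 1, so its log is 0.
single_choice_const = log_multinomial_coef([0, 1, 0])
```

The constant still matters for the reported log-likelihood value and for information criteria computed from it, but not for the parameter estimates.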

brett1479 commented 1 year ago

Agreed (that it looks like it will all work for my use-case), and given that my use-case seems to be handled, I will be totally satisfied with the current state of MNLogit. Thank you for the discussion and the great package!

josef-pkt commented 1 year ago

This issue should be split in two:

1) QMLE for fractional data: check that it works and is correct with robust cov_type
2) add/extend to multinomial count model

On 2): I checked Stata, which also seems to have only the discrete choice version (choose 1 out of k choices). I didn't see a count version.

The count model version seems to be more common in statistics than in econometrics, e.g. Zhang et al. refer to McCullagh and Nelder 1983 for the count model version of multinomial logit.

Zhang, Yiwen, Hua Zhou, Jin Zhou, and Wei Sun. 2017. “Regression Models for Multivariate Count Data.” Journal of Computational and Graphical Statistics 26 (1): 1–13. https://doi.org/10.1080/10618600.2016.1154063.

We currently don't have a model for a 2 x k contingency table with 2 samples and k multinomial categories of counts. A Poisson loglinear model does not impose that n_trials is fixed. MNLogit could be used if we blow up the number of rows to have individual choices instead of summary counts for each exog pattern.
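"Blowing up" the rows can be sketched as follows in plain Python. The helper `expand_counts`, the exog coding (constant plus a group dummy), and the table values are hypothetical and chosen only for illustration. Each count row is expanded into one row per individual choice, giving a choice index per row, which is the kind of discrete endog discussed above.

```python
def expand_counts(exog_patterns, count_rows):
    """Expand summary counts into one row per individual choice.

    exog_patterns: list of exog rows, one per distinct pattern
    count_rows:    list of per-choice counts, aligned with exog_patterns
    Returns (exog, choice), where choice holds the index of the chosen category.
    """
    exog, choice = [], []
    for x, counts in zip(exog_patterns, count_rows):
        for j, c in enumerate(counts):
            # Repeat the exog pattern once for each individual who chose j.
            exog.extend(list(x) for _ in range(c))
            choice.extend([j] * c)
    return exog, choice

# A 2 x 3 table: two samples (rows), three categories (columns); toy values.
exog_patterns = [[1, 0], [1, 1]]   # constant and a dummy for the second sample
count_rows = [[4, 1, 2], [0, 3, 5]]

exog, choice = expand_counts(exog_patterns, count_rows)
n_individuals = len(choice)
```

The expanded `choice` and `exog` lists have one entry per underlying observation, so nobs is the total count rather than the number of table rows. The cost is memory: the row count grows with the total count rather than with the number of exog patterns.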

(Most likely a reason for the focus on individual discrete choice in econometrics is the usual presence of continuous explanatory variables so we have as many exog patterns as individuals.)