Open mateosere opened 2 months ago
see #8444 (I will close this as duplicate)
I think the gologit can be implemented similar to the current OrderedModel with the extra thresholds. OrderedModel was merged in #7035 with two earlier PRs. OrderedModel is based on GenericLikelihoodModel which provides the general MLE setup and mainly loglike and score need to be defined for the specific model (plus results that go beyond generic MLE results)
to clarify:
fit a generalized ordered logit model with country-specific thresholds
This does not sound like a gologit, which, AFAIR, parameters differ by choice, while country specific thresholds/parameters vary by an exogenous variable. Can you clarify how your model is specified?
for gologit: The main problem that I saw when I looked at it, is that it is not possible to impose that thresholds are monotonically increasing. (probabilities/integrals could be negative and we need to enforce non-negativity) The current implementation uses a reparameterization that enforces strict monotonicity of thresholds. This will not be possible or useful in the GO case. The non-monotonicity problem will likely show up if some choice probabilities are small and might become negative during the estimation. That's roughly where I stopped looking at goLogit and did not try to write a prototype model.
Thank you for the quick response and the detailed explanation.
To clarify, what I am trying to achieve is akin to the functionality provided by Stata's gologit2 with the npl (non-parallel lines) option. However, my specific requirement is to have thresholds that vary by country (i.e., country-specific thresholds). In other words, I need a model where:
In Stata, the gologit2 command with the npl option allows for this kind of flexibility, where different thresholds are estimated for different countries, while still assuming that the overall structure of the model (e.g., the impact of covariates) is consistent.
To be more specific, the Stata package I'm talking about is https://www.stata.com/meeting/4nasug/gologit2.pdf and the implementation in Stata would be something like gologit2 score $countries $covariates i.unique_booking_id, npl($countries ) link(logit)
Thank you again for your assistance!
What does the "$" sign do in this case? does it just mean the explicit list of countries.
If I understand correctly, then this specifies country specific thresholds/cutoffs but all slope variables are constant across choices. This then yields n_countries * (n_choices - 1) cutoff values. This is theoretically simpler than slopes varying by choice because cutoffs always vary by choice and we could impose monotonicity of cutoffs for each country. It's structurally still similar to OrderedModel and implementation might mainly require to handle vectorized cutoffs. It would be a different extension of the ordered model because it requires vectorized cutoffs instead of choice-specific slopes.
for identification: AFAIU, this would require that for each country there are observations in each of the choices. Otherwise, some cutoffs (with zero cell counts) would not be identified. Is this correct? Do you have all choices observed for each country?
Although in terms of implementation: gologit needs to treat some slope parameters similar to cutoffs. In the current version of OrderedModel, cutoffs and slope parameters are treated separately. In gologit, we might be able to handle varying cutoffs and choice specific slope parameters in the same way, i.e. we don't treat cutoffs separately from choice specific slope parameters and could implement both extensions in the same way. (categorical dummy variable representation of cutoffs)
Hi,
Thank you for the detailed follow-up and questions.
Country-Specific Thresholds: Yes, you understood correctly. My model specification aims to estimate thresholds (cutoff points) that vary by country, while keeping the coefficients for other covariates consistent across countries. This accounts for potential differences in how different countries interpret the ordinal outcomes.
Stata Implementation: In Stata, I achieve this using the gologit2 command with the npl($countries) option, which allows for non-parallel lines (different thresholds) across countries. The $countries variable is simply a list of all countries included in the analysis, defined as a global macro. The npl($countries) option in gologit2 specifies that the thresholds should vary across these countries.
Model Simplicity: The model is structurally simpler than a full generalized ordered logit model where slopes vary by choice.
Identification: You are correct in pointing out the identification issue; for each country, observations across all choices (categories) are necessary to identify the country-specific thresholds. In my dataset, each country has observations in all choices (categories), which ensures that all thresholds are identified.
Here’s a simplified excerpt from the Stata output to illustrate the country-specific thresholds. Due to the large dataset, Stata struggles with memory issues, which is why I’m trying to achieve the same with Python.
-----------------------------------------------------------------------------------
score | Coefficient Std. err. z P>|z| [95% conf. interval]
------------------+----------------------------------------------------------------
1 |
Algeria | -1.06e+08 . . . . .
Argentina | -4.82174 1.385257 -3.48 0.001 -7.536795 -2.106686
Australia | -4.93e+09 . . . . .
Austria | 0 (omitted)
Bahrain | .4265083 1.00909 0.42 0.673 -1.551272 2.404288
Belgium | -.3204708 .5175524 -0.62 0.536 -1.334855 .6939133
Brazil | 0 (omitted)
nights | .0203716 .0068274 2.98 0.003 .0069902 .033753
domestic | -.0974249 .0349229 -2.79 0.005 -.1658725 -.0289773
|
unique_booking_id |
721137 | -.4818849 .045815 -10.52 0.000 -.5716806 -.3920891
802036 | .0652769 .0449225 1.45 0.146 -.0227697 .1533234
802918 | -.0103567 .0546475 -0.19 0.850 -.1174639 .0967505
803701 | -.472349 .0469168 -10.07 0.000 -.5643042 -.3803937
805873 | .9872502 .0472735 20.88 0.000 .8945958 1.079905
|
_cons | 4.467545 .0809258 55.21 0.000 4.308933 4.626156
------------------+----------------------------------------------------------------
2 |
Algeria | 4.577465 .0223119 205.16 0.000 4.533735 4.621196
Argentina | -6.847872 2.194663 -3.12 0.002 -11.14933 -2.546413
Australia | 39.33129 .0635776 618.63 0.000 39.20668 39.4559
Austria | .2576597 4.953937 0.05 0.959 -9.451878 9.967197
Bahrain | -.2574898 .5903717 -0.44 0.663 -1.414597 .8996174
Belgium | -.1113062 .4646262 -0.24 0.811 -1.021957 .7993443
Brazil | 115.868 .1041519 1112.49 0.000 115.6639 116.0722
couple | -.0547176 .0609596 -0.90 0.369 -.1741962 .064761
nights | .0203716 .0068274 2.98 0.003 .0069902 .033753
domestic | -.0974249 .0349229 -2.79 0.005 -.1658725 -.0289773
|
unique_booking_id |
721137 | -.4818849 .045815 -10.52 0.000 -.5716806 -.3920891
802036 | .0652769 .0449225 1.45 0.146 -.0227697 .1533234
802918 | -.0103567 .0546475 -0.19 0.850 -.1174639 .0967505
803701 | -.472349 .0469168 -10.07 0.000 -.5643042 -.3803937
805873 | .9872502 .0472735 20.88 0.000 .8945958 1.079905
|
_cons | 4.029434 .0692498 58.19 0.000 3.893706 4.165161
.
.
.
------------------+----------------------------------------------------------------
9 |
Algeria | -.2637795 .474502 -0.56 0.578 -1.193786 .6662273
Argentina | .3087265 .2999835 1.03 0.303 -.2792304 .8966835
Australia | .5907469 .1558285 3.79 0.000 .2853286 .8961653
Austria | -.4181689 .2345123 -1.78 0.075 -.8778045 .0414667
Bahrain | .0045311 .1884284 0.02 0.981 -.3647818 .373844
Belgium | -.5115452 .1644764 -3.11 0.002 -.833913 -.1891775
Brazil | .4271591 .1435852 2.97 0.003 .1457374 .7085809
leisure | -.2934047 .0358448 -8.19 0.000 -.3636592 -.2231503
nights | .0203716 .0068274 2.98 0.003 .0069902 .033753
domestic | -.0974249 .0349229 -2.79 0.005 -.1658725 -.0289773
|
unique_booking_id |
721137 | -.4818849 .045815 -10.52 0.000 -.5716806 -.3920891
802036 | .0652769 .0449225 1.45 0.146 -.0227697 .1533234
802918 | -.0103567 .0546475 -0.19 0.850 -.1174639 .0967505
803701 | -.472349 .0469168 -10.07 0.000 -.5643042 -.3803937
805873 | .9872502 .0472735 20.88 0.000 .8945958 1.079905
|
_cons | -.6506156 .0412478 -15.77 0.000 -.7314597 -.5697715
-----------------------------------------------------------------------------------
how large is your dataset? how many countries and observations? The number of parameters will be pretty large with many countries and 9 (or 10?) levels of y.
Do you know why level 1 has strange results? Algeria, Australia huge coefficient and no inferential statistics. Austria and Brazil, omitted, I guess constrained to zero
Given that there is an extra intercept _cons
: Is one of the countries dropped as reference level to avoid unidentified duplicated constant?
Given that there is an extra intercept
_cons:
Is one of the countries dropped as reference level to avoid unidentified duplicated constant?
shouldn't be a problem: intercept _cons
only fixes the lowest level.
AFAICS, there are no reference level cutoffs, i.e. it's like full one-way categorical regressor without including a constant in the equation.
BTW: Did you try MNLogit which is more general but does not impose the ordinality constraints?
Can it handle the size of your data with country dummies?
In terms of implementation:
I guess it would be worth it to implement it as direct extension to OrderedModel with cutoffs depending on a categorical "exog" instead of merging it with gologit choice-specific parameters. (fixed effects cutoffs) I think being able to impose monotonicity and reduced computational/technical problems is worth the extra effort.
(so it's not necessarily a duplicate of #8444 anymore)
one of the main tricky parts in the implementation will be that cutoffs are essentially 3-dimensional, and we might need ravel and reshape similar to MNLogit.
Hello statsmodels community,
I am currently working on a project that involves modeling ordinal outcomes where the proportional odds (parallel lines) assumption does not hold. Specifically, I need to fit a generalized ordered logit model with country-specific thresholds, akin to the gologit2 command in Stata or the vglm function in R with cumulative(parallel = FALSE).
Given the complexity of this model, I am looking for advice on the best approach to implement it in Python using statsmodels. Are there any built-in functions or recommended workflows within statsmodels that could help achieve this? Alternatively, are there examples or resources that you would recommend for custom implementations of non-parallel ordinal models?
I would greatly appreciate any guidance or suggestions from the community.
Thank you!