Implementing Generalized Ordered Logit Model with Non-Parallel Lines

mateosere commented 2 months ago

Hello statsmodels community,

I am currently working on a project that involves modeling ordinal outcomes where the proportional odds (parallel lines) assumption does not hold. Specifically, I need to fit a generalized ordered logit model with country-specific thresholds, akin to the gologit2 command in Stata or the vglm function in R with cumulative(parallel = FALSE).

Given the complexity of this model, I am looking for advice on the best approach to implement it in Python using statsmodels. Are there any built-in functions or recommended workflows within statsmodels that could help achieve this? Alternatively, are there examples or resources that you would recommend for custom implementations of non-parallel ordinal models?

I would greatly appreciate any guidance or suggestions from the community.

Thank you!

josef-pkt commented 2 months ago

see #8444 (I will close this as duplicate)

I think the gologit can be implemented similar to the current OrderedModel with the extra thresholds. OrderedModel was merged in #7035 with two earlier PRs. OrderedModel is based on GenericLikelihoodModel which provides the general MLE setup and mainly loglike and score need to be defined for the specific model (plus results that go beyond generic MLE results)

to clarify:

fit a generalized ordered logit model with country-specific thresholds

This does not sound like a gologit, which, AFAIR, parameters differ by choice, while country specific thresholds/parameters vary by an exogenous variable. Can you clarify how your model is specified?

for gologit: The main problem that I saw when I looked at it, is that it is not possible to impose that thresholds are monotonically increasing. (probabilities/integrals could be negative and we need to enforce non-negativity) The current implementation uses a reparameterization that enforces strict monotonicity of thresholds. This will not be possible or useful in the GO case. The non-monotonicity problem will likely show up if some choice probabilities are small and might become negative during the estimation. That's roughly where I stopped looking at goLogit and did not try to write a prototype model.

mateosere commented 2 months ago

Thank you for the quick response and the detailed explanation.

To clarify, what I am trying to achieve is akin to the functionality provided by Stata's gologit2 with the npl (non-parallel lines) option. However, my specific requirement is to have thresholds that vary by country (i.e., country-specific thresholds). In other words, I need a model where:

The thresholds are allowed to differ across countries, reflecting different cut-off points for the ordinal outcome variable within each country.
The coefficients for other covariates (e.g., leisure, group, family, etc.) remain consistent across countries, but the position of the thresholds on the latent scale can shift depending on the country.

In Stata, the gologit2 command with the npl option allows for this kind of flexibility, where different thresholds are estimated for different countries, while still assuming that the overall structure of the model (e.g., the impact of covariates) is consistent.

To be more specific, the Stata package I'm talking about is https://www.stata.com/meeting/4nasug/gologit2.pdf and the implementation in Stata would be something like gologit2 score $countries $covariates i.unique_booking_id, npl($countries ) link(logit)

Thank you again for your assistance!

josef-pkt commented 2 months ago

What does the "$" sign do in this case? does it just mean the explicit list of countries.

If I understand correctly, then this specifies country specific thresholds/cutoffs but all slope variables are constant across choices. This then yields n_countries * (n_choices - 1) cutoff values. This is theoretically simpler than slopes varying by choice because cutoffs always vary by choice and we could impose monotonicity of cutoffs for each country. It's structurally still similar to OrderedModel and implementation might mainly require to handle vectorized cutoffs. It would be a different extension of the ordered model because it requires vectorized cutoffs instead of choice-specific slopes.

for identification: AFAIU, this would require that for each country there are observations in each of the choices. Otherwise, some cutoffs (with zero cell counts) would not be identified. Is this correct? Do you have all choices observed for each country?

Although in terms of implementation: gologit needs to treat some slope parameters similar to cutoffs. In the current version of OrderedModel, cutoffs and slope parameters are treated separately. In gologit, we might be able to handle varying cutoffs and choice specific slope parameters in the same way, i.e. we don't treat cutoffs separately from choice specific slope parameters and could implement both extensions in the same way. (categorical dummy variable representation of cutoffs)

mateosere commented 2 months ago

Hi,

Thank you for the detailed follow-up and questions.

Country-Specific Thresholds: Yes, you understood correctly. My model specification aims to estimate thresholds (cutoff points) that vary by country, while keeping the coefficients for other covariates consistent across countries. This accounts for potential differences in how different countries interpret the ordinal outcomes.
Stata Implementation: In Stata, I achieve this using the gologit2 command with the npl($countries) option, which allows for non-parallel lines (different thresholds) across countries. The $countries variable is simply a list of all countries included in the analysis, defined as a global macro. The npl($countries) option in gologit2 specifies that the thresholds should vary across these countries.
Model Simplicity: The model is structurally simpler than a full generalized ordered logit model where slopes vary by choice.
Identification: You are correct in pointing out the identification issue; for each country, observations across all choices (categories) are necessary to identify the country-specific thresholds. In my dataset, each country has observations in all choices (categories), which ensures that all thresholds are identified.

Here’s a simplified excerpt from the Stata output to illustrate the country-specific thresholds. Due to the large dataset, Stata struggles with memory issues, which is why I’m trying to achieve the same with Python.


-----------------------------------------------------------------------------------
            score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
1                 |
          Algeria |  -1.06e+08          .        .       .            .           .
        Argentina |   -4.82174   1.385257    -3.48   0.001    -7.536795   -2.106686
        Australia |  -4.93e+09          .        .       .            .           .
          Austria |          0  (omitted)
          Bahrain |   .4265083    1.00909     0.42   0.673    -1.551272    2.404288
          Belgium |  -.3204708   .5175524    -0.62   0.536    -1.334855    .6939133
           Brazil |          0  (omitted)

           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |   4.467545   .0809258    55.21   0.000     4.308933    4.626156
------------------+----------------------------------------------------------------
2                 |
          Algeria |   4.577465   .0223119   205.16   0.000     4.533735    4.621196
        Argentina |  -6.847872   2.194663    -3.12   0.002    -11.14933   -2.546413
        Australia |   39.33129   .0635776   618.63   0.000     39.20668     39.4559
          Austria |   .2576597   4.953937     0.05   0.959    -9.451878    9.967197
          Bahrain |  -.2574898   .5903717    -0.44   0.663    -1.414597    .8996174
          Belgium |  -.1113062   .4646262    -0.24   0.811    -1.021957    .7993443
           Brazil |    115.868   .1041519  1112.49   0.000     115.6639    116.0722
           couple |  -.0547176   .0609596    -0.90   0.369    -.1741962     .064761
           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |   4.029434   .0692498    58.19   0.000     3.893706    4.165161
.
.
.

------------------+----------------------------------------------------------------
9                 |
          Algeria |  -.2637795    .474502    -0.56   0.578    -1.193786    .6662273
        Argentina |   .3087265   .2999835     1.03   0.303    -.2792304    .8966835
        Australia |   .5907469   .1558285     3.79   0.000     .2853286    .8961653
          Austria |  -.4181689   .2345123    -1.78   0.075    -.8778045    .0414667
          Bahrain |   .0045311   .1884284     0.02   0.981    -.3647818     .373844
          Belgium |  -.5115452   .1644764    -3.11   0.002     -.833913   -.1891775
           Brazil |   .4271591   .1435852     2.97   0.003     .1457374    .7085809
          leisure |  -.2934047   .0358448    -8.19   0.000    -.3636592   -.2231503
           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |  -.6506156   .0412478   -15.77   0.000    -.7314597   -.5697715
-----------------------------------------------------------------------------------

josef-pkt commented 2 months ago

how large is your dataset? how many countries and observations? The number of parameters will be pretty large with many countries and 9 (or 10?) levels of y.

Do you know why level 1 has strange results? Algeria, Australia huge coefficient and no inferential statistics. Austria and Brazil, omitted, I guess constrained to zero

Given that there is an extra intercept _cons: Is one of the countries dropped as reference level to avoid unidentified duplicated constant?

josef-pkt commented 2 months ago

Given that there is an extra intercept _cons: Is one of the countries dropped as reference level to avoid unidentified duplicated constant?

shouldn't be a problem: intercept _cons only fixes the lowest level. AFAICS, there are no reference level cutoffs, i.e. it's like full one-way categorical regressor without including a constant in the equation.

josef-pkt commented 2 months ago

BTW: Did you try MNLogit which is more general but does not impose the ordinality constraints?

Can it handle the size of your data with country dummies?

josef-pkt commented 2 months ago

In terms of implementation:

I guess it would be worth it to implement it as direct extension to OrderedModel with cutoffs depending on a categorical "exog" instead of merging it with gologit choice-specific parameters. (fixed effects cutoffs) I think being able to impose monotonicity and reduced computational/technical problems is worth the extra effort.

(so it's not necessarily a duplicate of #8444 anymore)

one of the main tricky parts in the implementation will be that cutoffs are essentially 3-dimensional, and we might need ravel and reshape similar to MNLogit.

statsmodels / statsmodels

Implementing Generalized Ordered Logit Model with Non-Parallel Lines #9329