statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License
10.1k stars 2.88k forks source link

Implementing Generalized Ordered Logit Model with Non-Parallel Lines #9329

Open mateosere opened 2 months ago

mateosere commented 2 months ago

Hello statsmodels community,

I am currently working on a project that involves modeling ordinal outcomes where the proportional odds (parallel lines) assumption does not hold. Specifically, I need to fit a generalized ordered logit model with country-specific thresholds, akin to the gologit2 command in Stata or the vglm function in R with cumulative(parallel = FALSE).

Given the complexity of this model, I am looking for advice on the best approach to implement it in Python using statsmodels. Are there any built-in functions or recommended workflows within statsmodels that could help achieve this? Alternatively, are there examples or resources that you would recommend for custom implementations of non-parallel ordinal models?

I would greatly appreciate any guidance or suggestions from the community.

Thank you!

josef-pkt commented 2 months ago

see #8444 (I will close this as duplicate)

I think the gologit can be implemented similar to the current OrderedModel with the extra thresholds. OrderedModel was merged in #7035 with two earlier PRs. OrderedModel is based on GenericLikelihoodModel which provides the general MLE setup and mainly loglike and score need to be defined for the specific model (plus results that go beyond generic MLE results)

to clarify:

fit a generalized ordered logit model with country-specific thresholds

This does not sound like a gologit, which, AFAIR, parameters differ by choice, while country specific thresholds/parameters vary by an exogenous variable. Can you clarify how your model is specified?

for gologit: The main problem that I saw when I looked at it, is that it is not possible to impose that thresholds are monotonically increasing. (probabilities/integrals could be negative and we need to enforce non-negativity) The current implementation uses a reparameterization that enforces strict monotonicity of thresholds. This will not be possible or useful in the GO case. The non-monotonicity problem will likely show up if some choice probabilities are small and might become negative during the estimation. That's roughly where I stopped looking at goLogit and did not try to write a prototype model.

mateosere commented 2 months ago

Thank you for the quick response and the detailed explanation.

To clarify, what I am trying to achieve is akin to the functionality provided by Stata's gologit2 with the npl (non-parallel lines) option. However, my specific requirement is to have thresholds that vary by country (i.e., country-specific thresholds). In other words, I need a model where:

In Stata, the gologit2 command with the npl option allows for this kind of flexibility, where different thresholds are estimated for different countries, while still assuming that the overall structure of the model (e.g., the impact of covariates) is consistent.

To be more specific, the Stata package I'm talking about is https://www.stata.com/meeting/4nasug/gologit2.pdf and the implementation in Stata would be something like gologit2 score $countries $covariates i.unique_booking_id, npl($countries ) link(logit)

Thank you again for your assistance!

josef-pkt commented 2 months ago

What does the "$" sign do in this case? does it just mean the explicit list of countries.

If I understand correctly, then this specifies country specific thresholds/cutoffs but all slope variables are constant across choices. This then yields n_countries * (n_choices - 1) cutoff values. This is theoretically simpler than slopes varying by choice because cutoffs always vary by choice and we could impose monotonicity of cutoffs for each country. It's structurally still similar to OrderedModel and implementation might mainly require to handle vectorized cutoffs. It would be a different extension of the ordered model because it requires vectorized cutoffs instead of choice-specific slopes.

for identification: AFAIU, this would require that for each country there are observations in each of the choices. Otherwise, some cutoffs (with zero cell counts) would not be identified. Is this correct? Do you have all choices observed for each country?

Although in terms of implementation: gologit needs to treat some slope parameters similar to cutoffs. In the current version of OrderedModel, cutoffs and slope parameters are treated separately. In gologit, we might be able to handle varying cutoffs and choice specific slope parameters in the same way, i.e. we don't treat cutoffs separately from choice specific slope parameters and could implement both extensions in the same way. (categorical dummy variable representation of cutoffs)

mateosere commented 2 months ago

Hi,

Thank you for the detailed follow-up and questions.

Here’s a simplified excerpt from the Stata output to illustrate the country-specific thresholds. Due to the large dataset, Stata struggles with memory issues, which is why I’m trying to achieve the same with Python.


-----------------------------------------------------------------------------------
            score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
1                 |
          Algeria |  -1.06e+08          .        .       .            .           .
        Argentina |   -4.82174   1.385257    -3.48   0.001    -7.536795   -2.106686
        Australia |  -4.93e+09          .        .       .            .           .
          Austria |          0  (omitted)
          Bahrain |   .4265083    1.00909     0.42   0.673    -1.551272    2.404288
          Belgium |  -.3204708   .5175524    -0.62   0.536    -1.334855    .6939133
           Brazil |          0  (omitted)

           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |   4.467545   .0809258    55.21   0.000     4.308933    4.626156
------------------+----------------------------------------------------------------
2                 |
          Algeria |   4.577465   .0223119   205.16   0.000     4.533735    4.621196
        Argentina |  -6.847872   2.194663    -3.12   0.002    -11.14933   -2.546413
        Australia |   39.33129   .0635776   618.63   0.000     39.20668     39.4559
          Austria |   .2576597   4.953937     0.05   0.959    -9.451878    9.967197
          Bahrain |  -.2574898   .5903717    -0.44   0.663    -1.414597    .8996174
          Belgium |  -.1113062   .4646262    -0.24   0.811    -1.021957    .7993443
           Brazil |    115.868   .1041519  1112.49   0.000     115.6639    116.0722
           couple |  -.0547176   .0609596    -0.90   0.369    -.1741962     .064761
           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |   4.029434   .0692498    58.19   0.000     3.893706    4.165161
.
.
.

------------------+----------------------------------------------------------------
9                 |
          Algeria |  -.2637795    .474502    -0.56   0.578    -1.193786    .6662273
        Argentina |   .3087265   .2999835     1.03   0.303    -.2792304    .8966835
        Australia |   .5907469   .1558285     3.79   0.000     .2853286    .8961653
          Austria |  -.4181689   .2345123    -1.78   0.075    -.8778045    .0414667
          Bahrain |   .0045311   .1884284     0.02   0.981    -.3647818     .373844
          Belgium |  -.5115452   .1644764    -3.11   0.002     -.833913   -.1891775
           Brazil |   .4271591   .1435852     2.97   0.003     .1457374    .7085809
          leisure |  -.2934047   .0358448    -8.19   0.000    -.3636592   -.2231503
           nights |   .0203716   .0068274     2.98   0.003     .0069902     .033753
         domestic |  -.0974249   .0349229    -2.79   0.005    -.1658725   -.0289773
                  |
unique_booking_id |
          721137  |  -.4818849    .045815   -10.52   0.000    -.5716806   -.3920891
          802036  |   .0652769   .0449225     1.45   0.146    -.0227697    .1533234
          802918  |  -.0103567   .0546475    -0.19   0.850    -.1174639    .0967505
          803701  |   -.472349   .0469168   -10.07   0.000    -.5643042   -.3803937
          805873  |   .9872502   .0472735    20.88   0.000     .8945958    1.079905
                  |
            _cons |  -.6506156   .0412478   -15.77   0.000    -.7314597   -.5697715
-----------------------------------------------------------------------------------
josef-pkt commented 2 months ago

how large is your dataset? how many countries and observations? The number of parameters will be pretty large with many countries and 9 (or 10?) levels of y.

Do you know why level 1 has strange results? Algeria, Australia huge coefficient and no inferential statistics. Austria and Brazil, omitted, I guess constrained to zero

Given that there is an extra intercept _cons: Is one of the countries dropped as reference level to avoid unidentified duplicated constant?

josef-pkt commented 2 months ago

Given that there is an extra intercept _cons: Is one of the countries dropped as reference level to avoid unidentified duplicated constant?

shouldn't be a problem: intercept _cons only fixes the lowest level. AFAICS, there are no reference level cutoffs, i.e. it's like full one-way categorical regressor without including a constant in the equation.

josef-pkt commented 2 months ago

BTW: Did you try MNLogit which is more general but does not impose the ordinality constraints?

Can it handle the size of your data with country dummies?

josef-pkt commented 2 months ago

In terms of implementation:

I guess it would be worth it to implement it as direct extension to OrderedModel with cutoffs depending on a categorical "exog" instead of merging it with gologit choice-specific parameters. (fixed effects cutoffs) I think being able to impose monotonicity and reduced computational/technical problems is worth the extra effort.

(so it's not necessarily a duplicate of #8444 anymore)

one of the main tricky parts in the implementation will be that cutoffs are essentially 3-dimensional, and we might need ravel and reshape similar to MNLogit.