tidymodels / parsnip

A tidy unified interface to models
https://parsnip.tidymodels.org
Other
599 stars 89 forks source link

Ordinal mode (between classification and regression) #953

Open tripartio opened 1 year ago

tripartio commented 1 year ago

I think that a major missing gap in parsnip is explicit support for ordinal models, by which I mean models where the response variable is an ordered factor.

My proposal here is a follow up to closed issue 35, specifically focusing on ordinal models. @topepo wrote:

The main issue that I had as about how to organize the functions. In parsnip we try to have the main model functions describe the structural aspects of the model (e.g. linear_reg(), rand_forest(), etc). For the ordinal models based on generalized linear model (e.g. cumulative logits, adjacent categories etc), my thinking was to have:

ordinal_cumulative(link = "logit", odds = "proportional") ordinal_adjacent(link = "logit", odds = "proportional") and so on. My thinking is that people would probably want to look at the parallel assumption (assuming they have the right design for that) and tuning over the odd argument would be helpful.

How would you like to see these types of model organized?

I am not a statistical expert on ordinal models. but I have used them in my research code and I have studied to learn about different kinds. Based on my understanding, the examples above are the kind of detail that parsnip is supposed to abstract away. I would think that users care mainly about three things:

Function call versus mode

In his initial thoughts, @topepo seemed to frame ordinal models as a parsnip model type with names like ordinal_cumulative or ordinal_adjacent. But fundamentally, in parsnip terms, I do not think of an ordinal model as a model type; I think of it as a mode, just like classification and regression are modes. An ordinal model is simply one where the response variable type is neither an unordered class nor a real number but rather something in between, an ordered category. This is borne out by his mention of various types of ordinal (glmnetcr, ordinalNet, ordinalForest, party models for trees, and brms models). Each of these would correspond to their various underlying model types, be it logistic_reg or rand_forest, with their corresponding engines. ordinal would simply be the mode.

I don't think that most users need to get bogged down with the underlying ordinal model, especially when their interest is primarily predictive analytics that focuses primarily on the prediction rather than prescriptive or interpretive machine learning (IML) that cares about the meaning of the predictors. (Personally, I actually do care very much about these interpretive issues, but I am trying to think about the general user.) Users who are concerned with interpretation can simply specify the appropriate engine (e.g., MASS::polr or ordinal::clm when the parallel assumption is upheld and VGAM::vglm when it is not).

That said, I think it would be responsible to help newer users who are not familiar with the assumptions to do the test for them and warn them if it is violated. For example, if the default engine is ordinal::clm, the parsnip call could automatically run a parallel assumption test on the fitted model using something like brant::brant. If the test assumption fails (with default p-value of 0.05), then the model fit call could print a warning alerting the user that the data fails the parallel assumption test on which the chosen (or default) model depends, with a suggestion for another engine that does not require that assumption. The test results should probably be silent if there is no problem and only warn when there is. (Of course, all of these options should be customizable with arguments that could skip the test, change the p-value threshold, or print the results regardless of the result.)

Model outputs

Perhaps the reason that @topepo suggested distinct model types for different kinds of ordinal models is that the model outputs can be quite different. In particular, proportional (cumulative) odds models have a single coefficient for each predictor whereas other ordinal models like partial proportional or adjacent models might have as many as one coefficient per level of each predictor. But as I see it, these are details that are already handled in the Tidyverse by the broom package with its tidy function. Specifically for ordinal models,

tidy.clm, tidy.clmm, tidy.polr and tidy.svyolr return one row for each coefficient at each level of the response variable, with six columns

So, I don't think parsnip needs to worry about this, other than perhaps prioritizing the implementation of these engines that already have tidy handlers.

Prediction results

Unless I misunderstand something, the operation of the predict function for the ordinal mode should be very similar to that of multiclass classification: there should be options to return just the predicted ordinal class or to return probabilities for each class. Please let me know if there is a fundamental difference here that I am overlooking.

So, these are my thoughts. How feasible would it be add the ordinal mode to parsnip?

topepo commented 1 year ago

So there is a lot of grey area here. We have to make a choice, so here's my rationale...

I don't think that we need another mode. That usually entails system-wide changes.

Survival analysis is a good example. That impacts parsnip (new prediction types, inverse censoring weights), yardstick (dynamic and integrated performance measures), tune, and so on.

As a counter-example, we don't have different modes for binary and multinomial classification models. Their outcome data are slightly different but nothing else (outside of the engines that parsnip uses) has to change a lot.

I feel like ordinal models fall into the second case. The estimation methods change and that is confined to the functions called by parsnip (not even parsnip itself). We could have new performance metrics but there isn't much of a change beyond checking for ordered outcome classes.

Admittedly, part of my opposition is the significant amount of changes that occur with a new mode. I'd like to avoid those changes unless there are strong needs. Why add more to tune if the only changes are in parsnip, and so on.

So we'll use the classification mode for these models.

For the question of "just new engines", I think that is the case when there are no new (main) tuning parameters that should be easily accessible. We could add new engines in rand_forest() for ordinal engines because they fit into that format.

I do think that the generalized linear models do deserve their own functions with additional tuning parameters that modulate important structural elements of the model.

Users who are concerned with interpretation can simply specify the appropriate engine (e.g., MASS::polr or ordinal::clm when the parallel assumption is upheld and VGAM::vglm when it is not).

We'd like to avoid users viewing model structure changes in terms of the underlying function in the terminal package. It assumes that they know about them and we are once again in the bog of heterogeneity (i.e. syntax, interface, and argument differences between packages). We'd like to make changing options as easy as possible.

Also, tidymodels tends to lean on empirical validation a lot more than the average method. We want to facilitate model comparisons based on out-of-sample performance and not so much on p-values. Adding structure tuning parameters to glm-type ordinal models makes that easy to do.

So, let us know it you want to work on these bits. We'd be happy to help facilitate it if you want/need. If you disagree on the glm parts, we'd love to have work on the non-contentious parts (like random forests, etc). We'd probably want to have these in a separate package.

mattwarkentin commented 1 year ago

A little late to the party here, but FWIW, several months ago I started working on a parsnip-adjacent package (currently in a private repo) for implementing new parsnip models and engines for ordinal regression that I stopped working on when I got busy with other things. I would be interested in resurrecting this if there is still interest. I just want to avoid duplicating effort. Let me know!

At the time, I had considered MASS::polr(), ordinal::clm(), VGAM::vglm(), ordinalNet::ordinalNet(), and ordinalForest::ordfor() as engines.

simonpcouch commented 1 year ago

You'd not be duplicating effort (to my knowledge)! I'd say there is still interest here, and you're welcome to loop us in when you feel ready for testers and/or would appreciate technical help.

mattwarkentin commented 1 year ago

Okay cool, thanks for quick response, @simonpcouch. I probably can't spend much time on it until I finish a few things. But I can probably pick this up in about a month. I will check back then to see where things stand.

JavOrraca commented 1 month ago

@mattwarkentin I'm interested in the same kind of ordinal engines as you described and a tidymodels-friendly method would be awesome as an extension package. I'll keep an eye out and if you need extra hands, let me know!

corybrunson commented 1 month ago

I would also enjoy working on this!

Would the {parsnip} extension would be paired with an extension of or contribution to {yardstick} with additional metrics specifically designed for ordinal response variables? It would also be important to make existing numeric, class, and class probability metrics available for ordinal models, though that would seem to violate the current combination rules. (I'm just reading this comparison by Sakai.)

AmeliaMN commented 3 weeks ago

I'm realizing that I was part of the last issue thread and never responded. I agree with @tripartio about the types of things users would want out of ordinal regression functionality in tidymodels. @mattwarkentin, did you end up building some functionality? Is there something the folks on this thread can help with? I took a quick look at your repos and saw you have a fork of parsnip but I didn't see anything else that looked quite right.

topepo commented 3 weeks ago

@corybrunson

Would the {parsnip} extension would be paired with an extension of or contribution to {yardstick} with additional metrics specifically designed for ordinal response variables?

Yes, that describes the work well. I think that the yardstick work would go directly into that package and the modeling bits would go into an extension (the name ordered is not used on CRAN yet).

We could put a small extension together and just get it on CRAN then, with that blueprint, we can add more engines.

corybrunson commented 3 weeks ago

@topepo i'm digging into Tidymodels internals for a {recipes} extension, so i'd be glad to help draft a name-claim prototype as part of my familiarity-building.

topepo commented 3 weeks ago

@corybrunson That's great. Let us know if there is anything missing from the docs. I don't know that we'll need much for recipes with ordinal data.

topepo commented 3 weeks ago

For a first version of a modeling package extension, we could add ordinalForest and rpartScore since those would take minimal effort. We can add these as engines to the existing modeling functions.

I'll want to add ordinalNet but there are two complications:

corybrunson commented 3 weeks ago

@topepo thank you! Though, to clarify, the {recipes} extension is an unrelated project. I will raise an issue if i encounter persistent problems.

Is the plan for now for you to make your own prototype (alluded to in the linked comment) public? I would expect that even a prototype extension for ordinal models would include MASS::polr()—or would that also pose subtle design challenges?

topepo commented 3 weeks ago

Yes I would include MASS::polr() as well as the functions from the VGAM package.

mattwarkentin commented 3 weeks ago

Hi all. Sorry, I had good intentions to work on this package extension but at the time was moving across the country and starting a new position. I didn't really make much progress. Glad to see it is being picked up by others. These were some stream of consciousness notes I had scribbled down in a private issue that may or may not be helpful:

Engines:

  • [ ] MASS::polr()
  • [ ] ordinal::clm()
  • [ ] VGAM::vglm()
  • [ ] ordinalNet::ordinalNet()
  • [ ] ordinalForest::ordfor()

Looks like ordinalForest::ordfor() doesn't accept tibbles for data...

Also predict.ordfor doesn't give class probabilities. I have reached out to the author

  • Seems like ordfor() should be an engine for rand_forest()

  • polr(), clm(), vglm(), and ordinalNet() should be their own new model/engines

    • maybe ordinal_reg(...)
  • MASS::polr(), ordinal::clm(), VGAM::vglm(): formula method

  • ordinalNet::ordinalNet(): x/y method

  • ordinalForst::ordfor(): character string of Y var and data frame with Y+Xs

library(tidyverse)
library(tidymodels)
library(MASS)
library(ordinal)
library(VGAM)
library(ordinalNet)
library(ordinalForest)

df <- 
  as_tibble(housing) |> 
  uncount(Freq)

xm <- 
  recipe(Sat ~ Sat ~ Infl + Type + Cont, data = df) |> 
  step_dummy(all_predictors()) |> 
  prep() |> 
  juice(composition = 'matrix')

y <- df$Sat

p <- polr(Sat ~ Infl + Type + Cont, data = df)

c <- clm(Sat ~ Infl + Type + Cont, data = df)

v <- vglm(Sat ~ Infl + Type + Cont, data = df,
     family = cumulative(parallel = T))

n <- ordinalNet(x = xm, y = y)

f <- ordfor('Sat', data = as.data.frame(df), perffunction = 'probability')

predict(p, newdata = df, type = 'class')
predict(p, newdata = df, type = 'prob')

predict(c, newdata = df, type = 'linear.predictor')
predict(c, newdata = df, type = 'cum.prob')
predict(c, newdata = df, type = 'class')
predict(c, newdata = df, type = 'prob')

predict(v, newdata = df, type = 'response')
predict(v, newdata = df, type = 'link')

predict(n, newx = xm, type = 'class')
predict(n, newx = xm, type = 'response')
predict(n, newx = xm, type = 'link')

predict(f, newdata = df)
mattwarkentin commented 3 weeks ago

Regarding ordfor, after emailing the package author, I think both the tibble and class probabilities issues are solved. I think tibbles work now, and for probabilities...

To obtain probability predictions, you should set the 'perffunction' argument in 'ordfor()' to "probability" (perffunction="probability"). This will result in the use of the ranked probability score as the performance function. After applying the 'predict()' function to the resultant 'ordfor' object, the probabilities should not be displayed as NA. If this doesn't rectify the problem, please let me know so we can investigate further.

topepo commented 3 weeks ago

Thanks @mattwarkentin That's great work.

I threw this together over the weekend: https://github.com/topepo/ordered

I used that as the default for the model, and the prediction code checks it.

There is a bug in the package that triggers a false warning (I'll email the maintainer today).

We'll also need some new dial functions for some engine parameters.

topepo commented 3 weeks ago

TBH, I was hoping that someone else would make an extension so if someone is interested, I'd be happy to hand this off (or you can start one from scratch).

Ordinal regression has been relatively easy to do but we haven't had time to actually do it; I thought that a small initial package might kickstart the process.

topepo commented 3 weeks ago

Also, about this:

For a first version of a modeling package extension, we could add ordinalForest and rpartScore since those would take minimal effort. We can add these as engines to the existing modeling functions.

On review, the rpartScore implementation is problematic since it requires the outcome to be integers. As a consequence, rpart thinks that it is a regression model and will produce a numeric prediction.

mattwarkentin commented 3 weeks ago

@corybrunson Are you planning to run with developing the ordered package? I could get involved in a couple weeks after I wrap up presenting at a conference. But if you are ready to jump in, please feel free to take the lead.

corybrunson commented 3 weeks ago

@mattwarkentin i don't know that i should take the lead, but i'll put in some time as i have it!

I expect this will be a new kind of collaboration for me, so i'll try to err in favor of process. @topepo what's the typical Posit or Tidymodels protocol for co-development—if there is one?

And should we move this discussion to an issue thread on the new repo?

topepo commented 3 weeks ago

And should we move this discussion to an issue thread on the new repo?

We can keep it here.

what's the typical Posit or Tidymodels protocol for co-development—if there is one?

I don't know that it's different from others. We can follow some opinionated standards that we follow (most of the time). If you have not done much group collaboration, I highly recommend Davis's Tidyteam code review principles. It's a great reflection on how we we work together.

@mattwarkentin Do you want me to transfer the repo to you, or do you want to contribute to the existing one?

mattwarkentin commented 3 weeks ago

I don't think I have a strong preference. If I will be the long-term maintainer then it probably makes sense to transfer, but if you think once it is up-and-running that it will be under the tidymodels domain then I can contribute PRs to yours. What do you think?

topepo commented 2 weeks ago

I'll transfer it and will be happy to help if you have any issues.

topepo commented 2 weeks ago

Invitation sent!

mattwarkentin commented 2 weeks ago

Sorry, @topepo. I waited too long to accept and it expired. Can you please transfer again?

topepo commented 2 weeks ago