FBruzzesi commented 10 months ago

Description

So I was looking for implementation(s) of ordinal classification in python (possibly scikit-learn compatible).

There exists a library called mord that implements a few strategies. And then I stumbled upon a simple approach to ordinal classification (*) which is also discussed in a "needs decision" scikit-learn issue.

This feature is not currently supported by scikit-learn, but it will potentially be in the future. I am not sure what the stance is towards this kind of situations. Considering the different speed at which we can move it would be a nice addition in my opinion.

(*) The paper

The idea of the paper is quite simple (and very meta): turn a ordinal classification problem with K classes, into K-1 binary problems. Example in image:

koaning commented 10 months ago

I'm certainly not closed to the idea, but after glancing at it ... it does seem like mord is already a pretty popular package with ~9K downloads a month. I'm not against adding such a feature here, but it might be good to confirm we're not doing something that mord is already doing.

Before investigating though, do we have a dataset/task that we could use to benchmark/docs?

FBruzzesi commented 10 months ago

something that mord is already doing

They seem to implement methods with custom losses (as they explain in the paper On the Consistency of Ordinal Regression Methods, "Finally, our analysis suggests a novel surrogate of the squared error loss").

On the other hand, a simple approach to ordinal classification is agnostic to the underlying classifier used (and its loss function).

do we have a dataset/task that we could use to benchmark/docs?

A quick one to explore could be the same that mord uses, namely the Copenhagen housing survey dataset, which however does not have many samples (~1681~ 72)
In the paper a simple approach to ordinal classification, the authors used some "benchmark datasets and converted the numeric target values into ordinal quantities using equal-frequency binning".
Potentially any dataset with some (discrete) ratings would work.

koaning commented 10 months ago

The Copenhagen housing dataset might just be fine. Might be worth including as a dataset of this library too.

But yeah, if you feel like exploring this and have a benchmark that demonstrates the merit I'd be totally open to adding it here!

FBruzzesi commented 10 months ago

Actually just realized that the Copenhagen housing survey dataset has only 72 samples.

One larger option to use could be the same dataset adopted by in a statsmodel example on ordinal regression from UCLA website with 400 observations.

To be honest I wasn't able to find much else 😞

koaning commented 10 months ago

I guess there are some sentiment analysis datasets that might work? Maybe some amazon reviews where folks have to give 1-5 star ratings? I think the statsmodel one might also be fine. The goal here is to convince ourselves that the method has merit. To do that we'd need to at least check with a dataset, but it's up to the end users of our library to do proper benchmarking themselves.

FBruzzesi commented 10 months ago

I just made a quick gist to compare the WIP implementation of OrdinalClassifier in my fork and sklearn OneVsRestClassifier on the UCLA dataset mentioned above.

The results seem comparable but slightly favorable to OrdinalClassifier for 2 out of 3 algorithms. I would be curious to reproduce the entire experiment of the paper at this point.

What do you think?

BTW the majority of class functionalities are there, I would like to add .score(), unit tests and a documentation section if we decide to move forward.

koaning commented 10 months ago

What do you think?

I'm curious, any reason why you're comparing against OneVsRestClassifier instead of the "normal" model without that meta estimator around it?

The benchmark shows that we're in the same ballpark, but even if we don't exceed a standard model ... it might be good enough given that our approach adds a nice constraint. Is it possible to create an example prediction where it's clear that a gradient boosted tree violates the ordinal assumption with the new meta estimator? If we can show that the performance is in the same ballpark but that this meta estimator adds a guarantee of ordinal behaviour then I'm also totally in favour of adding it straight away.

FBruzzesi commented 10 months ago

I'm curious, any reason why you're comparing against OneVsRestClassifier instead of the "normal" model without that meta estimator around it?

It was to avoid that each algorithms uses a different internal to run on multiclass. I just ran without OneVsRestClassifier and it didn't really change the comparison directions.

Is it possible to create an example prediction where it's clear that a gradient boosted tree violates the ordinal assumption with the new meta estimator?

I am having a hard time to come up with something which is not highly synthetic and without limiting the boosted trees depth (which however is then the same in both meta models).

As a side note, a nice feature is that we need to fit one less model though😊

koaning commented 10 months ago

I got this strange bug when I tried running your notebook.

CleanShot 2024-01-22 at 16 55 07

Could you open up a PR though? That'll make it easier for me to have a look/play with. As a maintainer, I also think it's fine for you to open up WIP PRs at any time. Makes collab/checking via codespaces a bunch easier.

FBruzzesi commented 10 months ago

Sure thing! I just didn't want to sneak a notebook in the codebase though. Maybe the notebook needs restarting after installation!?

WIP PR is coming 😊

koaning commented 10 months ago

I just realised that there's a pretty elaborate use-case for ordinal regression in quantile-regression land.

koaning commented 10 months ago

I'll keep the discussion going here since the PR should be mainly about code.

I've changed this function:

def compare_meta_models(base_estimator, X, y, scoring) -> pd.DataFrame:

    oc_estimator = OrdinalClassifier(clone(base_estimator), use_calibration=True, n_jobs=-1)
    print("OrdinalClassifier probas")
    print(oc_estimator.fit(X, y).predict_proba(X))
    oc_scores = score_estimator(oc_estimator, X, y, scoring)

    ovr_estimator = OneVsRestClassifier(clone(base_estimator), n_jobs=-1)
    ovr_scores = score_estimator(ovr_estimator, X, y, scoring)
    print("Base estimator probas")
    print(base_estimator.fit(X, y).predict_proba(X))

    scores = pd.merge(oc_scores, ovr_scores, left_index=True, right_index=True, suffixes=["_oc", "_ovr"])
    return (scores.reindex(sorted(scores.columns), axis=1))

I'm curious to look at the probas ... here's what I get for the histogram boosted tree.

HistGradientBoostingClassifier
OrdinalClassifier probas
[[7.52069855e-01 1.11582408e-01 1.36347738e-01]
 [4.18325761e-02 4.56932439e-01 5.01234985e-01]
 [3.92271567e-01 3.41473011e-01 2.66255422e-01]
 ...
 [9.57815273e-01 4.19909575e-02 1.93769722e-04]
 [7.52069855e-01 1.11582408e-01 1.36347738e-01]
 [5.96030509e-01 2.90996042e-01 1.12973449e-01]]

Base estimator probas
[[6.65751058e-01 1.93395393e-01 1.40853549e-01]
 [1.23919318e-01 5.19569450e-01 3.56511232e-01]
 [4.53215748e-01 3.42339154e-01 2.04445098e-01]
 ...
 [8.91230717e-01 1.07933657e-01 8.35626169e-04]
 [6.65751058e-01 1.93395393e-01 1.40853549e-01]
 [5.39872329e-01 3.14106728e-01 1.46020943e-01]]

The probas on that first row there ... [0.75, 0.11, 0.14] notice how there is a canyon between two higher numbers. I guess this is happening because the underlying model can take any shape ... but I'm wondering if there's anything we can do to guarantee that the output is properly ordinal.

koaning commented 10 months ago

For what it's worth, @FBruzzesi if you still feel it's a cool idea to add, I think I'm in favor. The paper itself demonstrates a fair enough benchmark, though I do hope the community can find a more compelling use-case. I'm myself curious it it may help in a quantile regression use-case, but that's a super experimental brainfart, no clue if it'll help.

FBruzzesi commented 10 months ago

I just realised that there's a pretty elaborate use-case for ordinal regression in quantile-regression land

Just on time 😁

The probas on that first row there ... [0.75, 0.11, 0.14] notice how there is a canyon between two higher numbers.

Yes true! That's an issue. Ideally one would expect 2 possible shapes (or actually one):

Max in the "corner" (first or last) class and then monotonic decreasing probabilities: $P(C_1) > P(C_2) > ... > P(C_n)$ or $P(C_1) < P(C_2) < ... < P(C_n)$
Max in "middle" class and then monotonic decreasing on the "sides": $P(C_1) < ... < P(C_k) > …>P(C_n)$

The example you just reported should fall into the first case but violates monotonicity 😕

I guess this is happening because the underlying model can take any shape

Most likely yes, and that is true for the out of the box multiclass methods from scikit-learn as well. I would argue that is the reason why probabilities should be calibrated properly within each model (addressing the comment in the PR) and not only at the end.

but I'm wondering if there's anything we can do to guarantee that the output is properly ordinal.

Let me think about this more deeply when I have some time

koaning commented 10 months ago

If we can't guarantee it now that's fine I think ... but it would be something to keep in mind for later. Calibration might help ... but it would be nice to demonstrate that with a benchmark. The logistic regression models seem fine here, it's more the tree models that I'm worried about.

But again, I'm also fine with adding a proper constraint later.

FBruzzesi commented 10 months ago

Hey @koaning , I double checked the "calibration" thinghy, and apparently the main issue was how I was using calibration after fitting the model.

I just committed the following changes in the PR:

Added unit tests
Added docs in meta user guide
Changed how the "calibration" part of the meta estimator works.
Added a section in the notebook in which I check for monotonicity constraint to be satisfied. When fitting the model this was, you can see that the constraint is always satisfied if fitting a calibrated estimator, but not in other cases.

koaning / scikit-lego

[FEATURE] Meta Ordinal Classification #607

Description

(*) The paper