sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
232 stars 45 forks source link

[ENH] proba regression: reduction to multiclass classification #378

Closed fkiraly closed 2 months ago

fkiraly commented 3 months ago

From the discussion today, a short design for a reducer to multiclass classification mentioned in #7.

Parameters are:

The algortihm does as follows:

One could also think about another algorithm where the bins are cumulative, i.e., being contained in the bin defined by lowest point to i-th bin. This is also valid but one needs to be careful that the resulting cdf is monotonic. Could be a choice of strategy.

FYI @ShreeshaM07, @SaiRevanth25.

ShreeshaM07 commented 3 months ago

Yes I think this would be a good thing to implement once #335 is complete and merged.

ShreeshaM07 commented 3 months ago

I will be making the PR for this today had some few doubts needing clarification

fkiraly commented 3 months ago

since bins is going to represent the number of classes wouldn't it make more sense to fetch it from the sklearn classifier using the classes_ attribute?

But that's available only once you've fitted it, which is later than construction. How would that work, logically?

How do we take input of the other parameters to the different available classifiers in sklearn as they are going to be different for each do I take it as a kwargs argument from the user?

No, you pass the entire classifier instance. As I'm saying above, parameters are clf - a classifier instance with its own parameters - and bins. I did not state expressly that clf is an instance, though that would follow the common pattern of composition in sklearn-like manner, you use instances, not the class, so parameters of the instance are passed along with it.

ShreeshaM07 commented 3 months ago

No, you pass the entire classifier instance. As I'm saying above, parameters are clf - a classifier instance with its own parameters - and bins.

Oh I thought I had to take input as strings like I did in case of statsmodels. If I take the input as a sklearn classifier instance then thats not an issue at all.

ShreeshaM07 commented 3 months ago

But that's available only once you've fitted it, which is later than construction. How would that work, logically?

Since we are constructing the Histogram distribution only when we call predict_proba that would mean it is already fitted. Is that not how we want it ?

fkiraly commented 3 months ago

Oh I thought I had to take input as strings like I did in case of statsmodels. If I take the input as a sklearn classifier instance then thats not an issue at all.

Yes, inputs being strings is "bad design" if a viable alternative is the composition/strategy patterns. Because with strings, you always have to add the encoding manually, whereas in composition you can pass any component that is API compliant.

fkiraly commented 3 months ago

Since we are constructing the Histogram distribution only when we call predict_proba that would mean it is already fitted. Is that not how we want it ?

I think you still need the exact bins because you need to pass them to bins of the histogram distribution - knowing their number is not enough.