Closed fkiraly closed 2 months ago
Yes I think this would be a good thing to implement once #335 is complete and merged.
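I will be making the PR for this today, but I had a few doubts needing clarification:
- Since `bins` is going to represent the number of classes, wouldn't it make more sense to fetch it from the sklearn classifier using the `classes_` attribute?
- How do we take input of the other parameters to the different classifiers available in sklearn, as they are going to be different for each? Do I take it as a `kwargs` argument from the user?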
> Since `bins` is going to represent the number of classes, wouldn't it make more sense to fetch it from the sklearn classifier using the `classes_` attribute?
But that's available only once you've fitted it, which is later than construction. How would that work, logically?
> How do we take input of the other parameters to the different classifiers available in sklearn, as they are going to be different for each? Do I take it as a `kwargs` argument from the user?
No, you pass the entire classifier instance. As I'm saying above, parameters are `clf` - a classifier instance with its own parameters - and `bins`. I did not state expressly that `clf` is an instance, though that would follow the common pattern of composition: in `sklearn`-like manner, you use instances, not the class, so parameters of the instance are passed along with it.
> No, you pass the entire classifier instance. As I'm saying above, parameters are `clf` - a classifier instance with its own parameters - and `bins`.
Oh, I thought I had to take input as strings like I did in the case of statsmodels. If I take the input as a sklearn classifier instance then that's not an issue at all.
> But that's available only once you've fitted it, which is later than construction. How would that work, logically?
Since we are constructing the Histogram distribution only when we call `predict_proba`, that would mean it is already fitted. Is that not how we want it?
> Oh, I thought I had to take input as strings like I did in the case of statsmodels. If I take the input as a sklearn classifier instance then that's not an issue at all.
Yes, inputs being strings is "bad design" if a viable alternative is the composition/strategy patterns. Because with strings, you always have to add the encoding manually, whereas in composition you can pass any component that is API compliant.
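To make the composition point concrete, here is a minimal sketch. The class name `BinnedReducerSketch` is hypothetical and only illustrates how parameters travel with the instance; it is not the actual estimator from the PR. The user configures any sklearn classifier and passes the instance, so no `kwargs` forwarding or string lookup is needed.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


class BinnedReducerSketch:
    """Hypothetical wrapper, only illustrating how parameters are passed."""

    def __init__(self, clf, bins=10):
        # clf is an already-configured classifier instance, not a string or a class
        self.clf = clf
        self.bins = bins

    def fit(self, X, y_binned):
        # clone so that the instance passed by the user stays unfitted
        self.clf_ = clone(self.clf)
        self.clf_.fit(X, y_binned)
        return self


# each classifier carries its own, different parameters with it
a = BinnedReducerSketch(LogisticRegression(C=0.5, max_iter=200), bins=5)
b = BinnedReducerSketch(RandomForestClassifier(n_estimators=20, random_state=0))
```

Any classifier that follows the sklearn API can be plugged in this way, without the wrapper knowing its parameter names.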
> Since we are constructing the Histogram distribution only when we call `predict_proba`, that would mean it is already fitted. Is that not how we want it?
I think you still need the exact bins because you need to pass them to `bins` of the histogram distribution - knowing their number is not enough.
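A small sketch of why the fitted classifier alone is not enough: after binning, `classes_` only holds the bin labels, while the edges have to be computed and stored separately. The synthetic data, the choice of `DecisionTreeClassifier`, and the use of `np.searchsorted` for binning are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.1 * rng.normal(size=200)

n_bins = 4
# n_bins + 1 edges at equally spaced quantiles of the training targets
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))

# map each target to the index of its bin, 0 .. n_bins - 1;
# the clip puts the maximum into the last bin
y_binned = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_bins - 1)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_binned)

print(clf.classes_)  # bin labels only, no information on where the bins lie
print(edges)         # the edges that the histogram distribution needs
```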
From the discussion today, a short design for a reducer to multiclass classification mentioned in #7.
Parameters are:
- `sklearn` classifier `clf` capable of multiclass classification
- `bins` arg, default = 10. Possible values are int, or an ordered list of float.

The algorithm does as follows:
- if `bins` is int, replaces this arg internally by that many bins, at the `bins + 1` equally spaced quantiles of the empirical training distribution.
- in `fit`, fits `clf` to this binned training data
- in `predict_proba`, uses `clf.predict_proba` to obtain class probabilities, and uses these together with the bins from `bins` to obtain a `Histogram` distribution

One could also think about another algorithm where the bins are cumulative, i.e., being contained in the bin defined by lowest point to i-th bin. This is also valid but one needs to be careful that the resulting cdf is monotonic. Could be a choice of strategy.
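The design above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation from the PR: the class name `MulticlassReducerSketch` is hypothetical, and `predict_proba` returns the bin edges and per-bin probability masses as plain arrays, standing in for the `Histogram` distribution whose constructor is not shown in this thread.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


class MulticlassReducerSketch:
    """Hypothetical reducer: regression to multiclass classification over bins."""

    def __init__(self, clf, bins=10):
        self.clf = clf
        self.bins = bins

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        if isinstance(self.bins, int):
            # bins + 1 edges at equally spaced quantiles of the training targets
            self.bins_ = np.quantile(y, np.linspace(0, 1, self.bins + 1))
        else:
            # an ordered list of float is taken as the bin edges directly
            self.bins_ = np.asarray(self.bins, dtype=float)
        n_bins = len(self.bins_) - 1
        idx = np.searchsorted(self.bins_, y, side="right") - 1
        y_binned = np.clip(idx, 0, n_bins - 1)
        self.clf_ = clone(self.clf).fit(X, y_binned)
        return self

    def predict_proba(self, X):
        n_bins = len(self.bins_) - 1
        proba = self.clf_.predict_proba(X)
        # columns of proba follow classes_; bins unseen in training get zero mass
        bin_mass = np.zeros((proba.shape[0], n_bins))
        bin_mass[:, self.clf_.classes_] = proba
        return self.bins_, bin_mass


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + 0.1 * rng.normal(size=300)

reducer = MulticlassReducerSketch(LogisticRegression(max_iter=500), bins=5).fit(X, y)
edges, bin_mass = reducer.predict_proba(X[:3])
cdf_at_edges = np.cumsum(bin_mass, axis=1)  # cdf at the right edge of each bin
```

In this bin-membership variant the cdf is a running sum of non-negative masses, so it is monotonic by construction. In the cumulative variant each class probability would estimate a cdf value directly, and a separate correction step would be needed to enforce monotonicity.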
FYI @ShreeshaM07, @SaiRevanth25.