paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study

Sandy4321 commented 4 years ago

Friends there is interesting discussion what can be done better https://github.com/catboost/catboost/issues/392#issuecomment-647048665 @ShaharKatz and reference to this paper The Effect of Class Distribution on Classifier Learning: An Empirical Study https://pdfs.semanticscholar.org/8939/585e7d464703fe0ec8ca9fc6acc3528ce601.pdf

glemaitre commented 4 years ago

Could you elaborate on what the method is doing?

ShaharKatz commented 4 years ago

Sure. Through empirical research it has been shown that class imbalance does not necessarily mean worse performance. The proposed method consists of grid search-ing on several (and reverse) re-sampling to produce several classifiers and pick the best one. A bias correction is made (easily in the case of trees) in the form of a higher or lower threshold per leaf based on the original-to-resample relation rate (e.g. if you sampled a class twice as frequent the threshold for classification to that class should be twice as high per leaf). The annotated version shows what i believe to be the essence.

Sandy4321 commented 4 years ago

@ShaharKatz do you know some python code to illustrate this ? for any classier ?

ShaharKatz commented 4 years ago

As far as I read, the article itself doesn't come with the source. The intuition and example case is pretty straightforward - take an imbalanced binary label dataset that is perfectly linear separable. SVM shouldn't have a problem with this dataset and by downsampling the majority you can loose the fine-tuning of the edge. I think first i'll show that this actually works for a dataset with reproducible code and than if the results are good, incorporate it.

glemaitre commented 4 years ago

It looks to me as it would be easy to do with normal scikit-learn/imbalanced-learn components in 3 lines of code (pipeline + grid-search).

from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

data = load_iris()
X, y = data.data, data.target

X, y = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=42
)

model = Pipeline(
    [("sampler", SMOTE()),
     ("scaler", StandardScaler()),
     ("clf", LogisticRegression())]
)

param_grid = {
    "sampler__sampling_strategy": [
        {0: 20, 1: 30}, {0: 30, 1: 30}, {0: 30, 1: 20},
    ]
}
grid = GridSearchCV(model, param_grid=param_grid).fit(X, y)
print(grid.best_params_)

glemaitre commented 4 years ago

So if I don't miss anything, IMO, it would not be worth creating an estimator that would wrap all possible parameters of these models while it seems pretty easy to create a pipeline in this case.

WDYT?

ShaharKatz commented 4 years ago

Maybe the current implementation already addresses this, but the over/under sampling is the first part, the second part (that might still needs implementation) is the de-bias-ing of the estimator's results. The article shows it for trees (which is very intuitive), for logistic regression we need a different correction though..

Sandy4321 commented 4 years ago

Yes the de-bias-ing of the estimator's results Wild be great to add this

Sandy4321 commented 4 years ago

What is problem ShaharKatz What's to do Let him do it And after you can test how good is it?

Sandy4321 commented 4 years ago

If ShaharKatz wants to it Why to stop him?

glemaitre commented 4 years ago

@Sandy4321 We have to be careful when adding a new algorithm in the source code. Basically code comes with the responsibility to maintain it. So we need to weigh the benefit and limitation of the current solution and decide if this is worth or not to add it.

This said I did not look at the paper yet so I cannot say if this is worth or not. When speaking about debiasing, I would think that this should be linked to the scoring used during the fit of the GridSearchCV and might be implemented using make_scorer from scikit-learn.

glemaitre commented 4 years ago

OK, so I see that the debiasing is actually a ratio at the leaf level in the tree. So this should be added in the tree code base from scikit-learn directly. I am wondering if it could always be applied even when not resampling the dataset?

glemaitre commented 4 years ago

One thing that I am not sure is how well this method works with deep trees where you will have very few samples in the leaf.

ShaharKatz commented 4 years ago

Regarding the implementation @glemaitre suggested - this isn't really a simple scorer since it must know the resampling technique used in the pre-processing stage. On the other hand, this isn't really a preprocessing step since it obviously takes action during the inference.

On one hand this shouldn't be model specific since most models don't do the resampling inside (which is why this repo comes in handy) but on the other hand the model implementation is relevant since it goes down to the leaf level (in trees).

It's:

An adjacent model to the actual model being trained.
it is fitted using the resampling technique(s) used.
it is model-specific (doesn't work just on the proba).

This is the reason i think this repo is the best place for it - cause it deals specifically with imbalanced learning and it can take this "hybrid" which doesn't necessarily plays nice with the existing interfaces.

Sandy4321 commented 4 years ago

ShaharKatz

Sandy4321 commented 4 years ago

@ShaharKatz If it is so complicated to incorporate your great suggestion to this package Would you like to create stand alone Python package to benefit all of us After you can add your code to this package, when package owner will test your code ... Please do not give up, we need your code...

chkoar commented 4 years ago

The article shows it for trees (which is very intuitive), for logistic regression we need a different correction though..

Could this generalized across predictors (1)?

On the other hand, this isn't really a preprocessing step since it obviously takes action during the inference.

If the answer to (1) is yes then I suppose that the above sentence indicates that the solutions it could be a meta-estimator, no?

Sandy4321 commented 4 years ago

hi why against this definite improvement? @ShaharKatz may you do it as separate package? we need your code

ShaharKatz commented 4 years ago

regarding your question - @chkoar - this is model specific, we have a solution for trees and i'm currently looking at a solution for logistic regression. Don't have a general framework yet. @Sandy4321 - I want to see that it provides value and can be generalised. If the generalisation allows for this to be a meta-estimator than there's no problem committing the code here, if not than yes, this would require a different project

scikit-learn-contrib / imbalanced-learn

paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study #730