Sandy4321 opened 4 years ago
Could you elaborate on what the method is doing?
Sure. Empirical research has shown that class imbalance does not necessarily mean worse performance. The proposed method consists of grid-searching over several (and reverse) re-samplings to produce several classifiers and picking the best one. A bias correction is then made (easily in the case of trees) in the form of a higher or lower threshold per leaf, based on the ratio between the original and the resampled class frequencies (e.g. if you sampled a class twice as frequently, the threshold for classifying into that class should be twice as high per leaf). The annotated version shows what I believe to be the essence.
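To illustrate, here is a minimal sketch, assuming a binary problem where class 1 was oversampled by a known factor (debias_proba is a made-up helper, not from the article or any library):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def debias_proba(tree, X, rate):
    """Correct the per-leaf posteriors of a tree fitted on resampled data
    where class 1 was oversampled by `rate` (rate=2.0 means class 1 appears
    twice as often as in the original data)."""
    proba = tree.predict_proba(X)   # leaf class frequencies on the resampled data
    corrected = proba.copy()
    corrected[:, 1] /= rate         # undo the inflation of class 1 in each leaf
    return corrected / corrected.sum(axis=1, keepdims=True)

# usage: tree = DecisionTreeClassifier().fit(X_resampled, y_resampled)
#        y_pred = debias_proba(tree, X_test, rate=2.0).argmax(axis=1)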
@ShaharKatz do you know of some Python code to illustrate this? For any classifier?
As far as I read, the article itself doesn't come with source code. The intuition and example case are pretty straightforward: take an imbalanced binary-label dataset that is perfectly linearly separable. An SVM shouldn't have a problem with this dataset, but by downsampling the majority you can lose the fine-tuning of the edge. I think I'll first show that this actually works for a dataset with reproducible code, and then, if the results are good, incorporate it.
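Something along these lines (a synthetic dataset of my own, not from the article):

import numpy as np
from sklearn.svm import LinearSVC
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.RandomState(0)
X_maj = rng.normal(loc=-2.0, size=(500, 2))   # majority class, left cluster
X_min = rng.normal(loc=+2.0, size=(25, 2))    # minority class, right cluster
X = np.vstack([X_maj, X_min])
y = np.array([0] * 500 + [1] * 25)

# fitted on all data, the margin is set by the points closest to the boundary
svm_full = LinearSVC(max_iter=10000).fit(X, y)

# downsampling may discard exactly those boundary-defining majority points
X_ds, y_ds = RandomUnderSampler(random_state=0).fit_resample(X, y)
svm_ds = LinearSVC(max_iter=10000).fit(X_ds, y_ds)

print(svm_full.coef_, svm_full.intercept_)
print(svm_ds.coef_, svm_ds.intercept_)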
It looks to me as if it would be easy to do with normal scikit-learn/imbalanced-learn components in a few lines of code (pipeline + grid-search).
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# make iris artificially imbalanced: 10 / 20 / 50 samples per class
data = load_iris()
X, y = data.data, data.target
X, y = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=42
)

# the sampler lives inside the pipeline, so it is applied per CV fold
model = Pipeline(
    [("sampler", SMOTE()),
     ("scaler", StandardScaler()),
     ("clf", LogisticRegression())]
)

# grid-search over several resampling strategies and keep the best one
param_grid = {
    "sampler__sampling_strategy": [
        {0: 20, 1: 30}, {0: 30, 1: 30}, {0: 30, 1: 20},
    ]
}
grid = GridSearchCV(model, param_grid=param_grid).fit(X, y)
print(grid.best_params_)
So unless I am missing something, IMO it would not be worth creating an estimator wrapping all possible parameters of these models when it seems pretty easy to create a pipeline in this case.
WDYT?
Maybe the current implementation already addresses this, but the over/under-sampling is only the first part; the second part (which might still need implementation) is the de-biasing of the estimator's results. The article shows it for trees (which is very intuitive); for logistic regression we need a different correction, though.
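For logistic regression, one candidate (not necessarily what the article intends) is the prior correction of King & Zeng (2001): shift the fitted intercept by the log-ratio of the resampled to the original class odds. A minimal sketch:

import numpy as np
from sklearn.linear_model import LogisticRegression

def correct_intercept(clf, pos_orig, pos_resampled):
    """Undo the shift in the log-odds baseline introduced by resampling,
    given the positive-class proportions before and after resampling."""
    odds_orig = pos_orig / (1.0 - pos_orig)
    odds_res = pos_resampled / (1.0 - pos_resampled)
    clf.intercept_ = clf.intercept_ - np.log(odds_res / odds_orig)
    return clf

# usage: clf = LogisticRegression().fit(X_resampled, y_resampled)
#        clf = correct_intercept(clf, pos_orig=0.05, pos_resampled=0.5)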
Yes, the de-biasing of the estimator's results would be great to add.
What is the problem? @ShaharKatz knows what to do, so let him do it, and afterwards you can test how good it is. If @ShaharKatz wants to do it, why stop him?
@Sandy4321 We have to be careful when adding a new algorithm to the source code. Basically, code comes with the responsibility to maintain it, so we need to weigh the benefits and limitations of the current solution and decide whether or not it is worth adding.
That said, I have not looked at the paper yet, so I cannot say whether it is worth it or not. Speaking about debiasing, I would think that this should be linked to the scoring used during the fit of the GridSearchCV, and might be implemented using make_scorer from scikit-learn.
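A rough sketch of what I mean, assuming the correction can be expressed purely via predicted probabilities and a known oversampling factor (RATE here is hard-coded for illustration; a real implementation would need to read it from the fitted sampler):

from sklearn.metrics import make_scorer, balanced_accuracy_score

RATE = 2.0  # hypothetical oversampling factor of the positive class

def debiased_accuracy(y_true, y_proba):
    # depending on the scikit-learn version, binary needs_proba scorers
    # receive either the full (n, 2) matrix or only the positive column
    pos = y_proba[:, 1] if y_proba.ndim == 2 else y_proba
    # raise the positive threshold from 0.5 to RATE / (1 + RATE)
    y_pred = (pos > RATE / (1.0 + RATE)).astype(int)
    return balanced_accuracy_score(y_true, y_pred)

scorer = make_scorer(debiased_accuracy, needs_proba=True)
# grid = GridSearchCV(model, param_grid=param_grid, scoring=scorer)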
OK, so I see that the debiasing is actually a ratio at the leaf level in the tree, so this would have to be added to the tree code base of scikit-learn directly. I am wondering if it could always be applied, even when not resampling the dataset?
One thing I am not sure about is how well this method works with deep trees, where you will have very few samples in each leaf.
Regarding the implementation @glemaitre suggested: this isn't really a simple scorer, since it must know the resampling technique used in the pre-processing stage. On the other hand, this isn't really a preprocessing step, since it obviously takes action during inference.
On one hand this shouldn't be model-specific, since most models don't do the resampling internally (which is why this repo comes in handy), but on the other hand the model implementation is relevant, since the correction goes down to the leaf level (in trees).
This is the reason I think this repo is the best place for it: it deals specifically with imbalanced learning, and it can take this "hybrid" component which doesn't necessarily play nicely with the existing interfaces.
@ShaharKatz If it is so complicated to incorporate your great suggestion into this package, would you consider creating a stand-alone Python package to benefit all of us? Afterwards you could add your code to this package, once the package owners have tested it... Please do not give up, we need your code...
The article shows it for trees (which is very intuitive); for logistic regression we need a different correction, though.
Could this be generalized across predictors (1)?
On the other hand, this isn't really a preprocessing step, since it obviously takes action during inference.
If the answer to (1) is yes, then I suppose the above sentence indicates that the solution could be a meta-estimator, no?
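Something like this, perhaps (a sketch only; the class name and the global prior-shift correction are my own illustration, while the article's tree variant applies the correction per leaf instead):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class DebiasedResampledClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, sampler, estimator):
        self.sampler = sampler
        self.estimator = estimator

    def fit(self, X, y):
        # remember the class distribution before resampling
        self.classes_, counts = np.unique(y, return_counts=True)
        prior = counts / counts.sum()
        X_res, y_res = clone(self.sampler).fit_resample(X, y)
        _, counts_res = np.unique(y_res, return_counts=True)
        # per-class resampling rates, used to undo the bias at predict time
        self.rates_ = (counts_res / counts_res.sum()) / prior
        self.estimator_ = clone(self.estimator).fit(X_res, y_res)
        return self

    def predict_proba(self, X):
        proba = self.estimator_.predict_proba(X) / self.rates_
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]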
Hi, why be against this definite improvement? @ShaharKatz, could you do it as a separate package? We need your code.
Regarding your question, @chkoar: this is model-specific. We have a solution for trees and I'm currently looking at a solution for logistic regression; I don't have a general framework yet. @Sandy4321: I want to see that it provides value and can be generalised. If the generalisation allows this to be a meta-estimator, then there's no problem committing the code here; if not, then yes, this would require a different project.
@ShaharKatz Friends, there is an interesting discussion of what can be done better at https://github.com/catboost/catboost/issues/392#issuecomment-647048665 and a reference to this paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study, https://pdfs.semanticscholar.org/8939/585e7d464703fe0ec8ca9fc6acc3528ce601.pdf