scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Combine SMOTENC, TomekLinks, and a classifier in one pipeline for mixed-datatype datasets #1082

Open Sehjbir opened 6 months ago

Sehjbir commented 6 months ago

Description:

I have a dataset that contains both numeric and categorical variables, and I want to combine over-sampling and under-sampling. SMOTETomek is only applicable to purely numeric datasets, so I chained SMOTENC and TomekLinks in a pipeline instead.

Code Snippet:

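from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline  # imblearn's pipeline accepts samplers
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# category_cols, data_train and target_train are defined earlier (not shown).
model_oversampler_smotenc = make_pipeline(
    SMOTENC(random_state=44, categorical_features=category_cols),
    TomekLinks(sampling_strategy='auto'),
    GradientBoostingClassifier())

scoring = ['balanced_accuracy', 'f1', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
cv_results_oversampler_smotenc = cross_validate(
    model_oversampler_smotenc, data_train, target_train, scoring=scoring,
    return_train_score=True, return_estimator=True, cv=cv,
    n_jobs=-1)

print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].std():.3f}"
)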

Questions: