Closed jsakv closed 7 months ago
... and SMOTE first try
Current situation:

With SMOTE only (no additional tuning):
Results:

With SMOTE tuned with a 3x ratio:
Results:

With SMOTE tuned with a 5x ratio:
Results:

=> For now, the best SMOTE configuration is without any ratio tuning.
=> Let's see if it works better when we combine heavy oversampling with undersampling: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
Thanks @GHCamille, that's a great improvement of the recall over the baseline pipelines! I am curious: are you training the Random Forest classifier or the XGBoost classifier?
Also, it might help to experiment a bit more with the class-weight hyperparameters of the different classifiers, which rebalance the classes during sampling:
- `RandomForestClassifier` already has its `class_weight` parameter set to `"balanced"`.
- `XGBoostClassifier` has a `scale_pos_weight` parameter that we should set around `sum(negative instances) / sum(positive instances)`.

We will see if it improves the performance a bit!
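For reference, a minimal sketch of both rebalancing options mentioned above. The toy labels are an illustrative assumption (950 negatives / 50 positives), so the 19.0 value is specific to this example, not our dataset.

```python
# Sketch: the two class-rebalancing hyperparameters discussed above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced labels standing in for our real training labels
y_train = np.array([0] * 950 + [1] * 50)

# Random Forest: class_weight="balanced" reweights each class as
# n_samples / (n_classes * class_count)
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# XGBoost: scale_pos_weight ~ sum(negative instances) / sum(positive instances)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)  # 19.0 for this toy example
```

The computed value would then be passed as `scale_pos_weight=...` when instantiating the XGBoost classifier.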
Hey, thanks! Only RandomForest; I cannot run XGBoost on my computer for now.
Ok, thank you for this info! 👌
Motivation

Currently, the XGBoost and Random Forest training pipelines defined within the `models/pipelines.py` module only implement preprocessing and scoring steps. Given that our training dataset is imbalanced, resampling is likely to improve performance. Moreover, the hyperparameters currently selected are not relevant to these pipelines.

✨ Change

In order to improve the scoring pipeline performance, we should implement and test various resampling strategies. This would require:
Additional context

- The training dataset is loaded with `MergedEraFwiViirs()`; the documentation of the different features is available here: https://pyronear.org/pyro-risks/
- Use the `TEST_SIZE` and `RANDOM_STATE` values defined in the `config` module.
- Pipeline structure: `preprocessing >> upsampling >> modeling`
- Hyperparameter tuning can be long; caching the steps that are not modified during the search speeds up hyperparameter tuning.