Closed jsakv closed 7 months ago
... and SMOTE first try
Current situation:

With SMOTE only (no additional tuning):
Results:

With SMOTE tuned with a 3x ratio:
Results:

With SMOTE tuned with a 5x ratio:
Results:

=> For now, the best SMOTE configuration is without any ratio tuning.
=> Let's see if it works better when we combine heavy oversampling with undersampling: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
Thanks @GHCamille, that's a great improvement of the recall over the baseline pipelines! I am curious: are you training the Random Forest classifier or the XGBoost classifier?
Also, it might help to experiment a bit more with the class-weight hyperparameters of the different classifiers, which rebalance the classes during sampling:
- `RandomForestClassifier` already has its `class_weight` parameter set to `"balanced"`.
- `XGBoostClassifier` has a `scale_pos_weight` parameter that we should set around `sum(negative instances) / sum(positive instances)`.

We will see if it improves the performance a bit!
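For reference, a minimal sketch of both rebalancing options mentioned above. The toy labels are an illustrative assumption (950 negatives / 50 positives), so the 19.0 value is specific to this example, not our dataset.

```python
# Sketch: the two class-rebalancing hyperparameters discussed above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced labels standing in for our real training labels
y_train = np.array([0] * 950 + [1] * 50)

# Random Forest: class_weight="balanced" reweights each class as
# n_samples / (n_classes * class_count)
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# XGBoost: scale_pos_weight ~ sum(negative instances) / sum(positive instances)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)  # 19.0 for this toy example
```

The computed value would then be passed as `scale_pos_weight=...` when instantiating the XGBoost classifier.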
Hey, thanks! Only RandomForest; I cannot run XGBoost on my computer for now.
Ok, thank you for this info! 👌
Motivation

Currently, the XGBoost and Random Forest training pipelines defined within the `models/pipelines.py` module only implement preprocessing and scoring steps. Given that our training dataset is imbalanced, resampling is likely to improve performance. Moreover, the hyperparameters currently selected are not relevant to these pipelines.

✨ Change

In order to improve the scoring pipeline performance, we should implement and test various resampling strategies. This would require:
Additional context

- The training dataset is loaded with `MergedEraFwiViirs()`; the documentation of the different features is available here: https://pyronear.org/pyro-risks/
- Use the `TEST_SIZE` and `RANDOM_STATE` values defined in the `config` module.
- Pipeline structure: `preprocessing >> upsampling >> modeling`
- Hyperparameter tuning can be long; caching the steps that are not modified during the search speeds up hyperparameter tuning.