pyronear / pyro-risks

Data science for wildfire risk forecasting and monitoring
https://pyronear.github.io/pyro-risks
Apache License 2.0

[models] Improve classification pipelines (resampling, tuning) #46

Closed jsakv closed 7 months ago

jsakv commented 3 years ago

Motivation

Currently, the XGBoost and Random Forest training pipelines defined in the models/pipelines.py module only implement preprocessing and scoring steps. Since our training dataset is imbalanced, resampling is likely to improve performance. Moreover, the selected hyperparameters are not relevant to the current pipelines.

✨ Change

To improve the performance of the scoring pipeline, we should implement and test various resampling strategies. This would require:

Additional context

GHCamille commented 3 years ago

✨ Resampling situation assessment ✨

... and a first try with SMOTE

Current situation:

Screenshot 2021-05-04 at 19 39 56

With SMOTE only (no additional tuning):

Screenshot 2021-05-04 at 19 40 02

Results:

smote_no_ratio

With SMOTE tuned with a 3x ratio:

Screenshot 2021-05-04 at 19 40 06

Results:

smote_ratio_3

With SMOTE tuned with a 5x ratio:

Screenshot 2021-05-04 at 19 40 09

Results:

smote_ratio_5_dataset
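
To make the ratio in the runs above concrete, here is a minimal, dependency-free sketch of what SMOTE does: it creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The function name `smote_sketch`, the toy `fires` data, and the reading of "3x ratio" as "grow the minority class to three times its size" are all assumptions for illustration; the thread does not say how the ratio was defined or which implementation was used.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority neighbours.

    Hypothetical helper for illustration only, not the project's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (made up): 4 minority rows with two features each.
fires = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Assumed reading of a "3x ratio": minority grows to 3x its original size,
# so we add 2 * len(fires) synthetic rows.
new_rows = smote_sketch(fires, n_new=2 * len(fires), k=2)
resampled_fires = fires + new_rows
```

In practice one would more likely use a library implementation such as `SMOTE` from imbalanced-learn, where a float `sampling_strategy` means the desired minority/majority ratio after resampling. That is a different definition from the multiplier assumed in this sketch.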

Conclusion

=> For now, the best SMOTE configuration is without any ratio tuning.

=> Let's see whether combining heavy oversampling with undersampling does better: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
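
As a rough idea of what such a combination could look like, here is a minimal sketch that first oversamples the minority class and then undersamples the majority class. To keep it dependency-free it uses plain random oversampling (duplicating rows) in place of SMOTE. The function name `over_then_under`, the two ratio parameters, and the toy data are assumptions, not something taken from the linked article or from pyro-risks. Both ratios are read as minority/majority after the step, which is how imbalanced-learn interprets a float `sampling_strategy`.

```python
import random


def over_then_under(rows, labels, over_ratio, under_ratio, seed=0):
    """Oversample class 1, then undersample class 0 (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]

    # Step 1: duplicate minority rows until minority/majority ~= over_ratio.
    target_min = max(len(minority), round(over_ratio * len(majority)))
    extra = [rng.choice(minority) for _ in range(target_min - len(minority))]
    minority = minority + extra

    # Step 2: drop majority rows until minority/majority ~= under_ratio.
    target_maj = min(len(majority), round(len(minority) / under_ratio))
    majority = rng.sample(majority, target_maj)

    new_rows = majority + minority
    new_labels = [0] * len(majority) + [1] * len(minority)
    return new_rows, new_labels


# Toy data (made up): 100 "no fire" rows and 5 "fire" rows, one feature each.
rows = [(float(i),) for i in range(105)]
labels = [0] * 100 + [1] * 5

# Grow the minority to 20% of the majority, then shrink the majority
# until the minority is 50% of it.
new_rows, new_labels = over_then_under(rows, labels, over_ratio=0.2, under_ratio=0.5)
```

Resampling like this should only be applied to the training split, so that validation scores still reflect the real class balance.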

jsakv commented 3 years ago

Thanks, @GHCamille, that's a great improvement in recall over the baseline pipelines! I'm curious: are you training the Random Forest classifier or the XGBoost classifier?

Also, it might help to experiment a bit more with the class-weight hyperparameters of the different classifiers, which rebalance the classes during sampling.

We'll see if it improves performance a bit!
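
For reference, a minimal sketch of how such class weights are commonly derived from the label counts. `balanced_class_weights` reproduces the heuristic behind scikit-learn's `class_weight="balanced"` option, n_samples / (n_classes * count), and `neg_pos_ratio` computes the negatives/positives ratio that the XGBoost documentation suggests as a starting value for `scale_pos_weight`. Both function names and the toy labels are assumptions for illustration; the thread does not say which settings were actually tried.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def neg_pos_ratio(labels):
    """Ratio of negative (0) to positive (1) labels."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# Toy labels (made up): 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5

weights = balanced_class_weights(labels)  # rare class gets the larger weight
ratio = neg_pos_ratio(labels)
```

The `weights` dictionary could be passed as `class_weight` to a scikit-learn `RandomForestClassifier`, and `ratio` as `scale_pos_weight` to an XGBoost classifier. Whether this helps on top of resampling would need to be checked on the validation scores.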

GHCamille commented 3 years ago

Hey, thanks! Only RandomForest. I cannot run XGBoost on my computer for now.

OK, thank you for this info! 👌