🚀 Feature
✨ LightGBM pipeline ✨
Motivation
So far we have only implemented two training pipelines (XGBoost and Random Forest); we should explore other algorithms and modeling strategies.
Design Proposal
In order to comply with the pyro_risks project, the LightGBM scoring pipeline should be:
- Compliant with the scikit-learn API (model step → LGBMClassifier)
- Defined as an imbalanced-learn Pipeline
- Defined with the hyperparameters set in config/models.py
Keeping up with the updates mentioned in issue #46 would require the following:
[ ] Implement a resampling step.
[ ] Select optimal hyperparameters
[ ] Account for production constraints. As discussed in issue #31, the dataset features are only available on day D+5, so the target would need to be shifted by 5 days.
[ ] (Optional) Add new preprocessing steps
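The target shift mentioned above could be sketched as follows, on a hypothetical daily-indexed DataFrame (the column names and the shift direction — aligning day D's features with day D+5's label — are assumptions, not code from the repo):

```python
import pandas as pd

# Hypothetical example: features observed on day D only become available on
# day D+5, so day D's features should predict day D+5's label.
df = pd.DataFrame(
    {"target": [0, 0, 1, 0, 0, 0, 1, 0]},
    index=pd.date_range("2020-01-01", periods=8, freq="D"),
)

# Pull the label 5 days back so each row pairs features with the D+5 target
df["shifted_target"] = df["target"].shift(-5)

# The last 5 rows have no future label after the shift and must be dropped
df = df.dropna(subset=["shifted_target"])
```

Note that the shift discards the final 5 days of the dataset, which have no observable label yet.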
Additional context
The dataset used for modeling is available via MergedEraFwiViirs(); the documentation of the different features is available at https://pyronear.org/pyro-risks/
It is better to split the dataset using the TEST_SIZE and RANDOM_STATE values from the config module.
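A minimal split sketch, using a stand-in DataFrame in place of the MergedEraFwiViirs() dataset and placeholder values for TEST_SIZE and RANDOM_STATE (in the project these come from the config module):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholders for the values defined in pyro_risks' config module
TEST_SIZE = 0.2
RANDOM_STATE = 42

# Stand-in for the merged ERA/FWI/VIIRS dataset
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
X, y = df.drop(columns="target"), df["target"]

# Stratify to preserve the (imbalanced) class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
```

Stratifying matters here because wildfire labels are rare; a plain random split could leave the test set with almost no positive examples.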
To avoid skewing the training dataset (which leads to less accurate imputation of missing values), it is better to define the pipeline steps in the following order: preprocessing >> upsampling >> modeling
Hyperparameter tuning can be long; use the script below to cache pipeline steps during the search and speed up hyperparameter tuning:
from joblib import Memory
from imblearn.pipeline import Pipeline

# Cache fitted pipeline steps on disk so repeated fits during the
# hyperparameter search reuse previously computed transformers
location = './cachedir'
memory = Memory(location, verbose=0)

# 'Step' is a placeholder for an actual transformer or estimator
pipe = Pipeline([('step', Step)], memory=memory)