🚀 Feature
✨ LightGBM pipeline ✨
Motivation
So far we have only implemented two training pipelines (XGBoost and Random Forest); we should explore other algorithms and modeling strategies.
Design Proposal
In order to comply with the pyro_risks project, the LightGBM scoring pipeline should be:
- Compliant with the scikit-learn API (model step → LGBMClassifier)
- Defined as an imbalanced-learn Pipeline
- Defined with the hyperparameters set in config/models.py
Keeping up with the updates mentioned in issue #46 would require the following:
[ ] Implement a resampling step.
[ ] Select optimal hyperparameters
[ ] Account for production constraints. As discussed in issue #31, the dataset features are only available on day D+5, so the target would need to be shifted by 5 days.
[ ] (Optional) Add new preprocessing steps
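The target shift mentioned above could be sketched as follows, on a hypothetical daily-indexed DataFrame (the column names and the shift direction — aligning day D's features with day D+5's label — are assumptions, not code from the repo):

```python
import pandas as pd

# Hypothetical example: features observed on day D only become available on
# day D+5, so day D's features should predict day D+5's label.
df = pd.DataFrame(
    {"target": [0, 0, 1, 0, 0, 0, 1, 0]},
    index=pd.date_range("2020-01-01", periods=8, freq="D"),
)

# Pull the label 5 days back so each row pairs features with the D+5 target
df["shifted_target"] = df["target"].shift(-5)

# The last 5 rows have no future label after the shift and must be dropped
df = df.dropna(subset=["shifted_target"])
```

Note that the shift discards the final 5 days of the dataset, which have no observable label yet.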
Additional context
The dataset used for modeling is available via MergedEraFwiViirs(); the documentation of the different features is available at https://pyronear.org/pyro-risks/
It is better to split the dataset using the TEST_SIZE and RANDOM_STATE values from the config module.
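A minimal split sketch, using a stand-in DataFrame in place of the MergedEraFwiViirs() dataset and placeholder values for TEST_SIZE and RANDOM_STATE (in the project these come from the config module):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholders for the values defined in pyro_risks' config module
TEST_SIZE = 0.2
RANDOM_STATE = 42

# Stand-in for the merged ERA/FWI/VIIRS dataset
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
X, y = df.drop(columns="target"), df["target"]

# Stratify to preserve the (imbalanced) class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
```

Stratifying matters here because wildfire labels are rare; a plain random split could leave the test set with almost no positive examples.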
To avoid skewing the training dataset (which leads to less accurate imputation of missing values), it is better to define the pipeline steps in the following order: preprocessing >> upsampling >> modeling
Hyperparameter tuning can be long; use the script below to cache pipeline steps during the search and speed up hyperparameter tuning:
from joblib import Memory
from imblearn.pipeline import Pipeline

# Cache fitted pipeline steps on disk so repeated fits during the
# hyperparameter search reuse previously computed transformers
location = './cachedir'
memory = Memory(location, verbose=0)

# 'Step' is a placeholder for an actual transformer or estimator
pipe = Pipeline([('step', Step)], memory=memory)