nantesmetropole / school_meal_forecast_xgboost

MIT License

Improve genericity of statistical features #6

Open nicolas-vtg opened 2 years ago

nicolas-vtg commented 2 years ago

In app/preprocess.py, the function used to compute statistical features (average number of individuals) needs to be improved (see the TODO at line 200). To keep these calculations consistent, some data is filtered out; however, the filter is a hardcoded list of school years. The filter prevents statistical features from being computed from the target values when the repository is run on past data, but it needs to be generalized.

To investigate, one could update this filter based on the data that is currently targeted. For instance, if the user is predicting for school year 2021-2022, then data from '2021-2022' should be removed, but prior data can be kept ('2018-2019', '2019-2020', '2020-2021') since they represent the past.

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2021-2022")]

A fix would be to define the list of data to exclude based on parameters used to call the app in order to make it consistent for both
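
As a rough illustration of that idea, here is a minimal sketch that derives the school years to exclude from the prediction window instead of hardcoding them. The helper names and the `begin_date` / `end_date` parameters are hypothetical (the app's actual call parameters may be named differently), and it assumes a school year runs from September to August.

```python
from datetime import date


def school_year_of(day: date) -> str:
    """Return the school-year label containing `day`, e.g. '2021-2022'.

    Assumption: a school year runs from 1 September to 31 August.
    """
    start = day.year if day.month >= 9 else day.year - 1
    return f"{start}-{start + 1}"


def school_years_to_exclude(begin_date: date, end_date: date) -> list:
    """List every school year touched by the prediction window.

    Hypothetical helper: `begin_date` and `end_date` stand in for whatever
    parameters the app is actually called with.
    """
    first = int(school_year_of(begin_date)[:4])
    last = int(school_year_of(end_date)[:4])
    return [f"{year}-{year + 1}" for year in range(first, last + 1)]


# Predicting Sept.-Dec. 2021 excludes only the 2021-2022 school year.
excluded = school_years_to_exclude(date(2021, 9, 1), date(2021, 12, 31))
print(excluded)  # ['2021-2022']

# With pandas, the hardcoded filter could then become:
# remove_real_lines = all_data[~all_data["annee_scolaire"].isin(excluded)]
```

This only removes the years covered by the prediction window, as in the example above. If later years should never feed the statistics either, the filter would instead need to keep only years strictly before the first excluded one.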

fBedecarrats commented 2 years ago

The model produces reliable forecasts until the end of school year 2020-2021 (~16 000 meals/day), but it starts generating very low values (~7000 meals/day) from Sept. 2021 (see detailed outputs for Sept.-Dec. 2021). The initial function had the following at line 201:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2019-2020") & (all_data["annee_scolaire"] != "2018-2019")]

I replaced this line with:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2020-2021")]

and re-launched the model for the same Sept.-Dec. 2021 period. The model outputs changed only slightly (see detailed results after code modification). I'm afraid this is not the source of the prediction errors since Sept. 2021.