improve genericity of statistical features

In app/preprocess.py, the function used to compute statistical features (average number of individuals) needs to be improved (see the TODO at line 200).
To keep those calculations consistent, some data is filtered out, but the filter is a hardcoded list of school years. The filter prevents the statistical features from being computed with the target values when the repository is run on past data. It needs to be generalized.

To investigate, one could update this filter based on the data that is currently targeted. For instance, if the user is predicting for school year 2021-2022, then data from '2021-2022' should be removed, but prior data can be kept ('2018-2019', '2019-2020', '2020-2021') since they represent the past.

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2021-2022")]
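
To avoid hardcoding the label, the targeted school year could be derived from the first predicted date. This is only a minimal sketch: `school_year_of` is a hypothetical helper that does not exist in the repository, and it assumes that a school year starts in September and that labels follow the 'YYYY-YYYY' form used above.

```python
from datetime import date

# Assumption: a school year starts in September (not confirmed by the repository).
SCHOOL_YEAR_START_MONTH = 9


def school_year_of(day: date) -> str:
    """Return the 'YYYY-YYYY' school year label that contains `day`."""
    # Days before September belong to the school year started the previous year.
    start = day.year if day.month >= SCHOOL_YEAR_START_MONTH else day.year - 1
    return f"{start}-{start + 1}"
```

The returned label could then replace the literal "2021-2022" in the filter above, for instance by comparing `all_data["annee_scolaire"]` against `school_year_of(begin_date)`, where `begin_date` stands for whatever parameter carries the start of the prediction period.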

A fix would be to define the list of periods to exclude from the parameters used to call the app, so that the behaviour is consistent for both use cases:

- using the model on past values to analyse its performance
- using the model with future data

```
def add_statistical_features(all_data, list_of_period_to_exclude):
    """
    Compute statistical features using ratios, means, etc.
    """
    # TODO improve filtering here and remove NaNs
    # Keep only the rows whose school year is not in the exclusion list
    remove_real_lines = all_data[~all_data["annee_scolaire"].isin(list_of_period_to_exclude)]
```
where list_of_period_to_exclude is a list of school years computed before this function is called. A range of dates could be a nice evolution too, since it would allow recent data to be used to update those features.
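
One possible way to build that list before the call is sketched below. `periods_to_exclude` is a hypothetical helper, not part of the repository. It assumes 'YYYY-YYYY' labels, and it assumes that any school year starting at or after the targeted one should be excluded, which covers both running on past data and predicting the future.

```python
def periods_to_exclude(available_years, target_year):
    """Return the school years that start at or after the targeted one.

    `available_years` holds the 'YYYY-YYYY' labels found in the data and
    `target_year` is the label of the school year being predicted.
    """
    def start(label):
        # First year of a 'YYYY-YYYY' label, e.g. 2021 for '2021-2022'.
        return int(label.split("-")[0])

    return sorted(y for y in set(available_years) if start(y) >= start(target_year))


years = ["2018-2019", "2019-2020", "2020-2021", "2021-2022"]
# Predicting 2021-2022: only the targeted year is removed.
future_case = periods_to_exclude(years, "2021-2022")
# Replaying 2019-2020: the targeted year and everything after it are removed.
past_case = periods_to_exclude(years, "2019-2020")
```

The result would be passed as `list_of_period_to_exclude`, with `available_years` taken from the distinct values of the "annee_scolaire" column.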

The model produces reliable forecasts until the end of school year 2020-2021 (~16 000 meals/day), but it starts generating very low values (~7000 meals/day) from Sept. 2021 (see detailed outputs for Sept.-Dec. 2021).

The initial function had the following at line 201:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2019-2020") & (all_data["annee_scolaire"] != "2018-2019")]

I replaced this line with:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2020-2021")]

and re-launched the model for the same Sept.-Dec. 2021 period. The model outputs changed only slightly (see detailed results after code modification). I'm afraid this is not the source of the prediction errors since Sept. 2021.
