lisphilar closed this issue 3 years ago
I tried `"regressor__max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9]` as a hyperparameter grid for `DecisionTreeRegressor`, but the R2 score decreased (before: `[3, 5, 7, 9]`).
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'pca_n_components': 1, 'score_metric': 'R2', 'score_train': 0.8691580023243415, 'score_test': -1.449940666224947, 'delay': 16}
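The double-underscore key routes the parameter to the named pipeline step. As a minimal sketch (with synthetic data; the actual CovsirPhy pipeline prepares its features differently), the PCA + decision tree setup reported above can be tuned like this:

```python
# Hedged sketch: tuning "regressor__max_depth" in a PCA + DecisionTreeRegressor
# pipeline with GridSearchCV. Data is synthetic, for illustration only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

pipeline = Pipeline([
    ("pca", PCA(n_components=1)),
    ("regressor", DecisionTreeRegressor(random_state=0)),
])
# "regressor__max_depth" targets the max_depth of the "regressor" step
param_grid = {"regressor__max_depth": list(range(1, 10))}
search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=5)
search.fit(X, y)
print(search.best_params_)
```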
At version 2.20.3-alpha with Italy data and today="06Jun2021",
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'pca_n_components': 1, 'score_metric': 'R2', 'score_train': 0.09405397649269268, 'score_test': -3.233491027704678, 'delay': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
Using `"regressor__max_depth": list(range(1, 10))` as the grid of max depth of the decision tree regressor,
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'pca_n_components': 1, 'score_metric': 'R2', 'score_train': -3.084686247190538, 'score_test': -2.8895351215217864, 'delay': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'scaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'pca_n_components': 8, 'score_metric': 'R2', 'score_train': 0.918263219376156, 'score_test': -1.2331265904202067, 'delay': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
Current status for Italy/06Jun2021
With #826, a light gradient boosting machine regressor for forecasting was added with the lightgbm package.
https://github.com/microsoft/LightGBM/tree/master/python-package
In `RegressorBase._split()`, backward filling (`pandas.DataFrame.bfill()`) was applied to the delayed indicator values to keep the number of train/test records.
X_delayed = X.copy()
for delay in delay_values:
X_delayed = X_delayed.join(X.shift(delay, freq="D"), how="outer", rsuffix=f"_{delay}")
X_delayed = X_delayed.ffill().bfill()  # bfill() back-fills leading NaNs with future values
However, this led to overfitting because non-existent combinations of parameter values and indicators were created. It should be as follows (pull request #827):
X_delayed = X.copy()
for delay in delay_values:
X_delayed = X_delayed.join(X.shift(delay, freq="D"), how="outer", rsuffix=f"_{delay}")
X_delayed = X_delayed.ffill() # .bfill()
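The difference can be reproduced on a toy DataFrame (hypothetical values, for illustration): with `ffill()` only, the leading rows keep NaN where no past indicator value exists, instead of being back-filled with future values that never co-occurred with those dates.

```python
# Minimal reproduction of the delayed-feature construction with ffill() only (#827).
import pandas as pd

X = pd.DataFrame(
    {"indicator": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2021-06-01", periods=4, freq="D"),
)
delay_values = [1, 2]
X_delayed = X.copy()
for delay in delay_values:
    # shift(delay, freq="D") moves the index forward, keeping the values
    X_delayed = X_delayed.join(X.shift(delay, freq="D"), how="outer", rsuffix=f"_{delay}")
# Forward fill only: leading NaNs remain instead of leaking future values backwards
X_delayed = X_delayed.ffill()
print(X_delayed)
```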
With #827,
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'converter__to_convert': False, 'pca__n_components': 0.9, 'regressor__max_depth': 6, 'converter': <class 'covsirphy.regression.reg_rate_converter._RateConverter'>, 'scaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'score_metric': 'R2', 'score_train': 0.9835082445767591, 'score_test': -0.3647405411172536, 'delay': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}
We need to continue improving the models, but the scores are somewhat better with this change.
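Note on the result above: `pca__n_components` is 0.9, a float. When `n_components` is a float in (0, 1), scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction, rather than a fixed count:

```python
# PCA with a float n_components keeps enough components to explain
# at least that fraction of the variance (synthetic data for illustration).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] * 0.99 + rng.normal(scale=0.01, size=100)  # nearly redundant column

pca = PCA(n_components=0.9).fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```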
With #837, example/scenario_analysis.py will be updated to include vaccine data as X.
I thought the test R-squared << 0 was due only to the low performance of the regression models, but that was not correct. We should change the default metric to evaluate the performance of regression models more effectively.
Quoted from https://link.medium.com/yZsE4RoIdhb:
The R-Squared formula compares our fitted regression line to a baseline model. This baseline model is considered the “worst” model. The baseline model is a flat-line that predicts every value of y will be the mean value of y. R-Squared checks to see if our fitted regression line will predict y better than the mean will.
Y (ODE parameter) values of the train+test dataset may show multimodal distributions. Let's say the Y values of the train dataset show a multimodal distribution with mean value 0.3, but the Y values of the test dataset show a unimodal distribution (i.e. subsetted from one peak) with mean value 0.1 and small variance.
Train: R-squared will be positive because the differences between the mean value 0.3 and the samples are large. The baseline model is worse, as expected, so MSE(model) is smaller than MSE(baseline).
Test: R-squared will be negative because the differences between the mean value 0.1 and the samples are small. The baseline model performs well, unexpectedly, so MSE(model) is much larger than MSE(baseline).
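The mechanism follows from R2 = 1 - MSE(model)/MSE(baseline), where the baseline predicts the mean of the evaluated set itself. With toy numbers (hypothetical, chosen to match the scenario above): a model that effectively learned the train mean (0.3) scores far below zero on a test set clustered tightly around a different mean (0.1).

```python
# Toy illustration: predictions pulled toward a train mean of 0.3, evaluated
# on a low-variance test set with mean 0.1, give a strongly negative R2.
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([0.09, 0.10, 0.11, 0.10])  # unimodal, mean 0.1, small variance
y_pred = np.full_like(y_test, 0.30)          # model stuck near the train mean
print(r2_score(y_test, y_pred))
```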
Please refer to https://link.medium.com/yZsE4RoIdhb https://link.medium.com/iGKo8W4Ldhb
For the time being, the MAPE (mean absolute percentage error) score will be used as the evaluation metric because it is scale-independent and easy to interpret.
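For reference, scikit-learn ships this metric as `mean_absolute_percentage_error` (the values below are hypothetical, only illustrating the interpretation: MAPE is the mean of |error|/|true|, regardless of the scale of y):

```python
# MAPE is scale-independent: it averages the absolute error relative to each
# true value, so it reads directly as "off by X% on average".
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([0.10, 0.20, 0.40])
y_pred = np.array([0.12, 0.18, 0.50])
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"{mape:.2%}")
```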
With #839, the accuracy of `Scenario.fit()` can be shown via `Scenario.fit_accuracy()`. The figure and scores (MAPE: 59.40% for training, 58.74% for test) indicate that overfitting is not a problem. Underfitting will be the next problem.
Summary
The decision tree regressor for forecasting shows overfitting: train score (0.857) >> test score (-0.670) in the following example (Italy, today=06Jun2021).
Codes
example/scenario_analysis.py
Outputs
{'best': 'Indicators -> Parameters with Decision Tree Regressor', 'pca': <class 'sklearn.decomposition._pca.PCA'>, 'regressor': <class 'sklearn.tree._classes.DecisionTreeRegressor'>, 'pca_n_components': 1, 'score_metric': 'R2', 'score_train': 0.8566862200815691, 'score_test': -0.6701488680815267, 'delay': 10}
Environment