Another good algorithm for forecasting that can be deployed is SVR along with MultiOutputRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
I tried this locally in order to get more accurate results for my thesis, and if we tweak the `kernel` and `epsilon` parameters, we can obtain very good approximations. I had to set these values manually for the time being (tight deadlines) by trial and error for each country, but these two parameters could be further optimized, for example with Optuna or even with grid search, especially the `epsilon`.
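For reference, a minimal sketch of the kind of setup I mean (the data and the `epsilon` value are only placeholders, not the ones I used for the thesis):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Placeholder data: one indicator column and two ODE-parameter targets
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.column_stack([0.10 + 0.001 * X.ravel(), 0.05 + 0.0005 * X.ravel()])

# kernel and epsilon are the parameters worth tweaking per country;
# the values here are illustrative only
model = MultiOutputRegressor(SVR(kernel="rbf", epsilon=0.01))
model.fit(X, y)
print(model.predict(X[-7:]))  # predictions for the last 7 days, to compare with actuals
```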
Yes, SVR can be a new regressor.
Could you share your process of trial-and-error (or create pull requests if you have time)?
With `GridSearchCV`/`RandomizedSearchCV` or something like that, CovsirPhy will find `epsilon` and the other parameters of SVR automatically.
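For example, something along these lines (the candidate values are placeholders and would need investigation):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Candidate values are placeholders; the actual grid needs investigation
param_grid = {
    "estimator__epsilon": [0.001, 0.01, 0.1, 1.0],
    "estimator__C": [0.1, 1.0, 10.0],
    "estimator__kernel": ["rbf", "linear"],
}
search = GridSearchCV(MultiOutputRegressor(SVR()), param_grid, cv=3)
# search.fit(X_train, y_train) would then expose search.best_params_
```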
One question: how did you evaluate the accuracy of the approximations?
I used 14Apr21 as "today" and 21Apr21 as the last date, and then I compared the forecasted/predicted cases against the actual ones for the last week. By trial and error I mean I tried some epsilon values per country until the forecast approached the actual values quite well, as a proof of concept for a future SVR implementation after the grid search integration.
> I used 14Apr21 as "today" and 21Apr21 as the last date, and then I compared the forecasted/predicted cases against the actual ones for the last week.
That sounds suitable. At the moment, the test dataset is shuffled by `train_test_split(shuffle=True)`. This value should be changed to `False` so that we can use records of continuous days as test data.
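A minimal sketch of the intended change, with placeholder date-indexed data standing in for the actual features/targets:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder date-indexed data standing in for the indicators/ODE parameters
dates = pd.date_range("2021-01-01", periods=100, freq="D")
X = pd.DataFrame({"indicator": np.arange(100)}, index=dates)
y = pd.DataFrame({"rho": np.linspace(0.1, 0.2, 100)}, index=dates)

# shuffle=False keeps chronological order, so the test set becomes the last
# continuous block of days instead of randomly sampled dates
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_test.index.min(), X_test.index.max())
```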
> By trial and error I mean I tried some epsilon values per country until the forecast approached the actual values quite well, as a proof of concept for a future SVR implementation after the grid search integration.
Could you share the exact values of `epsilon` you used, so that we can determine the candidates of `epsilon` for grid search?
I ran the forecast manually for each country with a test SVR model using sklearn's SVR and MultiOutputRegressor (by roughly modifying the ElasticNet .py file; I didn't implement a grid search feature for `epsilon`).
Probably these values change from computer to computer, as I noticed different estimated parameters during `estimate()` between my laptop and Google Colab, for example; the difference depends on the cores and processor, I guess. By default I didn't use the grid option for the delay period, but some countries needed it for more accuracy, so this was another parameter for trial and error that I also tuned manually.
Also, for some countries where the forecast deviated much from the actual values, I tried to reduce the theta range and then reran the estimation, which gave better approximations, mostly for fatal and infected cases, if I remember correctly. All these variables could be part of a grid search or an Optuna-study hyperparameter optimization problem; see the sketch below.
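Roughly, the kind of Optuna study I have in mind would look like this (placeholder data; only `epsilon` and `C` are tuned here, but the delay period and theta range could be added as extra suggestions):

```python
import numpy as np
import optuna
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Placeholder data: 100 days of one indicator vs. two ODE-parameter targets
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.column_stack([0.10 + 0.001 * X.ravel(), 0.05 + 0.0005 * X.ravel()])
y += rng.normal(0, 0.005, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)


def objective(trial):
    # Hypothetical search ranges; the actual candidates would need investigation
    epsilon = trial.suggest_float("epsilon", 1e-4, 1.0, log=True)
    c = trial.suggest_float("C", 1e-2, 1e2, log=True)
    model = MultiOutputRegressor(SVR(kernel="rbf", C=c, epsilon=epsilon))
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```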
I don't know whether decision trees or RNNs would be more efficient, but in any case, at the time I was trying these, the only regressor that existed in CovsirPhy was ElasticNet, and that's why I tried SVR, hoping for better results.
I chose the following configurations:
(Due to the thesis deadlines I had to stop working on side development features and on fully implementing a general mechanism, and instead tested such an SVR model locally however I could in order to get the results.)
Also, if you would like and have some time, please have a look at my thesis (the latest version so far, probably the final one); I hope I have included the citation for CovsirPhy correctly. https://drive.google.com/file/d/14OOV1q_wxLupGlXx6cSBDP2dnVURm6wQ/view?usp=sharing
Unfortunately it is in Greek though.
After I finish and submit it (sometime in July), and when I have free time after that, I will try to continue participating in the project.
Thank you for sharing your thesis, excellent work! I read it, translating it with Google. Thanks to your hard work analysing data from many countries, the ideas and discussions are well explained. If an English version becomes available, it could be a paper of the CovsirPhy community. (As a thesis, to emphasize your own work, it may be better to separate the contents of the Kaggle Notebook from your ideas/discussions/implementations.)
Your participation is always welcomed in this project :-)
Also, I confirmed that parameter-tuned SVR has advantages. I will implement a feature later so that we can use SVR in CovsirPhy. In the latest version 2.20.2, a decision tree regressor is available, which was not included in 2.19.0.
> By default I didn't use the grid option for the delay period, but some countries needed it for more accuracy, so this was another parameter for trial and error that I also tuned manually.
We need to investigate how prediction accuracy tends to change with the delay period. A lower delay value may give better accuracy at the simulation level (Confirmed/Fatal/Recovered), but sometimes returns a low test score.
> Probably these values change from computer to computer, as I noticed different estimated parameters during `estimate()` between my laptop and Google Colab, for example; the difference depends on the cores and processor, I guess.
Yes, the accuracy/speed of parameter estimation depends on the performance of processors. Improvement of the estimation algorithm is still an ongoing issue in our project. As you mentioned, the value range of theta could be changed from (0, 0.5).
SVR was implemented with #803. The details could be revised later.
@Inglezos, was feature selection (or PCA) done just before SVR?
No, I didn't use PCA before SVR.
We may consider using PCA and similar techniques to avoid overfitting.
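For example, a possible shape of such a regressor (the variance threshold and SVR settings below are illustrative, not decided):

```python
from sklearn.decomposition import PCA
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale the indicators, keep the components explaining 95% of the variance,
# then fit one SVR per ODE parameter; thresholds/settings are placeholders
pca_svr = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    MultiOutputRegressor(SVR(kernel="rbf")),
)
# pca_svr.fit(X_train, y_train); pca_svr.predict(X_test)
```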
With #840, users can specify regressors with the `regressors` argument of `Scenario.fit()` and abbreviations of regressor names. When `regressors=["en", "svr"]`, only Elastic Net and SVR will be used. By default, `regressors=None` and all registered regressors are used.
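A rough usage sketch (the surrounding workflow follows the usual CovsirPhy 2.x steps; only the `regressors` argument of `Scenario.fit()` is the new part, and the other method names are from memory and may differ by version):

```python
import covsirphy as cs

# Usual workflow: load datasets and register them to a scenario
loader = cs.DataLoader("input")
snl = cs.Scenario(country="Japan")
snl.register(loader.jhu(), extras=[loader.oxcgrt()])
snl.trend()
snl.estimate(cs.SIRF)

# New part: restrict regression to Elastic Net and SVR via their abbreviations
snl.fit(regressors=["en", "svr"])
snl.predict()
snl.simulate()
```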
Improvement of regression and parameter estimation will be discussed in new issues.
Summary of this new feature
Currently, we use Elastic Net and decision tree regression when forecasting, and the best algorithm is selected automatically by referring to test scores.
With the new enhancement, we can specify the algorithm to use via an argument of `Scenario.fit()`. This opens up new possibilities for investigating the rationale of forecasting by hand.