lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[New] select algorithms of forecasting #795

Closed lisphilar closed 3 years ago

lisphilar commented 3 years ago

Summary of this new feature

We currently use Elastic Net and decision tree regression for forecasting, and the best algorithm is selected automatically by referring to test scores.

With this enhancement, we can specify the algorithm to use via an argument of Scenario.fit(). This will make it possible to investigate the rationale of forecasts by hand.

Inglezos commented 3 years ago

Another good algorithm for forecasting that can be deployed is SVR along with MultiOutputRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

I tried this locally in order to get more accurate results for my thesis purposes and if we tweak the kernel and epsilon parameters, we can obtain very good approximations. I had to manually set these values for the time being (tight deadlines) by trial-and-error for each country, but these two parameters could be further optimized for example with optuna or even with grid search, especially for the epsilon.
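A minimal sketch of the idea above, assuming scikit-learn and synthetic stand-in data: SVR is wrapped in MultiOutputRegressor and the kernel, C and epsilon values are chosen by grid search. The data shapes, candidate values and the use of TimeSeriesSplit are assumptions for illustration, not the code used for the thesis or in CovsirPhy.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 120 days, 5 indicator columns,
# and 4 target columns standing in for ODE parameter values.
rng = np.random.default_rng(0)
X = rng.random((120, 5))
W = rng.random((5, 4))
Y = X @ W + 0.01 * rng.standard_normal((120, 4))

# SVR predicts a single target, so the wrapper fits one SVR per target column.
model = MultiOutputRegressor(SVR())

# Parameters of the wrapped SVR are addressed with the "estimator__" prefix.
param_grid = {
    "estimator__kernel": ["linear", "rbf"],
    "estimator__C": [1, 10, 100],
    "estimator__epsilon": [0.001, 0.01, 0.05, 0.1],
}

# TimeSeriesSplit keeps every validation fold later in time than its training fold.
search = GridSearchCV(model, param_grid, cv=TimeSeriesSplit(n_splits=3), scoring="r2")
search.fit(X, Y)

best_params = search.best_params_
Y_pred = search.predict(X[-7:])  # predictions from the refitted best model
print(best_params)
print(Y_pred.shape)
```

The same grid could be handed to Optuna instead; grid search is used here only because it needs no extra dependency.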

lisphilar commented 3 years ago

Yes, SVR can be a new regressor. Could you share your trial-and-error process (or create pull requests if you have time)? With GridSearchCV/RandomizedSearchCV or something similar, CovsirPhy will find epsilon and the other parameters of SVR automatically.

lisphilar commented 3 years ago

One question: how did you evaluate the accuracy of the approximations?

Inglezos commented 3 years ago

I used 14Apr21 as today and 21Apr21 as the last date, and then I compared the forecasted/predicted cases against the actual ones for the last week. By trial and error I mean that I tried some epsilon values per country until the forecast approached the actual values quite well, as a proof of concept of a future SVR implementation after the grid search integration.
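The evaluation described here can be sketched as a hold-out comparison over the final week. The case counts below are made up for illustration, and the choice of mean absolute percentage error (MAPE) as the metric is an assumption; the thread does not say which metric was used.

```python
from datetime import date, timedelta


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


today = date(2021, 4, 14)      # last day used for fitting
last_date = date(2021, 4, 21)  # last day of the forecast

# The held-out days run from the day after "today" through "last_date".
n_days = (last_date - today).days
holdout_days = [today + timedelta(days=i) for i in range(1, n_days + 1)]

# Made-up numbers for illustration only (not real case counts).
actual = [1000, 1100, 1200, 1300, 1400, 1500, 1600]
forecast = [1010, 1080, 1230, 1290, 1450, 1480, 1650]

score = mape(actual, forecast)
print(holdout_days[0], holdout_days[-1], round(score, 2))
```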

lisphilar commented 3 years ago

I used 14Apr21 as today and 21Apr21 as the last date, and then I compared the forecasted/predicted cases against the actual ones for the last week.

That is suitable. At present, the test dataset is shuffled by train_test_split(shuffle=True). This should be changed to shuffle=False so that we can use records of consecutive days as test data.
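A small illustration of the difference, assuming scikit-learn and toy data with one row per day: with shuffle=False, train_test_split keeps the row order and takes the test set from the end, so the test records are consecutive days.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): one row per day, in chronological order.
days = np.arange(20)
X = days.reshape(-1, 1)
y = days * 2

# shuffle=False preserves row order, so the test set is the final block of days.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.ravel())
print(X_test.ravel())
```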

By trial and error I mean that I tried some epsilon values per country until the forecast approached the actual values quite well, as a proof of concept of a future SVR implementation after the grid search integration.

Could you share the exact values of epsilon you used, so that we can determine the candidate values of epsilon for grid search?

Inglezos commented 3 years ago

I ran the forecast manually for each country with a test SVR model using sklearn's SVR and MultiOutputRegressor (by making throwaway modifications to the ElasticNet .py file; I didn't implement a grid search feature for epsilon).

Probably these change from computer to computer, as I noticed different estimated parameters during `estimate()` between my laptop and Google Colab for example, a difference that depends on the cores and processor I guess. By default I didn't use the grid option for the delay period but some countries needed it for more accuracy, so this was another parameter for trial-error that I also tried manually.

Also, for some countries where the forecast deviated much from the actual, I tried to reduce the theta range and then reran the estimation, which gave better approximation for fatal cases and infected mostly if I remember correctly. All these variables could be part of a grid search or Optuna-study hyperparameter optimization problem.

I don't know if decision trees or RNNs are more efficient though, but in any case at the time I was trying these the only regressor that existed in CovsirPhy was ElasticNet and that's why I tried SVR hoping for better results.

I chose the following configurations:

  1. greece -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=10, gamma=0.1, epsilon=0.065
  2. china -> delay=(7,31) -> theta = [0.0, 0.1] -> kernel="linear", C=1, gamma=0.1, epsilon=0.000001
  3. australia -> not grid delay -> theta = [0.0, 0.1] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.01
  4. austria -> not grid delay -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.1 or epsilon=0.02
  5. brazil -> not grid delay -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.01
  6. czech republic -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.001
  7. germany -> not grid delay -> theta = [0.0, 0.1] -> kernel="linear", C=100, gamma=0.01, epsilon=0.1
  8. france -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=0.1, gamma=0.1, epsilon=0.1
  9. hungary -> not grid delay -> theta = [0.0, 0.001] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.01-0.0125 or ~0.035
  10. india -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.05
  11. italy -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.04
  12. japan -> not grid delay -> theta = [0.0, 0.1] -> kernel="linear", C=100, gamma=0.1, epsilon=0.05
  13. netherlands -> not grid delay -> theta = [0.0, 0.5] -> kernel="linear", C=100, gamma=0.1, epsilon=0.0175
  14. sweden -> not grid delay -> theta = [0.0, 0.001] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.05
  15. united kingdom -> delay=(7,31) -> theta = [0.0, 0.1] -> kernel="linear", C=100, gamma=0.1, epsilon=0.01
  16. usa -> delay=(7,31) -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.01
  17. poland -> not grid delay -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.05
  18. russia -> not grid delay -> theta = [0.0, 0.5] -> kernel="linear", C=100, gamma=0.1, epsilon=0.0125 or 0.02
  19. south africa -> not grid delay -> theta = [0.0, 0.5] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.001
  20. spain -> not grid delay -> theta = [0.0, 0.5] -> kernel="linear", C=100, gamma=0.1, epsilon=0.01
  21. switzerland -> not grid delay -> theta = [0.0, 0.1] -> kernel="rbf", C=100, gamma=0.1, epsilon=0.05
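One way to turn the list above into grid search candidates is to collect the distinct values of each SVR parameter. The sketch below transcribes the kernel, C, gamma and epsilon values from the list; where an entry gives alternatives or a range (austria, hungary, russia), each stated value or range endpoint is kept as a separate candidate, which is an assumption. The "estimator__" key prefix assumes the SVR is wrapped in scikit-learn's MultiOutputRegressor.

```python
# (country, kernel, C, gamma, epsilon candidates), transcribed from the list.
configs = [
    ("greece", "rbf", 10, 0.1, [0.065]),
    ("china", "linear", 1, 0.1, [0.000001]),
    ("australia", "rbf", 100, 0.1, [0.01]),
    ("austria", "rbf", 100, 0.1, [0.1, 0.02]),
    ("brazil", "rbf", 100, 0.1, [0.01]),
    ("czech republic", "rbf", 100, 0.1, [0.001]),
    ("germany", "linear", 100, 0.01, [0.1]),
    ("france", "rbf", 0.1, 0.1, [0.1]),
    ("hungary", "rbf", 100, 0.1, [0.01, 0.0125, 0.035]),
    ("india", "rbf", 100, 0.1, [0.05]),
    ("italy", "rbf", 100, 0.1, [0.04]),
    ("japan", "linear", 100, 0.1, [0.05]),
    ("netherlands", "linear", 100, 0.1, [0.0175]),
    ("sweden", "rbf", 100, 0.1, [0.05]),
    ("united kingdom", "linear", 100, 0.1, [0.01]),
    ("usa", "rbf", 100, 0.1, [0.01]),
    ("poland", "rbf", 100, 0.1, [0.05]),
    ("russia", "linear", 100, 0.1, [0.0125, 0.02]),
    ("south africa", "rbf", 100, 0.1, [0.001]),
    ("spain", "linear", 100, 0.1, [0.01]),
    ("switzerland", "rbf", 100, 0.1, [0.05]),
]

# Distinct values per parameter become the grid candidates.
# Note: SVR ignores gamma when kernel="linear".
param_grid = {
    "estimator__kernel": sorted({kernel for _, kernel, _, _, _ in configs}),
    "estimator__C": sorted({C for _, _, C, _, _ in configs}),
    "estimator__gamma": sorted({gamma for _, _, _, gamma, _ in configs}),
    "estimator__epsilon": sorted({e for _, _, _, _, eps in configs for e in eps}),
}

for name, candidates in param_grid.items():
    print(name, candidates)
```

The resulting dictionary has the shape expected by the param_grid argument of GridSearchCV.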

(Due to the thesis deadlines, I had to stop developing side features or fully implementing a general mechanism, and I tested such an SVR model locally however I could in order to get the results.)

Inglezos commented 3 years ago

Also, if you would like and have some time, please have a look at my thesis (the latest version at least until now, probably the final one), I hope I have included the citation for CovsirPhy correctly. https://drive.google.com/file/d/14OOV1q_wxLupGlXx6cSBDP2dnVURm6wQ/view?usp=sharing

Unfortunately it is in Greek though.

After I finish and submit it (sometime in July), and when I have free time after that, I will try to continue participating in the project.

lisphilar commented 3 years ago

Thank you for sharing your thesis, excellent work! I read it by translating it with Google. Thanks to your hard work analysing data from many countries, the ideas and discussions are well explained. If an English version becomes available, it could be a paper of the CovsirPhy community. (As a thesis, to emphasize your work, it may be better to separate the contents of the Kaggle Notebook from your ideas/discussions/implementations.)

Your participation is always welcomed in this project :-)

Also, I confirmed that parameter-tuned SVR has advantages. I will implement a feature later so that we can use SVR in CovsirPhy. In the latest version, 2.20.2, a decision tree regressor is available, which was not included in 2.19.0.

By default I didn't use the grid option for the delay period but some countries needed it for more accuracy, so this was another parameter for trial-error that I also tried manually.

We need to investigate how prediction accuracy changes with the delay period. A lower delay value may give better accuracy at the simulation level (Confirmed/Fatal/Recovered), but it sometimes returns a low test score.

Probably these change from computer to computer, as I noticed different estimated parameters during `estimate()` between my laptop and Google Colab for example, a difference that depends on the cores and processor I guess.

Yes, the accuracy and speed of parameter estimation depend on processor performance. Improving the estimation algorithm is still an ongoing issue in our project. As you mentioned, the value range of theta could be changed from (0, 0.5).

lisphilar commented 3 years ago

SVR was implemented with #803. The details could be revised later.

lisphilar commented 3 years ago

@Inglezos, was feature selection (or PCA) done just before SVR?

Inglezos commented 3 years ago

No, I didn't use PCA before SVR.

lisphilar commented 3 years ago

We may consider using PCA or similar techniques to avoid overfitting.
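As a rough sketch of what this could look like, assuming scikit-learn and random stand-in data: the features are standardized, reduced with PCA, and then passed to SVR inside a single pipeline. The 95% explained-variance threshold and the SVR settings are illustrative assumptions, not values chosen by the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Random stand-in data (assumption): 100 days, 10 indicators, 4 targets.
rng = np.random.default_rng(0)
X = rng.random((100, 10))
Y = rng.random((100, 4))

pipeline = make_pipeline(
    StandardScaler(),        # PCA is scale-sensitive, so standardize first
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    MultiOutputRegressor(SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.05)),
)

# Fit on all but the last 7 days, then predict the held-out week.
pipeline.fit(X[:-7], Y[:-7])
Y_pred = pipeline.predict(X[-7:])

n_kept = pipeline.named_steps["pca"].n_components_
print(n_kept, Y_pred.shape)
```

Keeping the scaler and PCA inside the pipeline means they are fitted on the training rows only, which avoids leaking information from the test period.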

lisphilar commented 3 years ago

With #840, users can specify regressors with the regressors argument of Scenario.fit(), using abbreviations of regressor names. When regressors=["en", "svr"], only Elastic Net and SVR will be used. By default, regressors=None and all registered regressors are used.
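The selection behaviour described above can be illustrated with a small standalone sketch. This is not CovsirPhy's implementation: the registry, the function name select_regressors and the "dt" abbreviation are hypothetical; only "en" and "svr" are taken from this thread.

```python
# Hypothetical registry mapping abbreviations to regressor names.
REGISTERED = {
    "en": "Elastic Net",
    "dt": "decision tree",  # "dt" is an assumed abbreviation
    "svr": "SVR",
}


def select_regressors(regressors=None):
    """Return the regressors to try, keyed by abbreviation.

    None means "use every registered regressor".
    """
    if regressors is None:
        return dict(REGISTERED)
    unknown = [name for name in regressors if name not in REGISTERED]
    if unknown:
        raise KeyError(f"Unknown regressor abbreviation(s): {unknown}")
    return {name: REGISTERED[name] for name in regressors}


selected = select_regressors(["en", "svr"])
everything = select_regressors()
print(selected)
print(everything)
```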

lisphilar commented 3 years ago

Improvement of regression and parameter estimation will be discussed in new issues.