Germany analysis - weird infected cases and predictions

Inglezos commented 3 years ago

Summary

For Germany, the infected cases parsed from the dataset are, for some reason, totally wrong, so the predictions are way off. The code I ran is:

Codes and outputs:

   deu_scenario = cs.Scenario(jhu_data, pop_data, "Germany")
   deu_scenario.records().tail()
   _ = deu_scenario.trend()
   deu_scenario.summary()

[Figures: germany_source_data_graph, germany_trends_graph]

   deu_scenario.estimate(cs.SIRF)
   deu_scenario.summary()
   _ = deu_scenario.history_rate()
   deu_scenario.clear()
   deu_scenario.add(days=7)
   deu_pred = deu_scenario.simulate()  # observe deu_pred

[Figure: germany_weird_predictions_graph]

[UPDATE 27Oct20]: Scenario.trend() does not work at all for Germany, see the runtime error at #282.

Environment

CovsirPhy version: 2.9.1
Python version: 3.8
Installation: Anaconda, Spyder, pip
OS: Windows

lisphilar commented 3 years ago

After fixing the runtime error in #284, I ran the code in https://github.com/lisphilar/covid19-sir/blob/master/example/scenario_analysis_deu.py

Note on trend analysis: deu_trend

In the last phase, Susceptible is sharply decreasing while Recovered is increasing slightly. Scenario.trend() did not detect the change point inside the last phase.

Scenario.trend() has a min_size argument (default: 7), and Scenario.trend(min_size=5) detected the internal change point in the phase. min_size is the minimum size of a phase.

deu_trend

Note on parameter estimation: a part of the summary was as follows (when min_size=7).

Start	End	RMSLE	Trials	Runtime
24-Jan-20	20-Mar-20	7.249912	298	1 min 2 sec
21-Mar-20	1-Apr-20	1.237492	207	1 min 6 sec
2-Apr-20	17-Apr-20	1.181668	233	1 min 8 sec
18-Apr-20	10-May-20	0.801548	449	1 min 1 sec
11-May-20	11-Jul-20	0.598836	435	1 min 2 sec
12-Jul-20	16-Aug-20	0.403211	464	1 min 1 sec
17-Aug-20	7-Sep-20	0.118004	584	1 min 1 sec
8-Sep-20	23-Sep-20	0.068422	628	1 min 1 sec
24-Sep-20	6-Oct-20	1.994347	1709	1 min 0 sec
7-Oct-20	27-Oct-20	0.147297	1047	1 min 3 sec

The iteration of parameter estimation ended at the time limit (default: 60 sec + alpha) before the RMSLE scores improved sufficiently. Please try to prolong the time limit, for example with Scenario.estimate(cs.SIRF, timeout=120).
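For readers unfamiliar with the RMSLE column in the summary above, here is a minimal sketch of the standard definition (root mean squared logarithmic error). This is only the textbook formula; exactly which variables CovsirPhy feeds into its score is not stated in this thread, and the function name and sample numbers are hypothetical.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so zero counts are handled without errors.
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical case counts, for illustration only.
observed = [100, 200, 400]
perfect_fit = [100, 200, 400]
poor_fit = [10, 20, 40]  # roughly one order of magnitude too low

print(rmsle(observed, perfect_fit))
print(rmsle(observed, poor_fit))
```

Because the error is measured on a log scale, a simulation that is off by about a factor of ten scores a little above 2, which gives a rough feel for how far off the 7.25 in the first phase is.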

We need to document the arguments outside the API reference and improve CovsirPhy. Scenario.estimate(cs.SIRF, timeout=120) depends on the Estimator class in https://github.com/lisphilar/covid19-sir/blob/master/covsirphy/simulation/estimator.py

Inglezos commented 3 years ago

So, for Germany, the trends analysis did not detect the change point inside the last phase. During that, you say min_size was at the default value of 7. But if you changed that to a smaller value, 5, then that change point was detected? How did that happen? What does min_size value mean, how is this used?
But isn't this value something that the trends class should automatically optimize?
So, what we have to do to solve all these issues, is to find the root cause at the Estimator's class and improve it? I think that the cause is linked directly with the other bug of #282 issue for the trends/ChangeFinder class, because how effectively the trends analysis is done affects directly the estimator. What is the call tree for the estimator?
Maybe you should pin 274 and 282? I think currently these are critical bugs that need to be solved immediately. Or create one or two more general open issues with the specific details of the bugs and the reworks need to be done. One regarding the Estimator class for the parameters estimation's RMSLE scores and one regarding the trends analysis with the ChangeFinder class and the trends' RMSLE scores.

lisphilar commented 3 years ago

During that, you say min_size was at the default value of 7. But if you changed that to a smaller value, 5, then that change point was detected? How did that happen? What does min_size value mean, how is this used?

min_size is the minimum size of a phase, as I mentioned. ChangeFinder.run() splits the series of dates with the ruptures package (discussed in #3) and combines short phases whose length is smaller than min_size. When I created the S-R trend analysis, I set the default value to 7 to avoid over-fitting (the case count data is noisy) and to cut the total runtime of parameter estimation. As long as the value is over 2, we can change the default value for our analysis.
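To make the effect of min_size concrete, here is a rough sketch of the "combine short phases" step described above. It is not the ChangeFinder implementation; the function name, the boundary representation (day indices), and the sample change points are all assumptions made for illustration.

```python
def merge_short_phases(boundaries, min_size):
    """Drop change points that would create a phase shorter than min_size.

    boundaries is a sorted list of day indices [start, cp1, ..., end];
    phase i spans boundaries[i] (inclusive) to boundaries[i + 1] (exclusive).
    """
    if len(boundaries) < 2:
        raise ValueError("need at least a start and an end")
    kept = [boundaries[0]]
    for point in boundaries[1:-1]:
        # Keep a change point only if the phase it closes is long enough;
        # skipping it merges the short phase with the one that follows.
        if point - kept[-1] >= min_size:
            kept.append(point)
    end = boundaries[-1]
    # A too-short final phase is merged into the previous phase instead.
    if len(kept) > 1 and end - kept[-1] < min_size:
        kept.pop()
    kept.append(end)
    return kept


# Hypothetical change points (day indices) proposed by a detector.
candidates = [0, 10, 14, 30, 35, 41]
print(merge_short_phases(candidates, min_size=7))
print(merge_short_phases(candidates, min_size=5))
```

With these made-up numbers, min_size=7 swallows the change point at day 35, while min_size=5 keeps it. That is the same kind of behaviour as the Germany case, where lowering min_size exposed a change point inside the last phase.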

But isn't this value something that the trends class should automatically optimize?

Yes, it should, but how would the class do that? This means we need to create an algorithm that keeps a good balance between accuracy, over-fitting (strictly speaking, this term is incorrect because we do hyperparameter estimation, not prediction) and runtime.

So, what we have to do to solve all these issues, is to find the root cause at the Estimator's class and improve it? I think that the cause is linked directly with the other bug of #282 issue for the trends/ChangeFinder class, because how effectively the trends analysis is done affects directly the estimator. What is the call tree for the estimator?

This will be discussed in #291

Maybe you should pin 274 and 282? I think currently these are critical bugs that need to be solved immediately. Or create one or two more general open issues with the specific details of the bugs and the reworks need to be done. One regarding the Estimator class for the parameters estimation's RMSLE scores and one regarding the trends analysis with the ChangeFinder class and the trends' RMSLE scores.

We will handle #274 and #291 with high priority.

lisphilar commented 3 years ago

Note: This is not about the issue you reported in the first comment of this issue page, but weird numbers of recovered cases lead to failures of the S-R trend analysis. We should solve the issues regarding weird data in some countries, including Greece.

lisphilar commented 3 years ago

With version 2.10.0-mu, the output is https://gist.github.com/lisphilar/47cb6e26a1fa3ee488aa6acbb1a4fb41#file-germany_14nov2020-ipynb

This issue is ongoing for Germany.

Inglezos commented 3 years ago

The simulated cases for Germany are quite accurate for the previous days. Nevertheless, the curves seem to be problematic. As I notice, the source data are weird. Specifically, the infected cases. Until mid-October they are not even 100 !?? I don't know how we can handle something like this or if we should bother at all. But what we must do is the following: I noticed that for Germany (I don't know if this happens for other countries as well), the records begin from January 3rd! This is impossible, considering that the first cases appeared in China on January 22! We must put a lower limit for the first cases date and use China's. Or a more advanced method would sum all the cases prior to 22Jan and use this sum as the first cases record.
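A rough sketch of the second proposal (fold everything before the cutoff into the first record) might look like the following. It assumes the records hold daily new cases; if they are cumulative totals, simply dropping the earlier rows would be enough. The cutoff date comes from the comment above, while the function name and the sample records are hypothetical.

```python
from datetime import date

# Assumed cutoff, taken from the date mentioned in the comment above.
FIRST_VALID_DATE = date(2020, 1, 22)


def clip_early_records(records, first_date=FIRST_VALID_DATE):
    """Fold daily new cases reported before first_date into first_date.

    records is a list of (day, new_cases) tuples sorted by day.
    """
    early_total = sum(cases for day, cases in records if day < first_date)
    kept = [(day, cases) for day, cases in records if day >= first_date]
    if early_total == 0:
        return kept
    if kept and kept[0][0] == first_date:
        # A record already exists on the cutoff date: add the early cases.
        kept[0] = (first_date, kept[0][1] + early_total)
    else:
        # Otherwise create a record on the cutoff date for the early cases.
        kept.insert(0, (first_date, early_total))
    return kept


# Hypothetical daily new cases, for illustration only.
sample = [
    (date(2020, 1, 3), 1),
    (date(2020, 1, 10), 2),
    (date(2020, 1, 22), 4),
    (date(2020, 1, 23), 5),
]
print(clip_early_records(sample))
```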

lisphilar commented 3 years ago

Can we close this issue with the latest development version and #339 so that we can release stable version 2.13.0?

Inglezos commented 3 years ago

Yes we can close this issue now.
