[Discuss] set phases and estimate parameter values repeatedly to improve accuracy

lisphilar commented 3 years ago

Summary of this new feature

At this time, users set phases with Scenario.trend(), estimate parameter values with Scenario.estimate() and show the results. Users can adjust the phase settings manually and try parameter estimation again. However, this is not automated and is inefficient when we want to analyse the records of many countries.

(Optional) Solution

A new method, Scenario.trend_estimate(model, timeout=120, max_iteration=3, **kwargs), performs the following steps, as mentioned in #272.

1. S-R trend analysis with all records
2. Parameter estimation in each phase with timeout
3. S-R trend analysis with the phases in which estimation failed
4. Parameter estimation in the new phases with timeout
5. Repeat steps 3 and 4.
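The steps above can be sketched as a loop. This is only an illustration of the proposed control flow, not the covsirphy implementation: the function name trend_estimate_sketch, the (start, end) phase representation, and the split and estimate callables are all hypothetical stand-ins for S-R trend analysis and for parameter estimation with a timeout.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical phase representation: (start index, end index), both inclusive.
Phase = Tuple[int, int]


def trend_estimate_sketch(
    n_records: int,
    split: Callable[[Phase], List[Phase]],
    estimate: Callable[[Phase], bool],
    max_iteration: int = 3,
) -> Tuple[Dict[Phase, bool], int]:
    """Repeat phase splitting and estimation for phases whose estimation failed.

    split: stand-in for S-R trend analysis on a range of records (assumption).
    estimate: stand-in for parameter estimation with a timeout; it returns
        True on success and False on failure (assumption).
    Returns the success flag of each final phase and the number of iterations.
    """
    results: Dict[Phase, bool] = {}
    pending: List[Phase] = [(0, n_records - 1)]  # step 1 starts from all records
    iterations = 0
    while pending and iterations < max_iteration:
        iterations += 1
        failed: List[Phase] = []
        for target in pending:
            # A failed phase is replaced by the phases it is split into.
            results.pop(target, None)
            for phase in split(target):  # steps 1 and 3: trend analysis
                succeeded = estimate(phase)  # steps 2 and 4: estimation
                results[phase] = succeeded
                if not succeeded:
                    failed.append(phase)
        pending = failed  # step 5: repeat only with the failed phases
    return results, iterations
```

Phases that still fail when max_iteration is reached stay in the result with a False flag, so the caller can decide whether to adjust them manually.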

Inglezos commented 3 years ago

How do we know when estimation has failed indeed? If we see these non-monotonic spikes in the simulated cases (for simulated confirmed/deaths/recovered cases) ? Before the estimator, during the trends analysis, couldn't we know that? I mean, isn't there any sign of such troublesome behavior in the trend analysis results? Is there any kind of score for how well the trend analysis divided the time series into phases?
What if then, during trends analysis, we would override the automatic process of ruptures phase division and subdivide the problematic phase into two phases (by telling ruptures to use one more point of change than the previous time) ? Could that solve the spikes issue?
Of course a manual phase setting feature would be useful, independently of the spikes problem.

lisphilar commented 3 years ago

How do we know when estimation has failed indeed? If we see these non-monotonic spikes in the simulated cases (for simulated confirmed/deaths/recovered cases) ?

We have created Scenario.score(). This will be used with a user-defined RMSLE value. (RMSLE appears to be robust.) Before we release the new stable version, it is necessary to find a good default value through manual trials with Scenario.score(), Scenario.history("Infected") etc. Alternatively, we can continue splitting phases and estimating parameters until the RMSLE score stabilizes, i.e. when the scores of successive trials are 0.3, 0.2, 0.1, 0.99, 0.99,..., we will stop after the 5th trial.
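As a rough sketch of the "stop when the score is stable" idea: the rmsle function below follows the standard definition of root mean squared logarithmic error, while trials_until_stable and its tolerance parameter are hypothetical names and a hypothetical threshold chosen for illustration. Neither is part of covsirphy.

```python
import math
from typing import Sequence


def rmsle(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared logarithmic error for non-negative values."""
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


def trials_until_stable(scores: Sequence[float], tolerance: float = 0.01) -> int:
    """Return the 1-based trial number at which the score stopped changing.

    A score counts as stable when it differs from the previous trial's score
    by at most `tolerance` (a hypothetical default). If the scores never
    stabilize, the total number of trials is returned.
    """
    for i in range(1, len(scores)):
        if abs(scores[i] - scores[i - 1]) <= tolerance:
            return i + 1
    return len(scores)
```

With the scores quoted above, trials_until_stable([0.3, 0.2, 0.1, 0.99, 0.99]) gives 5, which matches stopping after the 5th trial.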

Before the estimator, during the trends analysis, couldn't we know that?

That would be desirable, but I do not have any ideas at the moment. Please share yours.

What if then, during trends analysis, we would override the automatic process of ruptures phase division and subdivide the problematic phase into two phases (by telling ruptures to use one more point of change than the previous time) ? Could that solve the spikes issue?

I will try it and evaluate the solution.
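One library-independent way to try the idea of adding one more change point to a problematic phase is a brute-force search for the single split that minimizes the summed squared deviation from each segment's mean (an L2 cost). This is a stand-in sketch, not the ruptures-based division used by the package. The name split_phase, the min_size parameter and the choice of a mean-shift cost are assumptions, and the input is whatever one-dimensional signal the trend analysis works on.

```python
from typing import Sequence


def _l2_cost(values: Sequence[float]) -> float:
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_phase(values: Sequence[float], min_size: int = 2) -> int:
    """Return k so that values[:k] and values[k:] have the least total L2 cost.

    min_size is a hypothetical lower bound on the length of each new phase.
    """
    if len(values) < 2 * min_size:
        raise ValueError("phase is too short to subdivide")
    candidates = range(min_size, len(values) - min_size + 1)
    return min(
        candidates,
        key=lambda k: _l2_cost(values[:k]) + _l2_cost(values[k:]),
    )
```

The phase would then be replaced by the two sub-phases, and parameter estimation would be run again on each of them.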

Of course a manual phase setting feature would be useful, independently of the spikes problem.

We have manual tools (separate/combine/add/delete) as explained in https://lisphilar.github.io/covid19-sir/usage_phases.html

Inglezos commented 3 years ago

Regarding improvement of estimation accuracy. For estimating Rt I found this interesting article: https://www.datacamp.com/community/tutorials/replicating-in-r-covid19 which includes the original link: http://systrom.com/blog/the-metric-we-need-to-manage-covid-19/ and has the following notebook: https://github.com/k-sys/covid-19/blob/master/Realtime%20R0.ipynb

It states among other things:

(Rt can be estimated as) the number of people who become infected per infectious person at time 𝑡

which for us could mean Confirmed.diff()/Infected. Could it be that simple? Please take a look at this; it might be very useful to include this method in order to drastically improve the accuracy of parameter estimation.

lisphilar commented 3 years ago

Thank you for your information!

Rt definition: Yes, but this does not mean Confirmed.diff()/Infected. This is because most "Infected" cases are quarantined and do not infect susceptible people. df = jhu_data.subset("Japan"); (df["Confirmed"].diff()/df["Infected"]).mean() returns 0.09, whereas Rt in Japan is around 1.0. To get Rt values with this definition directly, contact tracing data is required, but I think no country opens such a dataset for OSS use.
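The calculation above can be reproduced on a toy table to see what the expression actually measures. The numbers below are invented for illustration and are not real records; only the column names follow the comment. The result is the number of newly confirmed cases per currently infected case per day, which is a daily rate rather than a reproduction number.

```python
import pandas as pd

# Toy cumulative records; the values are made up for illustration only.
df = pd.DataFrame(
    {
        "Confirmed": [100, 110, 121, 133],
        "Infected": [50, 55, 60, 60],
    }
)

# Daily new confirmed cases; the first row has no previous day and is NaN.
daily_new = df["Confirmed"].diff()

# New cases per currently infected case for each day.
ratio = daily_new / df["Infected"]

# Series.mean() skips the NaN in the first row by default.
mean_ratio = ratio.mean()
```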

Rt estimation with a Poisson model: I saw many notebooks related to this approach in Feb 2020 - May 2020 (when Kaggle opened competitions regarding COVID-19), but as far as I can see they stopped being updated for unknown reasons. However, it would be a great idea to compare Rt values estimated with SIR models and those estimated with a Poisson model. As a first step, could you try rewriting the notebook on GitHub that you mentioned?

lisphilar commented 3 years ago

Repeated trend analysis and the related approaches are discussed in #670.

lisphilar / covid19-sir

[Discuss] set phases and estimate parameter values repeatedly to improve accuracy #354
