[New] alternative approach of estimation: estimate tau and ODE parameters separately

lisphilar commented 3 years ago

Summary of question

We use hyperparameter optimization with Optuna to estimate ODE parameter values with the number of cases.

To improve accuracy and speed, is it possible to develop the following idea?

We know the number of days and initial/final values of the number of cases in each phase, but we do not know parameter values.

set tau value suggested by Optuna
create a dataset of parameters with random values
solve initial value problems of ODEs using this parameter dataset
learn the relationship of final values calculated by the previous step (X) and parameter values (y)
estimate parameter values with the known (actual) final values
repeat to determine tau and parameter values

It's just an idea, but could be a starting point of discussion to improve our approach. This is related to #698.

rebeccadavidsson commented 3 years ago

In step 4, what method would be used to learn the relationship of X and y? And could you explain step 5 with more detail? For step 5, are you pointing on training a model that could learn from the known actual final values and X?

lisphilar commented 3 years ago

For step 4 and 5, I considered to use regression models to learn the relationship of X and y. Then, regard the actual final values as X_target, and estimate y_target with the regression models. Yes, it may be too much to apply machine larning for this purpose.

lisphilar commented 3 years ago

Updated my first idea. For step 2-5 above, scipy.optimize.minimize could be used. Scipy estimate ODE parameter values and Optuna estimate tau value for the scenario.

When I created Estimator, I selected only Optuna, not Scipy, for parameter estimation because I did not have ideas to calculate initial guesses of kappa and theta. As I said in #698 and #711, initial guesses of kappa and theta can be "(dF/dt)/Infected" and 0 respectively.

I will create an experimental class for that this weekend to try this idea.

lisphilar commented 3 years ago

I tried scipy.optimize.minimize to minimize RMSLE score with optimizing parameter values, but parameter search did not end. I will try the following approach at this time with a new class.

Optimize tau value with all phases using initial guess of parameters, using Optuna.
Optuna optimize ODE parameters with selected tau value, (10%, 90%) quantiles of initial guess.

lisphilar commented 3 years ago

With #724, added experimental methods to ODEHandler class. .estimate() calls the following methods internally.

ODEHandler.estimate_tau(): Select tau value [min] which minimize the score of the metric. ODE parameter for each tau value will be guessed by .guess() classmethod of the model (cs.SIR, cs.SIRD, cs.SIRF).
ODEHnadler.estimate_params(): Estimate ODE parameter values of the all phases to minimize the score of the metric, using optuna.

rebeccadavidsson commented 3 years ago

To speed up the process, do you still begin with a range between 0 and 1 for kappa and theta? What happens if you narrow down that range to (0, 0.5) for example? And is it an option to run less trials? For example by checking if RMSLE did not decrease after n trials

lisphilar commented 3 years ago

The value range of kappa was updated from (0,1) to the range of (0.1, 0.9) quantiles of (dF/dt) / I with #711 and #698. Please refer to covsirphy.SIRF.param_range() (called by covsirphy.ode.param_range_estimator._ParamEstimator(),__init__) and covsirphy.SIRF.guess().

The value range of theta is still (0, 1). Yes, we should narrow down it for speed. Theta is near to 0, but how can we decide the range logically? Sorrry, the exact value range cannot be controlled by users at this time.

Options to run less trials are listed in the docstring of ODEHandler.estimate_params().

timeout (int): timeout of optimization
timeout_iteration (int): timeout of one iteration
tail_n (int): the number of iterations to decide whether score did not change for the last iterations

We controll the number of trials with timeout. The listed arguments are also effective in the current Scenario.estimate(). This calls covsirphy.simulation.estimator.Estimator.run() at version 2.19.1.

Inglezos commented 3 years ago

Just to have a clarified updated overview of this, when you have some time, could you please describe the new call tree of the Scenario.estimate()? To see how the ODE solver works now and how the parameters and tau are now calculated and in what order.

Because I got a little confused. Now is tau calculated first instead of last? Shouldn't all the 4 parameters be optimized at the same time? Can we optimize independently only one (out of these 4 parameters) since we do not know the others yet?

Inglezos commented 3 years ago

Also if I understand correctly, currently we assume that theta is fixed 0, not leniently around 0. But the purpose of SIRF (and on the contrary to the SIRD) was to capture such direct deaths and it is very possible these deaths to have been existed or to continue to exist. So we can't totally assume beforehand that theta is always 0. It is okay to use this approximation for calculating kappa's range, but in the .guess() it is wrong to set theta always to 0. We could limit down theta values somehow, close to 0 (but not fixed 0).

Nevertheless, on further thoughts, in the first place, I wonder if we should use the approximation "theta is 0" for kappa's formula at all. I know that it is convenient for extracting kappa formula but then we have to use 0 as theta value which is wrong as I explained and here is the dilemma.

These direct deaths mean that they happened too fast and these people while being sick didn't participate in the I compartment, which means that they didn't infect others. This scenario/case seems probable and thus we cannot set theta to 0 as a general rule. Is .guess() used in the current version or is experimental?

lisphilar commented 3 years ago

Currently, call trees of Scenario.estimate() and Scenario.simulate() are not updated. Please check & update the new strategy with the new experimental class ode.ode_handler.ODEHandler.

With #729, I updated the codes to fix some bugs when using the new class with actual records. This does not change the strategy, but please use version 2.19.1-beta-fu1.

import covsirphy as cs
cs.__version__
# Data
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
population_data = data_loader.population()
country = "Italy"
records_df = jhu_data.records(country, population=population_data.value(country))[0]
# Get change points
trend_detector = cs.TrendDetector(records_df)
_, end_dates = trend_detector.sr().dates()
# Set-up handler
first_date = records_df.Date.min()
handler = cs.ODEHandler(cs.SIRF, first_date)
for date in end_dates:
    handler.add(date, y0_dict=records_df.set_index("Date").loc[date].to_dict())
# Estimate tau and parameter values
tau, info_dict = handler.estimate(records_df)
# Simulation with estimated tau and parameters
handler.simulate()

ODEHandler.estimate() calls the following methods internally.

ODEHandler.estimate_tau(): Select tau value [min] which minimize the total score of all phases. Theta/kappa/sigma/rho set for each tau value will be guessed by .guess() classmethod of the model (cs.SIR, cs.SIRD, cs.SIRF).
ODEHnadler.estimate_params(): Estimate theta/kappa/sigma/rho value set to minimize the score of each phase, using optuna and fixed tau value (determined by ODEHandler.estimate_tau() in advance). Value ranges of parameters are determined by .param_range(), not by .guess().

currently we assume that theta is fixed 0, not leniently around 0.

Theta is not fixed to 0 because we use param_range() for theta/kappa/sigma/rho estimation. .guess() is only for tau estimation now. The value range of theta is still (0, 1) for theta/kappa/sigma/rho estimation as I replied in my previous comment.

I know that it is convenient for extracting kappa formula but then we have to use 0 as theta value which is wrong as I explained and here is the dilemma.

Why we will have to use fixed 0 as theta when the approximation is applied for kappa value range calculation? The estimated kappa value is not always the same as "(dF/dt) / I". The value range of kappa is (10%, 90%) percentiles of "(dF/dt) / I". Thus, the value range of theta can be (0, 1) or around 0, not exact 0.

.guess() used in the current version or is experimental?

.guess() is only for tau estimation in experimental ODEHander.

Inglezos commented 3 years ago

Oh so you mean that all this ODEHandler module is only experimental for the moment? When the scenario analysis runs this never gets called yet?

Theta/kappa/sigma/rho set for each tau value will be guessed by .guess() classmethod of the model

Alright so I think what you mean is that during tau estimation in estimate_tau(), the params need to be calculated in a preliminary way necessarily which is done in _score_tau() -> info_dict[phase]["param"] = self._model.guess(df, tau). Then in estimate_params() they are calculated normally as they were supposed to, without use of guess() right? Yes this makes absolutely sense. So from this perspective the guess() function serves as a way to estimate the params only in order to find a proper tau value by optimizing the study from the scope of tau?

Why we will have to use fixed 0 as theta when the approximation is applied for kappa value range calculation? The estimated kappa value is not always the same as "(dF/dt) / I". The value range of kappa is (10%, 90%) percentiles of "(dF/dt) / I". Thus, the value range of theta can be (0, 1) or around 0, not exact 0.

In param_range() there is the line (*):
# kappa = (dF/dt) / I when theta -> 0
kappa_series = f.diff() / t.diff() / i
Using that formula we can solve for kappa. But this formula isn't valid only when theta = 0? Not in the general case where theta is (0,1) as we have. The original equation is: dF/dt = rho*theta*S*I + kappa*I. In the special case where theta=0 then solving for kappa gives us kappa = (dF/dt) / I.

When theta gets all the values in (0,1) the () does not apply. That's what I don't understand. The fact that then the final kappa gets values in the percentile range 10%-90% of ( ), is independent from the theta value discussion, this is a later step right? Or these percentiles were selected specifically to represent somehow theta? Because I think the percentiles are applied in order to exclude outliers.

Sorry for the many questions but I'm trying to understand :)

lisphilar commented 3 years ago

Oh so you mean that all this ODEHandler module is only experimental for the moment? When the scenario analysis runs this never gets called yet?

Yes, ODEHandler is not linked to Scenario class at this time. Later in this issue, this will be linked to Scenario, the interface for all users.

On a different note, in the old versions, we need to check all features of covsirphy library via analysis.Scenario class. This made it difficult to find errors because call trees are cross-sectional and complexed (parameter estimation: analysis.Scenario -> analysis.ParamTracker -> phase.PhaseSeries -> phase.MPEstimator > phase.PhaseUnit -> simulation.Estimator -> simulation.Simulator -> ode.SIRF). I'm trying to simplify the call trees by creating handling (user interface) classes for each package, like ODEHandler, TrendDetector, RegressionHandler and DataLoader. ODEHandler works independently without the other packages except for covsirphy.util. Now, ODEHandler perform parameter estimation by ode.ODEHandler -> ode. _ParamEstimator and ode._MultiPhaseODESolver -> ode._ODESolver.

Alright so I think what you mean is that ... So from this perspective the guess() function serves as a way to estimate the params only in order to find a proper tau value by optimizing the study from the scope of tau?

Yes, exactly.

Using that formula we can solve for kappa. But this formula isn't valid only when theta = 0? Not in the general case where theta is (0,1) as we have.

Indeed, kappa is not "equal" to "(dF/dt) / I" when theta -> 0. To be precise, when theta -> 0, kappa is near to "(dF/dt) / I".

We know theta is not equal to 0 because direct fatal cases are observed actually in COVID-19 outbreak, but this is near to 0 because the number of direct fatal cases is relatively smaller than in-direct fatal cases. So, kappa is near to "(dF/dt) / I".

When theta gets all the values in (0,1) the (*) does not apply.

Yes, it is more logical to narrow down the value range of theta as @rebeccadavidsson said in the previous comment so that we can describe the approximation logic with the codes (and improve performance). Do you have any ideas to narrow down it? I set (0, 1) for theta because I did not have logical ideas for that.

Inglezos commented 3 years ago

Hmm perhaps for theta range we could run some estimations for various countries and check theta what values normally gets. That would give us an idea for limiting the range I think. I would expect it to be less than for example 0.4 though. A safe first upper limit could be 0.5 because for sure indirect fatal cases are way more than direct ones.

lisphilar commented 3 years ago

It seems difficult to narrow down the value range of theta using dataset... We need to directly update cs.SIRF.param_range(), or to controll the limits of parameter values and percentiles with the arguments of ODEHandler.estimate_params(). (MEMO: for refactoring, I prefer to merge .param_range() and .guess() later because a part of codes are overlaped.)

lisphilar commented 3 years ago

With #734, for SIR, SIR-D, SIR-F model,

model.guess(data, tau, q=0.5) returns 0.5 quantiles of parameter values
model.guess(data, tau, q=(0.1, 0.9)) returns 0.1 and 0.9 quantiles of parameter values

model.param_range() will not be used in parameter estimation and replaced with model.guess(..., q=(0.1, 0.9)).

Note that cs.SIRF.guess(q=<float>)["theta"] is always 0.0 and cs.SIRF.guess(q=<tuple(<float>, <float>)>)["theta"] is always (0.0, 0.5), not (0.0, 1.0).

lisphilar commented 3 years ago

With my testing using ODEHandler, theta << 0.1.

Additional issue for ODEHandler: RMSLE score of each phase was in (0.001, 0.4) ([_dict["RMSLE"] for (phase, _dict) in info_dict.items()]), but cs.line_plot(handler.simulate().set_index("Date").drop("Susceptible", axis=1)) showed un-expected spikes at change points. Figure_1

Inglezos commented 3 years ago

model.param_range() will not be used in parameter estimation and replaced with model.guess(..., q=(0.1, 0.9)).

You mean in the future when the experimental class will be merged to scenario or this is already happening currently after this pull request you merged? guess() in sirf.py is used now for calculating the estimated parameters as well or only for tau as we had concluded two days ago?

lisphilar commented 3 years ago

Please refer the codes of SIRF.guess() and _ParamEstimator.__init__. When tau estimation, SIRF.guess(q=0.5) is called and floating value (0.5 quantile) will be returned. On the other hand, when ODE parameter estimation, SIRF.guess(q=(0.1, 0.5)) is called and tuple of foating values (01 quantile and 0.9 quantile) will be returned. The conclusion (specific value for tau estimation, value range for ODE parameter estimation) is not changed.

Inglezos commented 3 years ago

1.

On the other hand, when ODE parameter estimation, SIRF.guess(q=(0.1, 0.5)) is called and tuple of foating values (01 quantile and 0.9 quantile)

You meant here SIRF.guess(q=(0.1, 0.9)) right?

2. All these, guess(), estimate_params() and _ParamEstimator are inside ODEHandler module, which you said for the moment is experimental only. Also I see that in sirf.py the param_range() function is active and not deprecated yet.

Therefore with the current version, when I call Scenario.estimate(), which one is currently called, guess() or param_range()?

3. The line "theta": 0.0 if isinstance(q, float) else pd.Series([0.0, 0.5]).repeat([1, len(q) - 1]) inside guess() what values sets to theta? for q=float -> theta=0 while for q=tuple=(0.1, 0.9) for example -> theta=(0, 0.5)? The repeat() function means that q could contain more values i.e. q=(0.1, 0.5, 0.9)? Then theta what would be?

lisphilar commented 3 years ago

You meant here SIRF.guess(q=(0.1, 0.9)) right? All these, guess(), estimate_params() and _ParamEstimator are inside ODEHandler module, which you said for the moment is experimental only.

Yes.

Also I see that in sirf.py the param_range() function is active and not deprecated yet.

Because .param_range() is used in Scenario ->... -> Estimator, this is not deprecated yet. When this call tree will be repaced with ODEHandler, param_range() will be deprecated.

Therefore with the current version, when I call Scenario.estimate(), which one is currently called, guess() or param_range()?

param_range() is called because the current call tree is not changed.

The line "theta": 0.0 if isinstance(q, float) else pd.Series([0.0, 0.5]).repeat([1, len(q) - 1]) inside guess() what values sets to theta? for q=float -> theta=0 while for q=tuple=(0.1, 0.9) for example -> theta=(0, 0.5)?

Yes, q=float -> always 0, q=tuple -> always (0.0, 0.5)

The repeat() function means that q could contain more values i.e. q=(0.1, 0.5, 0.9)? Then theta what would be?

q=(0.1, 0.5, 0.9) -> always (0.0, 0.5, 0.5) q=(0.1, 0.5, 0.9, 1.0) -> always (0.0, 0.5, 0.5, 0.5)

Because pandas.Series.quantile() accepts q=(0.1, 0.5, 0.9, 1.0) etc., I made this specification. In ODEHander, as you know, q is or <tuple(float, float)>, not <tuple(float, float, float)>.

lisphilar commented 3 years ago

I said "ODEHandler showed un-expected spikes at change points", but this was due to my mistakes in tesing codes (calculation of initilal values when adding phases). ODEHandler works well.

import pandas as pd
import covsirphy as cs
cs.__version__
# Data
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
population_data = data_loader.population()
country = "Italy"
records_df = jhu_data.records(country, population=population_data.value(country))[0]
# Get change points
trend_detector = cs.TrendDetector(records_df)
start_dates, end_dates = trend_detector.sr().dates()
# Set-up handler
first_date = records_df.Date.min()
handler = cs.ODEHandler(cs.SIRF, first_date)
for (start_date, end_date) in zip(start_dates, end_dates):
    y0_dict = records_df.set_index("Date").loc[start_date].to_dict()
    handler.add(end_date, y0_dict=y0_dict)

# Estimate tau and parameter values
tau, info_dict = handler.estimate(records_df)
# Simulation with estimated tau and parameters
df = handler.simulate()
title = f"{country}: simulated number of cases"
cs.line_plot(df.set_index("Date").drop("Susceptible", axis=1), title=title)
# Summary
pd.DataFrame.from_dict(info_dict, orient="index")

Figure_1

lisphilar commented 3 years ago

Becuase new ODEHandler works, I will try to update Scenario to use ODEHandler internally. Do you have any concerns?

Inglezos commented 3 years ago

Could you first try ODEHandler with a few other countries as well, just to be sure? I think Greece, USA, India, Brazil, Russia and Japan would do. But please plot confirmed as well in the estimation tests.

So after the merge/update, the Scenario analysis will do exactly the same at user/application level as it does currently right?

lisphilar commented 3 years ago

I tried ODEHandler for the countires with version 2.19.1-gamma-fu3 and Google Colab. It worked well I think. Please refer to this notebook. https://gist.github.com/lisphilar/14106fbf1ff0603a5efc27abe04146e7

Scenario will work as the same as it does currently except for Scenario.estimate_history() (pairplot to show the optimization history of Optuna) and Scenario.phase_estimator() (return Estimator class for a phase). They will be removed (raise NotImplementError) because they depend only on simulation.Estimator class.

My plan is as follows.

Will use ODEHandler internally:

[x] Scenario
[x] ExampleData
[x] ModelValidator

Will be deprecated:

[x] simulation.Estimator
[x] simulation.Simulator
[x] phase.MPEstimator
[x] phase.PhaseSeries
[x] phase.PhaseUnit
[x] analysis.ParamTracker: will be replaced with new ~~internal~~ class anlysis.PhaseTracker, directly called by Scenario for each scenario
[x] worldwide.PolicyMeasures: This class was created for forecasting, but Scenario.fit_predict() took over this role.
[x] Scenario.estimate_history()
[x] Scenario.phase_estimator()
[x] model.param_range(), model.specialize() and model.restore() for SIR, SIRD and SIRF

Inglezos commented 3 years ago

So I suppose that all the following top level functions remain unchanged at user level, right?

snl = cs.Scenario(jhu_data, population_data, country)
snl.register(extras=[oxcgrt_data])
snl.records()
snl.trend()
snl.estimate(cs.SIRF)
snl.add()
snl.fit_predict()
snl.simulate(name=...)
snl.history(target=...)

If yes, then we can merge it, I think the countries analysis is fine.

lisphilar commented 3 years ago

Yes, internal codes of them will be changed, but we will keep the usage (method names and arguments).

The plan has not been completed. I will try the plan in this week with pull request(s).

Inglezos commented 3 years ago

Just a quick question about the parameters, in Estimator._compare(), how do we know the actual parameters in order to compare them with the simulated/estimated ones? Do we calculate them somewhere else at a previous point? In param_range()/guess() we determine their range, not their actual values right? That comp_df dataframe in which function is assigned the actual parameters? [Update:] We compare cases not parameters, comment deleted.

lisphilar commented 3 years ago

Yes, we compare the number of cases (like Scenario.score()) for each phase.

lisphilar commented 3 years ago

Memo: New analysis.PhaseTracker will be a public class because Scenario.__get_item__ will return an instance of this object. I updated the to-do list.

Inglezos commented 3 years ago

Just two quick questions:

In param_range()/guess(), the input argument taufree_df/data respectively, is the complete data, from first to last date of pandemic or only of the current phase, and for the rest phases this function is called iteratively?
The difference between guess() call during tau estimation and normal parameters estimation, is that theta is fixed 0 in the first case, while in the second case theta is inside [0. 0.5] quantile range of theta series values?

lisphilar commented 3 years ago

Only the current phase because Estimator and _ParamEstimator are called with records for one phase.
theta is fixed 0 in the first case and theta is exact range (0, 0.5) in the second case. theta series is not used. (It could be used as well as kappa/sigma/rho, but I don't know how to get theta series.)

Inglezos commented 3 years ago

I used timepoints() to compare actual with forecast for example for Italy and I got these results: https://colab.research.google.com/gist/Inglezos/fd4199a7e6803bb75f8d7be9316a523d/regression_countries.ipynb

It is bothering me that the forecast deaths are far from the actual and it seems that the Main simulation is more close to actual. Is this expected or does it mean that the causality measures-cases is more complicated? It seems like death cases are overestimated during forecast. Is this related to theta, kappa range?

lisphilar commented 3 years ago

Hmm, it seems just a issue regarding forecasting, not parameter estimation because this is for future phases. theta and kappa were not predicted as expected with unknown reasons. We need to study the relationship of parameters and indicators. In advance, some tools #644 will be necessary.

[Additional] Did you checked test score with Scenario.fit()["score_test"] and coefficients?

Inglezos commented 3 years ago

I rerun the analysis locally and tried:

ita_fit_df = ita_scenario.fit_predict(oxcgrt_data=oxcgrt_data, delay=(7,45), metric="R2", limits=(3,50), name="Forecast")
ita_fit_dict_result = ita_fit_df._reghandler_dict["Forecast"].to_dict("R2")

and got:

What is the interpretation of this?

Also in another country analysis the train/test scores got negative at some point.

lisphilar commented 3 years ago

Generalization peformance seems low because there is uneven balance between training score and tes score. Could you investigate it in a new issue?

Strengency index has large impact on kappa because absolute value of coefficient is large. However, this index should be removed from indicators for prediction because this has strong relationships with the other indiators. I think this could be done automatically with Elastic Net, but this did not work for your result because l1_ratio and alpha was 0 (calculated by ElasticNetCV).

We need to investigate Elastic Net regression. Or, new machine learing models, including SVR, should be added as regressors.

lisphilar commented 3 years ago

Tasks completed for parameter estimation. Now Scenario uses ODEHandler internally via PhaseTracker.

lisphilar / covid19-sir

[New] alternative approach of estimation: estimate tau and ODE parameters separately #718

Summary of question