lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[Question] Theory behind ODE system solution, Optuna optimization and estimation #698

Closed. Inglezos closed this issue 3 years ago.

Inglezos commented 3 years ago

Summary of question

I wanted to ask you about the theory behind the ODE system solution; it is something I have been wondering about these days. Sorry if some of these questions are perhaps naive.

If I understand correctly, we have 4 unknown target variables: S, I, R, F. We also have a1, a2, β, γ (constants within a phase), so another 4 unknown parameters, and finally the unknown tau parameter. So I count 9 unknown entities (4 target variables + 5 constant parameters), and the ODE system consists of 4 equations.
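For reference, the SIR-F system in question (as defined in the CovsirPhy documentation, with N the total population) is:

dS/dt = - β S I / N
dI/dt = (1 - a1) β S I / N - (γ + a2) I
dR/dt = γ I
dF/dt = a1 β S I / N + a2 I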

  1. What function in the code solves this ODE system?
  2. What is the role of the Optuna hyperparameter optimization; what does it handle, the 5 parameters? So Optuna consecutively tries specific values for these 5 parameters, then calls the ODE solver, calculates the RMSLE fitting error, and then tries again with another parameter set, trying to minimize the RMSLE error? Could you please give me a call tree for this whole procedure, just to have an overview?
  3. Do you know how many timepoints (data records for the observed C, R, F per day) are theoretically needed (from ODE system math theory, I mean) for the system to converge to a stable solution? As many as the number of target variables? Or perhaps even more, because of the multiple unknown parameters? I haven't used ODE systems for a while now, so I am kind of rusty.
  4. Do these timepoints determine only the target variables, or can they help estimate the parameters as well? I mean, if we had 9 timepoints, equal to the total number of unknown entities, could we arrive immediately at a solution without using hyperparameter optimization? How does the number of timepoints affect the solution in theory?

Sorry for all these questions; I needed to record them all together somewhere, and I understand it would be difficult to answer them all. Please take your time, and for the ones you don't know for sure, just share your thoughts as a discussion. It would be very useful for me and would help me clear up some ambiguous topics in the estimation process. In any case, don't feel overwhelmed by these at all; I just thought discussing them would be useful for us and for all the users who read this, to clarify the underlying theory behind the model estimation and fitting.

I stumbled upon these questions while dealing with issue #693 and the min_size value.

lisphilar commented 3 years ago

Thank you for your questions! The basic concept was explained in my Kaggle notebook version 3 with code (13 Feb 2020), but it is not documented at this time. It should be documented in the tutorials and the glossary (TERM.md) page of the CovsirPhy documentation.

I will start my comments with the call tree.

We use two classes for parameter estimation.

Summary of steps:

  a. Optuna suggests tau/theta/kappa/rho/sigma
  b. ODESimulator solves the ODE equations
  c. sklearn calculates the RMSLE score
  d. Optuna suggests a new tau/theta/kappa/rho/sigma set to minimize the RMSLE score
  e. Repeat

The details:

  1. We apply the records (Date, Confirmed, I, R, F), SIRF and the population value to Estimator()
  2. Estimator.__init__() sets the values on the start date as the initial values
  3. We call Estimator.run(), and this method initiates an Optuna study via Estimator._init_study()
  4. Optuna suggests a tau value from the divisors of 1440 in Estimator._objective()
  5. ModelBase.tau_free() divides the dates of the records by tau [min] for non-dimensionalization of time
  6. SIRF.param_range() sets the value ranges of the parameters in Estimator._objective()
  7. Optuna suggests parameter values (theta, kappa, rho, sigma) with Estimator._suggest()
  8. The ODESimulator class solves the ODE equations with SIRF.__call__() and scipy.integrate.solve_ivp in ODESimulator._solve_ode()
  9. Estimator._rmsle() calculates the RMSLE score from the actual records and the simulated values, using sklearn and ODESimulator
  10. If we hit the time-out, or the simulated values are acceptable (Estimator._is_in_allowance(): the max values of the simulated number of cases are within the allowance), or the RMSLE score was not improved in the last four trials, return the parameter set
  11. If not, Optuna suggests a new tau and four parameter values to minimize the RMSLE score, then we repeat (a sketch of this loop is shown below)
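To make this concrete, here is a minimal, self-contained sketch of the loop (not CovsirPhy's actual code: the toy simulate() stands in for ODESimulator, the data is synthetic, and only the optuna and sklearn calls mirror the real steps):

import numpy as np
import optuna
from sklearn.metrics import mean_squared_log_error

# Synthetic observed data: hypothetical daily infected counts
observed = np.array([10, 12, 15, 19, 24, 30, 37], dtype=float)

# Step 4: tau candidates are the divisors of 1440
TAU_CANDIDATES = [t for t in range(1, 1441) if 1440 % t == 0]

def simulate(rho, sigma):
    # Toy stand-in for ODESimulator._solve_ode(): exponential growth/decay
    t = np.arange(len(observed))
    return observed[0] * np.exp((rho - sigma) * t)

def objective(trial):
    # Steps 4 and 7: Optuna suggests tau and the parameter values
    tau = trial.suggest_categorical("tau", TAU_CANDIDATES)  # unused in this toy
    rho = trial.suggest_float("rho", 0.0, 1.0)
    sigma = trial.suggest_float("sigma", 0.0, 1.0)
    simulated = simulate(rho, sigma)  # step 8: solve the ODE system
    # Step 9: RMSLE score with sklearn
    return np.sqrt(mean_squared_log_error(observed, simulated))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)  # steps 10-11: repeat until stopping
print(study.best_params)

In the real Estimator, tau rescales time before the solve (step 5), and the stopping rule also checks Estimator._is_in_allowance() (step 10); both are omitted here for brevity.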

Answers to the remaining questions:

What function in the code solves this ODE system?

SIRF.__call__() and scipy.integrate.solve_ivp in ODESimulator._solve_ode().
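As an illustration only (a sketch, not the library's implementation): a function returning the SIR-F derivatives, which is the role of SIRF.__call__(), is handed to scipy.integrate.solve_ivp. The parameter values below are arbitrary, and theta/kappa/rho/sigma are treated as daily rates for simplicity (the library non-dimensionalizes time with tau first):

import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000  # total population

def sirf(t, y, theta, kappa, rho, sigma):
    # SIR-F right-hand side; y = (S, I, R, F)
    s, i, r, f = y
    new_infections = rho * s * i / N
    return [
        -new_infections,                                     # dS/dt
        (1 - theta) * new_infections - (sigma + kappa) * i,  # dI/dt
        sigma * i,                                           # dR/dt
        theta * new_infections + kappa * i,                  # dF/dt
    ]

y0 = [N - 100, 100, 0, 0]  # initial S, I, R, F on the start date
sol = solve_ivp(
    sirf, t_span=(0, 30), y0=y0, t_eval=np.arange(31),
    args=(0.002, 0.005, 0.2, 0.075),  # theta, kappa, rho, sigma (arbitrary)
)
S, I, R, F = sol.y  # simulated trajectories, one value per day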

What is the role of the Optuna hyperparameter optimization; what does it handle, the 5 parameters?

To suggest tau/theta/kappa/rho/sigma values that minimize the RMSLE score.

So Optuna consecutively tries specific values for these 5 parameters, then calls the ODE solver, calculates the RMSLE fitting error, and then tries again with another parameter set, trying to minimize the RMSLE error?

Yes.

Do you know how many timepoints (data records for the observed C, R, F per day) are theoretically needed (from ODE system math theory, I mean) for the system to converge to a stable solution?

I do not have references, actually, but in my own opinion, records from 3 days could be enough.

Let's say we have some (x, y) records and want to estimate (A, B) in the equation Ax + By = 0. We need two sets of (x, y) records to calculate (A, B).

Do these timepoints determine only the target variables, or can they help estimate the parameters as well? I mean, if we had 9 timepoints, equal to the total number of unknown entities, could we arrive immediately at a solution without using hyperparameter optimization? How does the number of timepoints affect the solution in theory?

Following my last comment, the system could be solved immediately when only 2 timepoints are available and tau is specified. It is unclear to me what happens when tau is not specified.

Inglezos commented 3 years ago

Thank you very much for the detailed answers!! I think the process is much clearer to me now.

You mean that we need data from only two or at most three days to be able to reach a system solution. The additional timepoints (i.e. 7 or 30, whatever the phase duration is) only help Optuna optimize the parameters, by minimizing the RMSLE over all the days of the phase and selecting one parameter set that represents all those days, but they do not add anything to the ODE solver itself, right? As long as at least these 3 records are available to the ODE system, which is always the case, since min_size was 5 and is now 7.

Some additional questions:

  1. How is tau specified/calculated? How come it is always known/specified beforehand?
  2. I began to read the code from the call tree, just to get a better overview. I stumbled upon simulator.py line 199: df = df.apply(lambda x: x / sum(x), axis=1) in non_dim(), which confused me. If I remember correctly, the dimensionless variables (x, y, z, w) are given by (S, I, R, F) = N * (x, y, z, w). So in order to get them, we need to divide the taufree() dataframe columns by N, the total country population. But why do we divide by sum(x), which is the sum of each day's records respectively? What am I missing?

lisphilar commented 3 years ago

You mean that we need data from only two or at most three days to be able to reach a system solution.

Theoretically, yes. I have seen some notebooks calculate the reproduction number of the SIR model on each date, using "Rt = beta/gamma = - (dS/dt) / S / I / (dR/dt) * I". I call this "time-dependent SIR". However, to calculate the parameter values themselves (beta etc. directly, not the reproduction number), I selected the "phase-dependent" approach, because periodicity and too-large fluctuations made the time-dependent one difficult. For example, in Japan we see Mon < Tue <= Wed <= Thu < Fri >= Sat > Sun in almost every week, due to testing capacity. The phase-dependent approach enables us to ignore these fluctuations and to get physical meanings from the parameter values (e.g. a large rho indicates that people meet each other frequently without masks).
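For illustration only (not the library's code), the quoted "time-dependent SIR" calculation can be written with pandas diffs over hypothetical daily S/I/R records; running it on real data shows exactly the weekly fluctuations described above:

import pandas as pd

# Hypothetical daily records (real ones would come from JHUData)
df = pd.DataFrame({
    "S": [999000.0, 998100.0, 997000.0, 995700.0, 994200.0],
    "I": [800.0, 1500.0, 2300.0, 3200.0, 4300.0],
    "R": [200.0, 400.0, 700.0, 1100.0, 1500.0],
})
beta = -df["S"].diff() / (df["S"] * df["I"])  # -(dS/dt) / S / I, with dt = 1 day
gamma = df["R"].diff() / df["I"]              # (dR/dt) / I
rt = beta / gamma                             # Rt = beta / gamma, one value per day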

How is tau specified/calculated? How come it is always known/specified beforehand?

"Tau is not set to a predetermined value. (With covsirphy.phase.phase_estimator.MPEstimator,) its value is determined by the estimator only during the last phase and then uses that value for all the previous phases as well" (because tau value determines how time passes and is expected to be stable in all phases). This is from "tau parameter in models" of glosary page and discussion #392 with your pull request :-O https://lisphilar.github.io/covid19-sir/markdown/TERM.html

(x, y, z, w) are given by (S, I, R, F) = N * (x, y, z, w). So in order to get them, we need to divide the taufree() dataframe columns by N, the total country population. But why do we divide by sum(x), which is the sum of each day's records respectively?

Inglezos commented 3 years ago

Oh yes, we had discussed tau again, haha! So tau is the last parameter that is estimated by Optuna. Until the last phase, Optuna optimizes the other 4 with tau excluded as an unknown; can it do that? I mean, how can the error function be minimized if not all the variables are set to their trial values? Can you perhaps give me a call tree for tau usage across the phases?

If we have a stable solution with only 3 timepoints/data records, could we then set all the phases initially to 3 days and merge similar phases in a bottom-up way? What is the point of using min_size = 7 or more? Maybe that's why the model validator concluded that 3 is a good value for the RMSLE..

parameter estimation with a fully non-dimensionalized model has disadvantages, as I noted in..

Oh, I wasn't aware of those past problems. So .non_dim() is not used. But I don't understand that line of code (out of curiosity): df = df.apply(lambda x: x / sum(x), axis=1). I suppose df contains S, I, R, F as columns. The apply(lambda ...) with axis=1 takes each column (all of a column's records) and divides it by its sum, right? So it's like we do S/sum(S), I/sum(I), R/sum(R), F/sum(F)? That's what I don't get.

lisphilar commented 3 years ago

I mean, how can the error function be minimized if not all the variables are set to their trial values? Can you perhaps give me a call tree for tau usage across the phases?

Do you mean the internal process of Estimator? Or the Scenario -> Estimator process?

could we set all the phases initially to 3 days and then merge similar phases in a bottom-up way?

Note that three-day phases are meaningful only when those phases do not include fluctuations. To avoid misleadingly worsened scores (sorry, I could not find the right term) when merging phases, the initial length of the phases should be longer than 3 days. Perhaps 7 could be selected at this time.

df = df.apply(lambda x: x / sum(x), axis=1)

axis=1 is sometimes confusing, but this means S/sum(S, I, R, F), I/sum(S, I, R, F) and so on. Please run the following code; df.sum(axis=1) calculates the total value of each row.

import pandas as pd
df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [0.1, 0.2, 0.3],
        "C": [0.01, 0.02, 0.03]
    }
)
df.sum(axis=1).tolist()
# -> [1.11, 2.22, 3.3299999999999996]: A+B+C in each row
df.apply(lambda x: sum(x), axis=1).tolist()
# -> [1.11, 2.22, 3.3299999999999996]: A+B+C in each row
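# Likewise, the line in question divides each value by its own row total,
# so "Susceptible" is divided by S+I+R+F of the same day, not by sum(S)
# over all days (this continuation is added here for clarity):
df.apply(lambda x: x / sum(x), axis=1).sum(axis=1).tolist()
# -> [1.0, 1.0, 1.0] (up to floating-point error): every row sums to 1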

On a separate note, JHUData calculates "Susceptible", and Estimator re-calculates "Susceptible" internally using SIRF.specialize(). I will refactor this later.

lisphilar commented 3 years ago

I created draft pull request #706. The value range of kappa in the SIR-F model will be updated from (0, 1) to the (30%, 70%) quantiles of "Fatal.diff() / Time.diff() / Infected". This is because kappa = (dF/dt) / I when theta is around 0.

Inglezos commented 3 years ago

I don't understand; can you please explain why this is valid? Yes, kappa = (dF/dt) / I when theta is around 0, but how is the value range tied to the case where theta = 0, and why these specific percentages?

lisphilar commented 3 years ago

Optuna searches for the best parameter values (all of theta/kappa/rho/sigma) within the value ranges returned by the SIRF.param_range() classmethod. In the current version, the range of theta/kappa is (0, 1). The (30%, 70%) quantiles of -n * (dS/dt) / S / I are used for rho, and the (30%, 70%) quantiles of (dR/dt) / I for sigma. I determined 30% and 70% by testing in very old versions, and I will update the pull request later so that these percentages can be controlled with Scenario.estimate() via Estimator.run().

Because "S* -> F" has very low probability of occurrence, we can assume theta is around 0. i.e. kappa is near to (dF/dt) / I.

lisphilar commented 3 years ago

I updated pull request #706. With this pull request, we can change the quantiles with Scenario.estimate(quantiles=(0.3, 0.7)) (the default at this time).

I tested some values for Italy. With my local PC and example/scenario_analysis.py, around (0.1, 0.9) seems to be better for Italy at this time.

Inglezos commented 3 years ago

Would it be wrong if we loosened the range and just let it be (0, 1)? Then it is of course up to the optimizer to choose the right value considering the error. What is the benefit of the quantile restriction, a shorter runtime? Because, as we see, this is not the case now. I mean, what initially made you think of this restriction?

lisphilar commented 3 years ago

I have not tried optimization without restrictions in newer versions, but I think Optuna can find the right parameter values more easily if the value range is narrow (yes, too narrow is not good either). Also, when the value range is wide, Optuna may find some parameter sets which have a lower RMSLE but are unreasonable given the data.

lisphilar commented 3 years ago

I updated the pull request.

lisphilar commented 3 years ago

Pull request #706 also fixes the negative Rt issue #668 in some countries, including Greece/Italy/Japan, by improving the parameter estimation. Can we merge it for release 2.19.0 this week?

Inglezos commented 3 years ago

Could you please also check USA, Poland and Russia? I think these had major forecast problems as well..

lisphilar commented 3 years ago

The negative reproduction number discussed in #668 will be removed from the Italy/Greece/Japan/Poland/Russia data. All results (Scenario.summary()) with pull request #706 are here: https://gist.github.com/lisphilar/5ee175940557316e6baa1ccf275892ca

The RMSLE score with the USA data will be improved (0.27 -> 0.24), but issue #668 with the USA will not be fixed. (The RMSLE scores in the 0th/1st phases are large, 0.79 and 0.24 respectively. Is this related?)

I will merge the pull request with issue #711, because parameter estimation will be improved. Discussion in #668 is still necessary to remove the negative reproduction number for the USA.

lisphilar commented 3 years ago

To simplify parameter estimation, do you have any ideas on how to determine the tau value without Estimator? (i.e. is it possible to estimate the best tau value for a country independently of the ODE parameter values for the phases?)

lisphilar commented 3 years ago

With pull request #720, to fix the last date of the simulation results and to simplify the call tree of the simulation, I created the ode.ode_solver._ODESolver (internal only) and ode.ode_datahandler.ODEHandler (public) classes. They are not used in the Scenario class at this time. The call tree of Scenario.simulate() will be changed later, and ODESimulator will be deprecated. Please check the usage of these classes in tests/test_ode.py::TestODEHandler.

lisphilar commented 3 years ago

This discussion will be continued in #718. The call tree may change there.