Closed. Inglezos closed this issue 3 years ago.
Thank you for your questions! The basic concept was explained in version 3 of my Kaggle notebook with code (13 Feb 2020), but it is not documented at this time. It should be documented in the tutorials and in the glossary (TERM.md) page of the CovsirPhy documentation.
I will start my comments with the call tree.
We use the following classes for parameter estimation.

- `covsirphy.ode.mbase.ModelBase`: base class for ODE model descriptions
- `covsirphy.ode.sirf.SIRF`: defines the SIR-F model
- `covsirphy.simulation.simulator.ODESimulator`: solves the ODE equations for numerical simulation
- `covsirphy.simulation.estimator.Estimator`: estimates parameter values by trial and error

Summary of steps:
a. `optuna` suggests tau/theta/kappa/rho/sigma
b. -> `ODESimulator` solves the ODE equations
c. -> `sklearn` calculates the RMSLE score
d. -> `optuna` suggests new tau/theta/kappa/rho/sigma values to minimize the RMSLE score
e. -> repeat
The details:

- The `SIRF` class and the population value are passed to `Estimator()`.
- `Estimator.__init__` sets the values on the start date as the initial values.
- `Estimator.run()` initiates a study of Optuna via `Estimator._init_study()`, with `Estimator._objective()` as the objective function.
- In `Estimator._objective()`, `ModelBase.tau_free()` divides the dates of the records by tau [min] for non-dimensionalization of time.
- `SIRF.param_range()` sets the value range of the parameter values in `Estimator._objective()`, and `Estimator._suggest()` suggests the next parameter values.
- The `ODESimulator` class solves the ODE equations with `SIRF.__call__()` and `scipy.integrate.solve_ivp` in `ODESimulator._solve_ode()`.
- `Estimator._rmsle()` calculates the RMSLE score from the actual records and the simulated values, using `sklearn` and `ODESimulator`.
- `Estimator._is_in_allowance()`: when the max values of the simulated number of cases are in the allowance, or when the RMSLE score was not improved in the last four trials, the parameter set is returned.

Answers to the remaining questions:
What function in the code solves this ODE system?
`SIRF.__call__()` and `scipy.integrate.solve_ivp` in `ODESimulator._solve_ode()`.
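As a rough illustration of what happens inside `ODESimulator._solve_ode()`, the non-dimensionalized SIR-F system can be solved with `scipy.integrate.solve_ivp` like this. The right-hand side below is a sketch written out from the SIR-F equations, not CovsirPhy's actual `SIRF.__call__()`, and the parameter values are arbitrary:

```python
import numpy as np
from scipy.integrate import solve_ivp

def sirf(t, y, theta, kappa, rho, sigma):
    # Sketch of the SIR-F right-hand side; S/I/R/F are fractions of N
    s, i, r, f = y
    new_infections = rho * s * i
    return [
        -new_infections,                                     # dS/dt
        (1 - theta) * new_infections - (sigma + kappa) * i,  # dI/dt
        sigma * i,                                           # dR/dt
        theta * new_infections + kappa * i,                  # dF/dt
    ]

sol = solve_ivp(
    sirf,
    t_span=(0, 60),
    y0=[0.99, 0.01, 0.0, 0.0],
    args=(0.002, 0.005, 0.4, 0.075),  # theta, kappa, rho, sigma (arbitrary)
    t_eval=np.arange(0, 61),
    rtol=1e-8,
    atol=1e-10,
)
# the four compartments always sum to the (normalized) total population
```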
What is the role of the optuna hyperparameter optimization, what does it handle, the 5 parameters?
To suggest tau/theta/kappa/rho/sigma to minimize RMSLE score.
So Optuna successively tries specific values for these 5 parameters, then calls the above ODE solver, calculates the RMSLE fitting error, and then tries again with another parameter set, trying to minimize the RMSLE error?
Yes.
Do you know how many timepoints (data records for the observed C, R, F per day) are needed theoretically (from ODE system math theory, I mean) in order for the system to converge to a stable solution?
I do not have references, actually, but in my own opinion, records of 3 days could be enough.
Let's say we have some (x, y) records to estimate (A, B) in the equation Ax + By = 0. We need two sets of (x, y) records to calculate (A, B).
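This counting argument can be illustrated with a toy linear system. (Note that the homogeneous equation Ax + By = 0 only pins down the ratio A/B, so the sketch below uses an inhomogeneous variant Ax + By = z to get a unique solution; all numbers are made up.)

```python
import numpy as np

# Two records of (x, y), and the observed z = A*x + B*y for each record
records = np.array([[1.0, 2.0],
                    [3.0, 5.0]])
z = np.array([8.0, 19.0])

# Two unknown coefficients (A, B) need two independent records
A, B = np.linalg.solve(records, z)
```

With fewer records than unknowns the system is underdetermined; with more, it generally becomes a fitting problem instead, which is where the optimizer comes in.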
Do these timepoints determine only the target variables, or can they help to estimate the parameters as well? I mean, if we had 9 timepoints, equal to the total number of unknown entities, could we arrive immediately at a solution without using hyperparameter optimization, or not? How does the number of timepoints in theory affect the solution?
Following my last comment, the system is immediately solved when only 2 timepoints are available and tau is specified. It is unclear to me when tau is not specified.
Thank you very much for the detailed answers!! I think the process is way more clear for me now.
You mean that we need only data from two or at most three days to be able to reach a system solution. The additional timepoints (i.e. 7 or 30, whatever the phase duration is) only help Optuna to optimize the parameters by minimizing RMSLE over all the days of the phase and to select one parameter set that represents all these days, but they do not add anything to the ODE solver itself, right? As long as there are at least these 3 records available to the ODE system, which is always valid since `min_size` was 5 and is now 7.
Some additional questions:
In simulator.py, line 199, `df = df.apply(lambda x: x / sum(x), axis=1)` in `non_dim()` confused me. If I remember correctly, the dimensionless variables (x, y, z, w) are given by (S, I, R, F) = N * (x, y, z, w). So in order to get them, we need to divide the `taufree()` dataframe columns by N, the total country population. But why do we divide by `sum(x)`, which is the sum of each day's records respectively? What am I missing?

> You mean that we need only data from two or at most three days to be able to reach a system solution.
Theoretically, yes. I saw some notebooks calculate the reproduction number of the SIR model on each date using "Rt = beta/gamma = - (dS/dt) / S / I / (dR/dt) * I". I called it "time-dependent SIR". However, to calculate parameter values (beta etc. directly, not the reproduction number here), I selected the "phase-dependent" approach because periodicity and too-large fluctuations made it difficult. For example, in Japan, we see Mon < Tue <= Wed <= Thr < Fri >= Sat > Sun in almost every week due to testing capacity. The phase-dependent approach enables us to ignore the fluctuations and to get physical meanings of the parameter values (e.g. a large rho indicates that people meet each other frequently without masks).
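The "time-dependent SIR" formula quoted above can be sketched with pandas like this. The numbers are toys, the time step is one day, values are taken as fractions of an assumed population N, and discrete differences stand in for the derivatives:

```python
import pandas as pd

N = 1000  # assumed total population (toy value)
df = pd.DataFrame({
    "Susceptible": [990, 970, 940, 900],
    "Infected":    [10, 25, 45, 70],
    "Recovered":   [0, 5, 15, 30],
})
frac = df / N  # work with population fractions

# beta = -(dS/dt) / S / I and gamma = (dR/dt) / I on each date
beta = -frac["Susceptible"].diff() / (frac["Susceptible"] * frac["Infected"])
gamma = frac["Recovered"].diff() / frac["Infected"]
rt = beta / gamma  # reproduction number per date; fluctuates day to day
```

Such per-date values inherit all the day-to-day reporting noise, which is exactly why the phase-dependent approach described above was chosen instead.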
How is tau specified/calculated? How come it is always known/specified beforehand?
"Tau is not set to a predetermined value. (With `covsirphy.phase.phase_estimator.MPEstimator`,) its value is determined by the estimator only during the last phase, and that value is then used for all the previous phases as well" (because the tau value determines how time passes and is expected to be stable across all phases). This is from "tau parameter in models" of the glossary page and discussion #392 with your pull request :-O
https://lisphilar.github.io/covid19-sir/markdown/TERM.html
> (x,y,z,w) are given by (S,I,R,F) = N * (x,y,z,w). So in order to get them, we need to divide the taufree() dataframe columns by N, the total country population. But why do we divide with sum(x), which is the sum of each day's records respectively?
`Simulator.non_dim()` is not called in `Estimator`. This method is only for `ExampleData`. `Estimator._simulate()` calls `Simulator.add()` and `Simulator.taufree()`, not `.non_dim()`, because parameter estimation with a fully non-dimensionalized model has disadvantages, as I noted in #5 (and #1).

Oh yes, we had discussed tau again, haha! So tau is the last parameter that is estimated by Optuna. Until the last phase, Optuna optimizes all the other 4, excluding tau as unknown; can it do that? I mean, how can the error function be minimized if not all the variables are set to their trial values? Can you give me perhaps a call tree for tau usage across the phases?
If we have a stable solution with only 3 timepoints/data records, then could we set all the phases initially to 3 days and then merge similar phases in a bottom-up way? What is the point of using `min_size = 7` or more? Maybe that's why the model validator concluded that 3 is a good value for RMSLE..
> parameter estimation with a fully non-dimensionalized model has disadvantages as I noted in..

Oh, I wasn't aware of those past problems. So `.non_dim()` is not used. But I don't understand that line of code (out of curiosity): `df = df.apply(lambda x: x / sum(x), axis=1)`. I suppose df contains S, I, R, F as columns. The `apply(lambda ...)` with `axis=1` takes each column (all column records) and divides it by its sum, right? So it's like we do S/sum(S), I/sum(I), R/sum(R), F/sum(F)? That's what I don't get.
> I mean how can the error function be minimized if not all the variables are set to their trial values? Can you give me perhaps a call tree for tau usage across the phases?
The internal process of `Estimator`? Or the `Scenario` -> `Estimator` process?
> could we set all the phases initially to 3 days and then merge similar phases in a bottom-up way?
Note that three-day phases are meaningful only when these phases do not include fluctuations. To avoid misleadingly worsened scores (sorry, I could not find the right term) when merging phases, the initial length of the phases should be longer than 3. Perhaps 7 could be selected at this time.
`df = df.apply(lambda x: x / sum(x), axis=1)`: `axis=1` is sometimes confusing, but this means S/sum(S, I, R, F), I/sum(S, I, R, F) and so on. Please run the following code. `df.sum(axis=1)` calculates the total value of each row.
```python
import pandas as pd
df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [0.1, 0.2, 0.3],
        "C": [0.01, 0.02, 0.03]
    }
)
df.sum(axis=1).tolist()
# -> [1.11, 2.22, 3.3299999999999996]: A+B+C in each row
df.apply(lambda x: sum(x), axis=1).tolist()
# -> [1.11, 2.22, 3.3299999999999996]: A+B+C in each row
```
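Continuing that example, applying the row-wise division shows that each value is divided by its row total (S/sum(S, I, R, F) and so on in the simulator's case), not by its column sum:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 6.0]})
# each value divided by the total of its own row
normalized = df.apply(lambda x: x / sum(x), axis=1)
# -> both rows become [0.25, 0.75], and each row sums to 1
```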
On a separate note, `JHUData` calculates "Susceptible" and `Estimator` re-calculates "Susceptible" using `SIRF.specialize()` internally. I will refactor them later.
I created draft pull request #706. The value range of kappa of the SIR-F model will be updated from (0, 1) to the (30%, 70%) quantiles of "Fatal.diff() / Time.diff() / Infected". This is because "kappa = (dF/dt) / I" when theta is around 0.
I don't understand; can you please explain why this is valid? Yes, `kappa = (dF/dt) / I` when theta is around 0, but how is the value range associated with the case where `theta=0`, and why these specific percentages?
Optuna searches for the best parameter values (all of theta/kappa/rho/sigma) within the value range returned by the `SIRF.param_range()` classmethod. In the current version, the range of theta/kappa is (0, 1). The (30%, 70%) percentiles of `- n * (dS/dt) / S / I` are used for rho, and the (30%, 70%) percentiles of `(dR/dt) / I` are used for sigma. I determined 30% and 70% by testing at very old versions, and I will update the pull request to control these percentages with `Scenario.estimate()` via `Estimator.run()` later.
Because "S* -> F" has a very low probability of occurrence, we can assume theta is around 0, i.e. kappa is near `(dF/dt) / I`.
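The proposed quantile-based range can be sketched as follows. The column names and toy numbers are assumptions for illustration, not the actual dataframe used in the pull request:

```python
import pandas as pd

df = pd.DataFrame({
    "Time":     [0, 1, 2, 3, 4],            # elapsed tau-free time steps
    "Infected": [100, 150, 210, 260, 300],
    "Fatal":    [1, 3, 6, 10, 15],
})
# kappa ~ (dF/dt) / I when theta is around 0,
# approximated per record as Fatal.diff() / Time.diff() / Infected
kappa_series = df["Fatal"].diff() / df["Time"].diff() / df["Infected"]
kappa_min, kappa_max = kappa_series.quantile([0.3, 0.7])
```

The resulting (kappa_min, kappa_max) pair would replace the generic (0, 1) search range, narrowing the search to values consistent with the observed records.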
I updated pull request #706. With this pull request, we can change the quantiles with `Scenario.estimate(quantiles=(0.3, 0.7))` (the default at this time).
I tested some values for Italy. With my local PC and example/scenario.analysis.py, around (0.1, 0.9) seems to be better for Italy at this time.

- `quantiles=(0.30, 0.70)`: RMSLE=0.128, 3 min 33 sec
- `quantiles=(0.20, 0.80)`: RMSLE=0.0796, 3 min 13 sec
- `quantiles=(0.10, 0.90)`: RMSLE=0.0731, 2 min 54 sec
- `quantiles=(0.05, 0.95)`: RMSLE=0.0734, 2 min 43 sec

Would it be wrong if we loosened up the range and just let them be (0, 1)? It is then of course up to the optimizer to choose the right value considering the error. What is the benefit of the percentile restriction, shorter runtime? Because, as we see, this is not the case now. I mean, initially, what made you think of this restriction?
I did not try optimization without restrictions in newer versions, but I think Optuna can find the right parameter values more easily if the value range is narrow (yes, too narrow is not good). Also, when the value range is wide, Optuna may find some parameter sets which have a lower RMSLE but are unreasonable with respect to the data.
I updated the pull request.

- `quantiles=(0.1, 0.9)` as default
- replaced `Estimator.run(timeout_iteration=10, allowance=(0.98, 1.02))` with `Estimator.run(timeout_iteration=5, allowance=(0.99, 1.01))`. The Estimator checks the RMSLE and whether the max simulated value of each variable is in the allowance of "(0.99, 1.01) * max truth value" every 5 seconds.

Pull request #706 also fixes the negative Rt issue #668 in some countries, including Greece/Italy/Japan, by improving parameter estimation. Can we merge it into release 2.19.0 this week?
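A hypothetical sketch of that allowance check; the helper name and arguments below are made up for illustration and do not match the actual signature of `Estimator._is_in_allowance()`:

```python
def is_in_allowance(simulated_max, actual_max, allowance=(0.99, 1.01)):
    """Return True when every variable's simulated maximum lies within
    allowance * the maximum of the actual records."""
    return all(
        actual * allowance[0] <= sim <= actual * allowance[1]
        for sim, actual in zip(simulated_max, actual_max)
    )

print(is_in_allowance([100.5, 49.8], [100.0, 50.0]))  # True
print(is_in_allowance([90.0, 50.0], [100.0, 50.0]))   # False
```

Tightening the allowance from (0.98, 1.02) to (0.99, 1.01) makes this early-stopping criterion stricter, so estimation only stops early when the simulation matches the records closely.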
Could you please check also for USA, Poland and Russia? I think these had major forecast problems as well..
The negative reproduction number discussed in #668 will be removed from the Italy/Greece/Japan/Poland/Russia data. All results (`Scenario.summary()`) with pull request #706 are here:
https://gist.github.com/lisphilar/5ee175940557316e6baa1ccf275892ca
The RMSLE score with USA data will be improved (0.27 -> 0.24), but issue #668 with USA will not be fixed. (The RMSLE scores in the 0th/1st phases are large, 0.79 and 0.24 respectively. Is this related?)
I will merge the pull request with issue #711 because parameter estimation will be improved. Discussion in #668 is still necessary to remove negative reproduction number in USA.
To simplify parameter estimation, do you have any ideas on how to determine the tau value without the Estimator? (i.e. Is it possible to estimate independently the best tau value for a country and the ODE parameter values for the phases?)
With pull request #720, to fix the last date of the simulation results and to simplify the call tree of the simulation, I created the `ode.ode_solver._ODESolver` (internal only) and `ode.ode_datahandler.ODEHandler` (public) classes. They are not used in the `Scenario` class at this time. The call tree of `Scenario.simulate()` will be changed later and `ODESimulator` will be deprecated.
Please check the usage of these classes with tests/test_ode.py::TestODEHandler.
This discussion will be continued in #718. The call tree may be changed there.
Summary of question
I wanted to ask you about the theory behind the ODE system solution; it is something I have been wondering about these days. Sorry if my questions are perhaps naive.
If I understand correctly, we have 4 unknown target variables: S, I, R, F. Also, we have a1, a2, β, γ (constants within a phase), so another 4 unknown parameters. And finally, the unknown tau parameter. So I count 9 unknown entities (4 target variables + 5 constant parameters). And the ODE system consists of 4 equations.
Sorry for all these questions; I needed to record them all together somewhere, and I understand that it would be difficult for you to answer them all. Please take your time and just tell me your thoughts on the ones you don't know for sure, as a discussion. It would be very useful for me and would help me to unlock some ambiguous topics in the estimation process. In any case, don't feel overwhelmed by these at all; I just thought to discuss them because it will be useful for us and for all the users who read this, to clarify the underlying theory behind the model estimation and fitting.
I stumbled upon these questions while dealing with issue #693 and the `min_size` value.