Closed: Inglezos closed this issue 3 years ago.
Yes, the low accuracy of parameter estimation with SIR-F is the problem to solve. It may be necessary to change the parameter estimation system for the SIR-F model: we have only four observed variables with which to estimate four parameters.
The trajectory shows hyperparameter optimization with the Optuna package.
Please provide the codes and results with `Scenario.summary()`.
I'm just running the `sirf_model.py` example from the `example/` folder. The first image of the results is from the generated `example/output/sirf_model` folder.
Apart from the ready-made example, I get the same result if I run the following code directly (the same as in the notebook):
```python
from pprint import pprint
import matplotlib.pyplot as plt
import pandas as pd
import covsirphy as cs
cs.get_version()
get_ipython().run_line_magic('matplotlib', 'inline')
pd.plotting.register_matplotlib_converters()
# Matplotlib
plt.style.use("seaborn-ticks")
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
plt.rcParams["figure.dpi"] = 120
data_loader = cs.DataLoader(directory="kaggle/input")
population_data = data_loader.population()
example_data = cs.ExampleData(tau=1440, start_date="01Jan2020")
example_data.cleaned()
# Model name
print(cs.SIRF.NAME)
# Example parameter values
pprint(cs.SIRF.EXAMPLE, compact=True)
model = cs.SIRF
area = {"country": "Full", "province": model.NAME}
# Add records with the SIR-F model
example_data.add(model, **area)
# Records with model variables
df = example_data.specialized(model, **area)
cs.line_plot(
    df.set_index("Date"),
    title=f"Example data of {model.NAME} model",
    y_integer=True
)
eg_r0 = model(model.EXAMPLE["population"], **model.EXAMPLE["param_dict"]).calc_r0()
df = example_data.specialized(model, **area)
x_max = df.loc[df["Infected"].idxmax(), "Susceptible"] / cs.SIRF.EXAMPLE["population"]
(x_max, 1 / eg_r0)
# Set population value
population_data.update(cs.SIRF.EXAMPLE["population"], **area)
population_data.value(**area)
# Show records in JHU-style
sirf_snl = cs.Scenario(example_data, population_data, tau=1440, **area)
_ = sirf_snl.records()
# Set phases (phase: explained in the "S-R trend analysis" section)
# Records at 01Jan2020 will be removed because Recovered = 0
sirf_snl.clear(include_past=True)
sirf_snl.add().summary()
# Parameter estimation
sirf_snl.estimate(cs.SIRF)
sirf_snl.estimate_history("0th")
df = sirf_snl.summary()
setting_model = cs.SIRF(population=cs.SIRF.EXAMPLE["population"], **cs.SIRF.EXAMPLE["param_dict"])
setting_dict = {
    "Population": cs.SIRF.EXAMPLE["population"],
    "ODE": cs.SIRF.NAME,
    "Rt": setting_model.calc_r0(),
    "tau": 1440,
    **setting_model.calc_days_dict(1440),
    **cs.SIRF.EXAMPLE["param_dict"]
}
df = df.append(pd.Series(setting_dict, name="setting"))
df.fillna("-")
sirf_snl.estimate_accuracy("0th")
```
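As background for the `(x_max, 1/eg_r0)` check in the script above: in the plain SIR limit, infections peak exactly when the susceptible fraction S/N reaches 1/R0 (since dI/dt = 0 there). A minimal Euler-step SIR simulation can illustrate this identity; it is a standalone sketch, not CovsirPhy code, and the beta/gamma values are chosen arbitrarily for illustration:

```python
# Minimal SIR simulation (Euler steps) illustrating that infections peak
# when the susceptible fraction x = S/N reaches 1/R0.
N = 1_000_000                 # total population (arbitrary)
beta, gamma = 0.2, 0.075      # arbitrary contact / recovery rates
r0 = beta / gamma             # basic reproduction number
S, I, R = N - 1000, 1000, 0   # initial conditions
dt = 0.1                      # Euler time step
peak_x, peak_i = 1.0, I       # track susceptible fraction at the infection peak
for _ in range(20000):
    dS = -beta * S * I / N    # dS/dt of the SIR model
    dR = gamma * I            # dR/dt of the SIR model
    S += dS * dt
    I += (-dS - dR) * dt      # dI/dt = -dS/dt - dR/dt
    R += dR * dt
    if I > peak_i:
        peak_i, peak_x = I, S / N
print(peak_x, 1 / r0)         # the two values should nearly coincide
```

With these rates, `peak_x` lands close to 1/R0 = 0.375, which is what the `x_max` comparison in the example script is probing.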
The `sirf_snl.summary()` output is:
| | Type | Start | End | Population | ODE | Rt | theta | kappa | rho | sigma | tau | 1/alpha2 [day] | 1/beta [day] | alpha1 [-] | 1/gamma [day] | RMSLE | Trials | Runtime |
|:----|:-------|:----------|:----------|-------------:|:------|-----:|----------:|-----------:|---------:|----------:|------:|-----------------:|---------------:|-------------:|----------------:|----------:|---------:|:-------------|
| 0th | Past | 02Jan2020 | 29Jun2020 | 1000000 | SIR-F | 2.56 | 0.0195567 | 0.00326198 | 0.203457 | 0.0747316 | 1440 | 306 | 4 | 0.02 | 13 | 0.0972948 | 2196 | 3 min 1 sec |
The trajectory shows hyperparameter optimization with the Optuna package:

- Optuna suggests a parameter set and calculates the error value (in CovsirPhy, RMSLE).
- To minimize the error value, Optuna suggests a new parameter set.
- This repeats.

The figure shows all parameter sets. CovsirPhy uses the last parameter set as the parameter set of a phase.
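The suggest-evaluate-keep-best cycle above can be pictured with a toy sketch. This is illustrative only: plain random search stands in for Optuna's sampler, and `toy_objective` is a made-up error surface, but the RMSLE metric matches the definition used in the discussion:

```python
import math
import random

def rmsle(actual, predicted):
    """Root mean squared logarithmic error (the metric CovsirPhy minimizes)."""
    return math.sqrt(sum(
        (math.log(1 + a) - math.log(1 + p)) ** 2
        for a, p in zip(actual, predicted)
    ) / len(actual))

def toy_objective(rho, sigma):
    # Hypothetical comparison of observed vs simulated cases: pretend the
    # true values are rho=0.2, sigma=0.075, with error growing with distance.
    actual = [100, 200, 300]
    predicted = [v * (1 + abs(rho - 0.2) + abs(sigma - 0.075)) for v in actual]
    return rmsle(actual, predicted)

random.seed(0)
best = (float("inf"), None)
history = []                      # all trials: this is what the trajectory plot shows
for _ in range(200):
    # "Suggest" a parameter set (random search in place of Optuna's TPE sampler)
    rho, sigma = random.uniform(0, 1), random.uniform(0, 1)
    err = toy_objective(rho, sigma)
    history.append((rho, sigma, err))
    if err < best[0]:             # keep the best parameter set found so far
        best = (err, (rho, sigma))
print(best)
```

When optimization works, later trials cluster near the minimum, which is why concentrated points in the rho-sigma trajectory panel indicate a successful run.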
Yes, I mean: how should the plots be read? For example, what does the rho-sigma plot (row 2, column 3) mean?
In the second graph, the points of the rho-sigma plot are concentrated because hyperparameter optimization (minimization of the error by updating the parameter set) completed successfully.
In the first graph, the points of the rho-sigma plot are not concentrated because optimization did not take place: the parameter set at each step was selected randomly, without minimizing the error.
The cause of the first (not the second) figure should be investigated.
Is there something wrong with the examples and the code? Or is it a general bug in the system? Can you please investigate this to find a solution? First, rerun the code to compare results. This is a very fundamental example analysis; how can it show such erroneous behavior? I found an error for SIR-D as well (of a different kind).
The following result indicates that theta and kappa of the SIR-F model are estimated with low accuracy. They are related to the same variable (Fatal).
| | rho | sigma | theta | kappa |
|---|---|---|---|---|
| SIR-F | 0.203457 | 0.074732 | 0.019557 | 0.003262 |
| set | 0.2 | 0.075 | 0.002 | 0.005 |
With the class method `covsirphy.ode.sirf.SIRF.param_range`, the value range of theta and kappa is set to (0, 1). The estimator suggests parameter values within this range. I think this is too wide to obtain the parameter values with high accuracy.
With the same class method, the value range of rho is the (30%, 70%) quantile of `- population * Susceptible.diff() / Susceptible / Infected`, and the value range of sigma is the (30%, 70%) quantile of `Recovered.diff() / Infected`, according to the ODE equations of the SIR-F model.
Do you have any ideas to narrow the value range of kappa and theta with the dataset?
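The quantile-based ranges described above can be sketched as follows. The toy data frame is made up, and the kappa analogue (`Fatal.diff() / Infected`, exact only when theta is negligible, since dF/dt = theta * rho/n * S * I + kappa * I in SIR-F) is a hypothetical suggestion, not CovsirPhy's actual `param_range` implementation:

```python
import numpy as np
import pandas as pd

n = 1_000_000  # total population
# Toy JHU-style daily records (made-up numbers for illustration)
df = pd.DataFrame({
    "Susceptible": np.linspace(n, 0.8 * n, 30),
    "Infected": np.linspace(1_000, 50_000, 30),
    "Recovered": np.linspace(0, 30_000, 30),
    "Fatal": np.linspace(0, 3_000, 30),
})

def quantile_range(series, low=0.3, high=0.7):
    """(30%, 70%) quantile range of a candidate series, as described above."""
    s = series.replace([np.inf, -np.inf], np.nan).dropna()
    return (s.quantile(low), s.quantile(high))

# rho candidates: since dS/dt = -rho/n * S * I, rho ~ -n * (dS/dt) / S / I
rho_range = quantile_range(- n * df["Susceptible"].diff() / df["Susceptible"] / df["Infected"])
# sigma candidates: since dR/dt = sigma * I, sigma ~ (dR/dt) / I
sigma_range = quantile_range(df["Recovered"].diff() / df["Infected"])
# hypothetical kappa candidates: dF/dt ~ kappa * I when theta is small
kappa_range = quantile_range(df["Fatal"].diff() / df["Infected"])
print(rho_range, sigma_range, kappa_range)
```

The open question is whether a data-driven range like `kappa_range` is tight enough in practice, given that theta and kappa both feed the Fatal variable.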
So, to get this straight: which results are valid, mine or yours on Kaggle? Why do they differ? Did I do something wrong in the code, or are my results correct?
The first thing to answer is why this came up now. At some point it produced the result you had, which is more accurate. How did you get the more normal Kaggle results in the first place?
Nevertheless, with these parameters the model still seems to estimate the example cases well; why is that?
Could it mean that there is not a unique solution? Perhaps one parameter set corresponds to some simulated cases, but the inverse does not hold: if those cases are analyzed, they can lead to more than one solution (parameter set). Could both solutions be correct?
I re-ran the Kaggle Notebook and got the following figure. This is expected.
However, in Google Colab, the same script produced a different figure for an unknown reason. Please refer to https://gist.github.com/lisphilar/df00c797b1b3e6041e1efce5b4b1eb3e
> Could it mean that there is not a unique solution? Perhaps one parameter set corresponds to some simulated cases, but the inverse does not hold: if those cases are analyzed, they can lead to more than one solution (parameter set). Could both solutions be correct?
Please provide the details.
How is this possible? The results should not differ. So for some reason, Colab and I get the same weird results. Could the number of cores be the problem? How many are you using? I think this is a high-priority problem, because it could affect overall model performance.
> Please provide the details.
I don't have specific details, it's just a thought.
- Kaggle Notebook: 4 CPUs, expected figure
- Google Colab: 2 CPUs, weird figure
- Local PC: 8 CPUs, weird figure
The number of cores does not affect the results, because estimation in one phase is performed on a single core.
Because SIR and SIR-D do not show the weird figures, we may need to investigate `covsirphy/ode/sirf.py`.
I got a little confused: which one shows the weird results, Colab or your PC? Is https://gist.github.com/lisphilar/df00c797b1b3e6041e1efce5b4b1eb3e your PC's results, Kaggle's, or Colab's?
Could something have gone wrong when we changed the `timeout` and `timeout_iteration` argument values in `simulation.estimator()`? Or does this concern the `sirf.py` file exclusively?
The gist notebook is the result from Colab. The result in the Kaggle Notebook is at https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model. The result on my local PC with example/sirf_model.py is here:
| | rho | sigma | theta | kappa | tau | Rt | alpha1 [-] | 1/alpha2 [day] | 1/beta [day] | 1/gamma [day] | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SIR-F | 0.203457 | 0.074732 | 0.019557 | 0.003262 | 1440 | 2.56 | 0.02 | 306 | 4 | 13 | 0.097295 | 2636 | 3 min 1 sec |
| set | 0.2 | 0.075 | 0.002 | 0.005 | 1440 | | | | | | | | |
Because `timeout=180` and `timeout_iteration=10` are the default values of `Estimator.run()` as well as `Scenario.estimate()`, I think this is not related to this issue...

The Kaggle notebook and the Colab notebook use `Scenario.estimate()`, but example/sirf_model.py uses the `Estimator` class directly. Could I update example/sirf_model.py to use `Scenario.estimate()`?
Which one is more correct? If this were the cause, then Colab would match Kaggle, but that does not happen.
The result in my Kaggle Notebook is the expected (more correct) one.
Yes, of course that seems more correct, but I meant: which is the better way to call the estimator, using `Scenario.estimate()` or the `Estimator` class directly?
Sorry for the misunderstanding. Because we use the `Scenario` class as the interface of CovsirPhy at this time, it is preferable to call `Scenario.estimate()` to avoid confusion.
Yes, I agree, but why the different behavior? Nevertheless, the problem unfortunately remains, I think, right?
The problem remains. I'm reading the code (`Scenario.estimate()`, `ParamTracker.estimate()`, `MPEstimator`, `PhaseUnit.estimate()` and `Estimator`), but I do not have any ideas to solve this issue.
Actually, example/sirf.py and so on are not suitable as example codes. How to create theoretical data (examples with user-defined parameter values) is documented at https://lisphilar.github.io/covid19-sir/usage_theoretical.html
We use the `ModelValidator` class to validate models.
What does this mean? That we should not run parameter estimation on the example codes? That no hyperparameter optimization can be done? Why is that? I thought the estimator analyzes data to estimate parameters; where does this go wrong?
Ohhh, I just got what you mean: you are saying we should use `ModelValidator` for the example codes and not the actual estimator!
Yes, sirf.py etc. should be replaced with example/model_validation.py and https://lisphilar.github.io/covid19-sir/usage_theoretical.html
So what is the purpose, then, of `sir_model.py`, `sird_model.py`, `sirf_model.py`, `sirfv_model.py` and `sewirf_model.py`? Why does the estimator fail for these examples? Do these files need to be reworked?
The files were created to show the usage of the `ExampleData` class when we did not have example/model_validation.py and https://lisphilar.github.io/covid19-sir/usage_theoretical.html
I think the role of these files is over, and we can remove them from the example folder.

- sir_model.py, sird_model.py: did not fail
- sirf_model.py: did not fail because accuracy is high, but there is room to improve SIR-F parameter estimation
- sirfv_model.py, sewirf_model.py: parameter estimation should not be applied because the number of parameters is much larger than the number of measurable variables
Okay, just to understand the general picture: we have a problem with the estimator; the Kaggle notebook got the plots right, while you, Colab and I get weird estimations for the SIR-F example. Is the root cause the fact that we should be using the model validator instead, or not? Will that solve the problem? If not, then I would say SIR-F failed, since we would still get the same weird results, different from Kaggle's assumed-correct ones.
To see whether this works, how can we get from `model_validation.py` the respective plots for "Trajectory of parameter values in hyperparameter estimation" and the "estimated accuracy" for SIR-F? I mean the plots that correspond to the following Kaggle plots:

`model_validation.py` must be reworked to save these plots as images as well. Only then can we compare the results and see whether ModelValidator usage indeed solves the problem. But unfortunately, from some first runs, I see that the estimated and actual parameters differ in the saved `long_restricted.csv` log file.
`ModelValidator` does not have a method to show the plots, but `units_estimated[0]._estimator.history(filename)` and `units_estimated[0]._estimator.accuracy(filename)` may create the figures, because `units_estimated[i]` (i = 0, 1, 2, ...) is an instance of the `covsirphy.phase.phase_unit.PhaseUnit` class.
I ran model_validation.py and also found that the estimated and actual parameter values show a large difference, while SIR and SIR-D succeeded.
We need to revise the `SIRF` class, including the `param_range` method, to improve the accuracy of estimation.
Yes, that is not the cause. Perhaps we should add such plotting methods; that would be useful.
Is the alpha1 parameter in SIR-F and SEWIR-F in [1/min], or dimensionless? If it is dimensionless then, since it expresses a probability, it should not be named "mortality rate"; that is confusing, because a rate has dimensions [1/min]. Perhaps "mortality probability" or something similar would be more appropriate.
Sorry, alpha1 is dimensionless, and the documentation should be revised.
Shall we name this "direct fatality probability"?
I think I used "probability of cases fatal before confirmed". Anyway, we need to revise the term with a clearer expression.
I agree with the idea of using "direct".
And in the "Control factors of effective contact rate β1" section, in the f1 equation, what is the "s" variable before "c" at the end?
Actually, wow! All these control factors for beta, gamma, alpha: where did you find them? Did you come up with this analysis by yourself? :O
Do these control factors for beta1, gamma and alpha2 concern only the SEWIR-F model, or are they more general and apply to SIR-F as well? For SIR-F I mean mostly the gamma and alpha2 control factors.
Some Kagglers and I discussed this and found the control factors. The control factors are for SIR-F. ("Exposed" and "Waiting" are variables in the SEWIR-F model, but we can use "exposure" and "waiting for diagnosis" as control factors in the SIR-F model.) Yes, with new issues, we can create CovsirPhy features to discuss parameter values in depth with control factors.
Do you have any ideas regarding improvement of the `SIRF` class?
Do you have any references about these equations or the various control factors?
In the "Control factors of effective contact rate β1" section, in the f1 equation, what is the "s" variable before "c" at the end? As I can see, there is no description of this factor.
So the control-factor analysis for beta1 of SEWIR-F applies to beta of SIR-F as well, and the gamma and alpha2 analysis of SEWIR-F applies to gamma and alpha2 of SIR-F as well? I mean, the same control factors hold for both models for beta, beta1, gamma, alpha2, right?
No, I don't currently have any ideas to improve the SIRF class. Could something be going on with the timeout, trials or n_jobs values?
> Do you have any references about these equations or the various control factors?
I did not find references except for the SIR, (SIR-D) and SEIR models, but we may need to read all the comments of the Kaggle Notebook...
> In the "Control factors of effective contact rate β1" section, in the f1 equation, what is the "s" variable before "c" at the end? As I can see, there is no description of this factor.
Oops, the "s" should be removed. "s" is not in the next equation, "beta1 = 1/49 {}...".
> So the control-factor analysis for beta1 of SEWIR-F applies to beta of SIR-F as well, and the gamma and alpha2 analysis of SEWIR-F applies to gamma and alpha2 of SIR-F as well? I mean, the same control factors hold for both models for beta, beta1, gamma, alpha2, right?
Yes, except for beta. Beta in the SIR-F model is a function of beta1, beta2 and beta3.
> No, I don't currently have any ideas to improve the SIRF class. Could something be going on with the timeout, trials or n_jobs values?
Updating the timeout / the number of trials seems not to impact performance.
In `SIRF.param_range`, the value ranges of the parameters are set.
The range of rho in SIR-F is the (30%, 70%) quantile of "- n * (dS/dt) / S / I".
The range of sigma is the (30%, 70%) quantile of "(dR/dt) / I".
However, the range of theta is (0, 1) and the range of kappa is (0, 1). Can we restrict the value ranges of theta and kappa?
> Yes, except for beta. Beta in the SIR-F model is a function of beta1, beta2 and beta3.
So for SIR-F's beta, could we say that it has at least the control factors of SEWIR-F's beta1? No need to derive an equation, just to determine which factors affect beta. You mean we can say beta (from SIR-F) = f(beta1, beta2, beta3) (from SEWIR-F)?
> However, the range of theta is (0, 1) and the range of kappa is (0, 1). Can we restrict the value ranges of theta and kappa?
Perhaps yes, in the same way as with rho and sigma, restricting them to a quantile? Can you try that?
> You mean we can say beta (from SIR-F) = f(beta1, beta2, beta3) (from SEWIR-F)?
Yes.
> Perhaps yes, in the same way as with rho and sigma, restricting them to a quantile? Can you try that?
I'm trying theta = (0, 0.01) etc., but performance was not improved...
We need to find alternative ways to update the `Estimator` and `SIRF` classes.
(Also, update alpha1 in SEWIR-F: it is dimensionless [-].)
How shall we proceed with this issue?
I am not sure. Maybe this can be integrated into issue #354? Or is it related to the example model only?
This is related to #354 and is an issue for the SIR-F model and parameter estimation. We will move forward in #354.
Summary of question
For the SIR-F hyperparameter optimization example, why are the estimated parameters so different from the expected ones? Is this bad, or is it simply another solution set for the parameters which leads to the same case results? I get theta = 0.0195 (expected: 0.002) and kappa = 0.0032 (expected: 0.005).
What exactly does the trajectory plot mean? I see that there is not a single solution set. For this plot I get:
but you have something totally different in the kaggle notebook results for SIRF example: