lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0
109 stars 44 forks source link

[Discuss] Questions about the SIRF hyperparameter optimization example #394

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary of question

  1. For SIRF Hyperparameter optimization example, why are the estimated parameters so much different than the expected? Is this bad or this is simply just another set of solutions for the parameters, which lead to the same case results? I get for theta = 0.0195 (expected is 0.002) and for kappa = 0.0032 (expected is 0.005).

  2. What does the trajectory plot exactly mean? I see that there is not a single solution set. For this plot I get: sirf_example_parameters

but you have something totally different in the kaggle notebook results for SIRF example: image

lisphilar commented 3 years ago

Yes, low accuracy of parameter estimation with SIR-F is the problem to solve. It may be necessary to change the system of parameter estimation with SIR-F model. We have only four observed variables to estimate four parameters.

The trajectory is for hyperparameter optimization with Optuna package.

Please provide the codes and results with Scenario.summary().

Inglezos commented 3 years ago

I'm just running the sirf_model.py example from the example/ folder. The first image of the results is from the example/output/sirf_model folder that is produced. Apart from the ready example, I get the same result if I also run the code (the same with the notebook):

from pprint import pprint
import matplotlib.pyplot as plt
import pandas as pd
import covsirphy as cs

cs.get_version()
get_ipython().run_line_magic('matplotlib', 'inline')
pd.plotting.register_matplotlib_converters()

# Matplotlib
plt.style.use("seaborn-ticks")
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
plt.rcParams["figure.dpi"] = (120)

data_loader = cs.DataLoader(directory="kaggle/input")
population_data = data_loader.population()

example_data = cs.ExampleData(tau=1440, start_date="01Jan2020")
example_data.cleaned()

# Model name
print(cs.SIRF.NAME)
# Example parameter values
pprint(cs.SIRF.EXAMPLE, compact=True)

model = cs.SIRF
area = {"country": "Full", "province": model.NAME}
# Add records with SIR model
example_data.add(model, **area)
# Records with model variables
df = example_data.specialized(model, **area)
cs.line_plot(
    df.set_index("Date"),
    title=f"Example data of {model.NAME} model",
    y_integer=True
)

eg_r0 = model(model.EXAMPLE["population"], **model.EXAMPLE["param_dict"]).calc_r0()
df = example_data.specialized(model, **area)
x_max = df.loc[df["Infected"].idxmax(), "Susceptible"] / cs.SIRF.EXAMPLE["population"]
(x_max, 1/eg_r0)

# Set population value
population_data.update(cs.SIRF.EXAMPLE["population"], **area)
population_data.value(**area)

# Show records in JHU-style
sirf_snl = cs.Scenario(example_data, population_data, tau=1440, **area)
_ = sirf_snl.records()

# Set phases (phase: explained in "S-R trend analysis section")
# Records at 01Jan2020 will be removed because Recovered = 0
sirf_snl.clear(include_past=True)
sirf_snl.add().summary()

# Parameter estimation
sirf_snl.estimate(cs.SIRF)
sirf_snl.estimate_history("0th")

df = sirf_snl.summary()
setting_model = cs.SIRF(population=cs.SIRF.EXAMPLE["population"], **cs.SIRF.EXAMPLE["param_dict"])
setting_dict = {
    "Population": cs.SIRF.EXAMPLE["population"],
    "ODE": cs.SIRF.NAME,
    "Rt": setting_model.calc_r0(),
    "tau": 1440,
    **setting_model.calc_days_dict(1440),
    **cs.SIRF.EXAMPLE["param_dict"]
}
df = df.append(pd.Series(setting_dict, name="setting"))
df.fillna("-")

sirf_snl.estimate_accuracy("0th")

The sirf_snl.summary() is:

|     | Type   | Start     | End       |   Population | ODE   |   Rt |     theta |      kappa |      rho |     sigma |   tau |   1/alpha2 [day] |   1/beta [day] |   alpha1 [-] |   1/gamma [day] |     RMSLE |   Trials | Runtime      |
|:----|:-------|:----------|:----------|-------------:|:------|-----:|----------:|-----------:|---------:|----------:|------:|-----------------:|---------------:|-------------:|----------------:|----------:|---------:|:-------------|
| 0th | Past   | 02Jan2020 | 29Jun2020 |      1000000 | SIR-F | 2.56 | 0.0195567 | 0.00326198 | 0.203457 | 0.0747316 |  1440 |              306 |              4 |         0.02 |              13 | 0.0972948 |     2196 | 3 min  1 sec |
Inglezos commented 3 years ago

The trajectory is for hyperparameter optimization with Optuna package.

  • Optuna suggest a parameter set and calculate error value (in CovsirPhy, RMSLE)
  • To minimize the error value, Optuna suggest a new parameter set
  • Repeats The figure shows all parameter sets. CovsirPhy uses the last parameter set as the parameter set of a phase.

Yes I mean how the plots are read? For example the rho-sigma plot (row=2, column=3) what does it mean?

lisphilar commented 3 years ago

In the second graph, points of rho-sigma plot are concentrated because hyperparameter optimization (minimization of the error by updating parameter set) was successfully completed.

In the first graph, points of rho-sigma plot are not concentrated because hyperparameter optimization was not done and parameter set in each step was randomly selected without minimization of error.

The cause of the first (not second) figure should be investigated.

Inglezos commented 3 years ago

Is there something wrong with the examples and the code? Or a general bug of the system? Can you please investigate this in order to find a solution? First rerun the code to compare results. This is a very fundamental example analysis, how come to have such an erroneous behavior? I found an error for SIRD as well (different kind).

lisphilar commented 3 years ago

The following result indicates theta and kappa of SIR-F model is estimated with lowa accuracy. They are elated to the same variable (Fatal).

  rho sigma theta kappa
SIR-F 0.203457 0.074732 0.019557 0.003262
set 0.2 0.075 0.002 0.005

With class menthod covsirphy.ode.sirf.SIRF.param_range, the value range of theta and kappa is set as (0, 1). The estimator suggest parameter values within this range. I think this is too wide to get the parameter values with high accuracy.

WIth the same class method, value range of rho is in (30%, 70%) quantile of - population * Susceptible.diff() / Susceptible / Infected. Value range of sigma is in (30%, 70%) quantile of Recovered.diff() / Infected, accoridng to ODE equation of SIR-F model.

Do you have any ideas to narrow the value range of kappa and theta with the dataset?

Inglezos commented 3 years ago
  1. So to get this straight, which results are valid, mine or yours in kaggle? Why they differ? Did I do something wrong in the code or my results are correct?

  2. The first thing to be answered is why this came up now? I mean at some point it gave the result you had, which is more accurate. How did you have the more normal kaggle results in the first place?

  3. Nevertheless, with these parameters, the model seems to make good estimation of the example cases, why is that?

Inglezos commented 3 years ago

Could it mean that there is not a unique solution? Perhaps one parameter set corresponds to some simulated cases, but the inverse, these cases if they are analyzed, can lead to more than one solutions (which is the parameter set). Could both solutions be correct?

lisphilar commented 3 years ago

I re-runed the Kaggle Notebook and I get the following figure. This is expected. __results___102_0

However, in Google Colab, the same script produced the different figure with unknown reason. Plese refer to https://gist.github.com/lisphilar/df00c797b1b3e6041e1efce5b4b1eb3e

Could it mean that there is not a unique solution? Perhaps one parameter set corresponds to some simulated cases, but the inverse, these cases if they are analyzed, can lead to more than one solutions (which is the parameter set). Could both solutions be correct?

Please provide the details.

Inglezos commented 3 years ago

How is this possible? The results should not be different. So for some reason me and collab have the same weird results. Could the cores be the problem? How many are you using? I think this is a high priority problem, because it could affect total model performance.

Please provide the details.

I don't have specific details, it's just a thought.

lisphilar commented 3 years ago

Kaggle Notebook: 4 CPUs, expected figure Google Colab: 2 CPUs, weired Local PC: 8 CPUs, weired

The number of cores does not affect on the results because estimation in one phase is performed with single core.

Becuase SIR and SIR-D do not show the weired figures, we many need to investigate covisrphy/ode/sirf.py

Inglezos commented 3 years ago

I got a little confused, which one shows the weird results, collab or your pc? https://gist.github.com/lisphilar/df00c797b1b3e6041e1efce5b4b1eb3e is your pc results, Kaggle's or collab's?

Could something go wrong when we had changed the timeout and timeout_iteration arguments values in simulation.estimator() ?

Or this has to do with the sirf.py file exclusively?

lisphilar commented 3 years ago

The gist notebook is the result of Collab. The result in Kaggle Notebook is in https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model The result in my local PC with example/sirf_mode.py is here. estimate_history

  rho sigma theta kappa tau Rt alpha1 [-] 1/alpha2 [day] 1/beta [day] 1/gamma [day] RMSLE Trials Runtime
SIR-F 0.203457 0.074732 0.019557 0.003262 1440 2.56 0.02 306 4 13 0.097295 2636 3 min  1 sec
set 0.2 0.075 0.002 0.005 1440                

Because timeout=180 and timeout_iteration=10 are default values of Estimator.run() as well as Scenario.estimate()`, I think this is not related to this issue...

lisphilar commented 3 years ago

Kaggle notebook and the Colob notebook uses Scenario.estimate(), but example/sirf_model.py uses Estimator class directly. Could I update example/sirf_model.py to use Scenario.esitimate()?

Inglezos commented 3 years ago

Which one is more correct? If this were the cause, then collab would be same with kaggle; but this does not happen.

lisphilar commented 3 years ago

The result in my Kaggle Notebook is expected (more correct).

Inglezos commented 3 years ago

Yes of course that seems more correct, but I meant which is the best way to call estimator, using Scenario.estimate() or directly the Estimator class?

lisphilar commented 3 years ago

Sorry for misunderstanding. Because we use Scenario class as the interface of CovsirPhy at this time, it is preferred to call Scenario.estimate() to avoid confusion.

Inglezos commented 3 years ago

Yes I agree, but why the different behavior? Nevertheless, the problem unfortunately remains I think right?

lisphilar commented 3 years ago

The problem remains. I'm reading codes (Scenario.estimate(), PramTracker.estimate(), MPEstimator, PhaseUnit.estimate() and Estimator), but I do not have any ideas to solve this issue.

lisphilar commented 3 years ago

Actually, example/sirf.py and so on are not suitable for example codes. How to crearte theoritiacal (example with user-defined parameter values) is documented in https://lisphilar.github.io/covid19-sir/usage_theoretical.html

We ModelValidator class to validate models.

Inglezos commented 3 years ago

What does this mean? That we should not run parameter estimation on the example codes? No hyper-parameter optimization can be done? Why is that? I thought the estimator analyzes data to estimate parameters, where does this go wrong?

Inglezos commented 3 years ago

Ohhh I just got what you mean, you say that we should use the ModelValidator for the example codes and not the actual estimator!

lisphilar commented 3 years ago

Yes, sirf.py etc. should be replaced with example/model_validation.py and https://lisphilar.github.io/covid19-sir/usage_theoretical.html

Inglezos commented 3 years ago

So what is the purpose then of sir_model.py, sird_model.py, sirf_model.py, sirfv_model.py, sewirf_model.py? Why the estimator fails for these examples? Do these files need to be reworked?

lisphilar commented 3 years ago

The files were created to show the usage of ExampleData class when we did not have example/model_validation.py and https://lisphilar.github.io/covid19-sir/usage_theoretical.html

I think the roles of the files are over and we remove these files from example folder.

sir_model.py, sird_model.py: not failed sirf_model.py: Not failed because accuracy is high, but there is a room to improve SIR-F paramter estimation sirfv_model.py, sewirf_mode.py: parameter estimation should not be applied because the number of parameters is too larger than that of measurable variables.

Inglezos commented 3 years ago

Okay just to understand the general picture, we have a problem with the estimator, the kaggle notebook got the plots right, while me, you and collab have weird estimations for SIRF example. Is the root cause for this the fact that we should use the model validator instead or not? Will this solve the problem? Because if not, then I would say that SIRF failed if we still get the same weird results, different than kaggle's assumed correct ones.

Inglezos commented 3 years ago

In order to realize if this works, how can we get from the model_validation.py the respective plot for "Trajectory of parameter values in hyperparameter estimation" and the "estimated accuracy" for SIRF? I mean the respective plots that correspond to the following kaggle plots: image image

The model_validation.py must be reworked in order to save these plots as images as well.

Inglezos commented 3 years ago

Only then we can compare the results and realize if ModelValidator usage solves the problem indeed.

Inglezos commented 3 years ago

But unfortunately, from some first runs I see that the estimated and actual parameters differ in the saved long_restricted.csv logfile.

lisphilar commented 3 years ago

ModelValidator does not have the method to show the plots, but units_estimated[0]._estimator.history(filename) and units_estimated[0]._estimator.accuracy(filename) may create the figures becuase units_estimated[i] (i=0, 1, 2,...) is a instance of covsirphy.phase.phase_unit.PhaseUnit class.

lisphilar commented 3 years ago

I ran model_validation.py and also found that estimated and actual parameter values show large difference, while SIR and SIR-D successed. We need to revise SIRF class to improve accuracy of estimation., including param_range method.

Inglezos commented 3 years ago

Yes that is not the cause. Perhaps we should add such plotting methods, that would be useful.

Inglezos commented 3 years ago

alpha1 parameter in SIRF and SEWIRF is in [1/min] or dimensionless? Because if it is dimensionless, since it expresses a probability, then it should not be named "mortality rate", it confuses, because rate has dimensions [1/min]. Perhaps "mortality probability" or something like that would be more appropriate.

lisphilar commented 3 years ago

Sorry, alpha1 is dimentionless and documentation should be revised.

Inglezos commented 3 years ago

Shall we name this "direct fatality probability"?

lisphilar commented 3 years ago

I think I used "probability of cases fatal before confirmed". Anyway, we need to revise the term with clear expression.

lisphilar commented 3 years ago

I agree with the idea we use "direct".

Inglezos commented 3 years ago

And in "Control factors of effective contact rate β1" section, in f1 equation, what is the " s " variable before " c " at the end?

Inglezos commented 3 years ago

Actually, wow!, all these control factors for beta, gamma, alpha, where did you find them? Did you come up with this analysis by yourself ??? :O

Inglezos commented 3 years ago

These control factors about beta1, gamma and alpha2 concern only SEWIRF model or are more general and apply to SIRF as well? For SIRF I mean mostly the gamma and alpha2 control factors.

lisphilar commented 3 years ago

I and some Kagglers discussed and found control factors. The controll factors are for SIR-F. ("Exposed" and "Waiting" are variables in SEWIR-F model, but we can use "exposure" and "waiting for diagnosis" as controll factors in SIR-F model.) Yes, with new issues, we can create features of CovsirPhy to discuss parameter values deeply with cotroll factors.

Do you have any ideas regarding improvement of SIRF class?

Inglezos commented 3 years ago
  1. Do you have any references about these equations or the various control factors?

  2. In "Control factors of effective contact rate β1" section, in f1 equation, what is the " s " variable before " c " at the end? As I see there is no description for this factor.

  3. So the control factor analysis for beta1 of the SEWIRF applies to beta of the SIRF as well, and gamma, alpha2 analysis of SEWIRF applies to gamma, alpha2 of SIRF as well? I mean the same control factors are for both models for beta, beta1, gamma, alpha2, right?

  4. No, I don't have currently any ideas to improve SIRF class. Could something be going on with the timeout, trials or n_jobs values?

lisphilar commented 3 years ago

Do you have any references about these equations or the various control factors?

I did not find references except for SIR, (SIR-D) and SEIR model, but we my need to read all comments of the Kaggle Notebook...

In "Control factors of effective contact rate β1" section, in f1 equation, what is the " s " variable before " c " at the end? As I see there is no description for this factor.

Oops, "s" should be removed."s" is not in the next equation "beta1 = 1/49 {}...".

So the control factor analysis for beta1 of the SEWIRF applies to beta of the SIRF as well, and gamma, alpha2 analysis of SEWIRF applies to gamma, alpha2 of SIRF as well? I mean the same control factors are for both models for beta, beta1, gamma, alpha2, right?

Yes, except for beta. Beta in SIR-F model is the function of beta1, beta2 and beta3.

No, I don't have currently any ideas to improve SIRF class. Could something be going on with the timeout, trials or n_jobs values?

Update of timeout/the number of trial seems not impact on performance.

In SIRF.param_range, value range of the parameters are set. Range of rho in SIR-F is (30%, 70%) quantile of "- n * (dS/dt) / S / I". Range of sigma is (30%, 70%) quantile of "(dR/dt) / I".

However, range of theta is (0, 1) and the range of kappa is (0, 1). Can we restrict the value range of theta and kappa?

Inglezos commented 3 years ago

Yes, except for beta. Beta in SIR-F model is the function of beta1, beta2 and beta3.

So for SIRF beta, we could say that it has at least the control factors of SEWIRF's beta1 ? No need to get an equation, just to determine what factors affect beta. You mean we can say beta (from SIRF) = f(beta1,beta2,beta3) (from SEWIRF) ?

However, range of theta is (0, 1) and the range of kappa is (0, 1). Can we restrict the value range of theta and kappa?

Perhaps yes, in the same way with rho and sigma, restricting them to a quantile? Can you try that?

lisphilar commented 3 years ago

You mean we can say beta (from SIRF) = f(beta1,beta2,beta3) (from SEWIRF) ?

Yes.

Perhaps yes, in the same way with rho and sigma, restricting them to a quantile? Can you try that?

I'm trying theta = (0, 0.01) etc., but performance was not improeved... We need to find alternative ways to update Estimator and SIRF class.

Inglezos commented 3 years ago

(also update alpha1 in SEWIRF, is dimensionless [-])

lisphilar commented 3 years ago

How shall we proceed this issue?

Inglezos commented 3 years ago

I am not sure. Maybe this can be integrated into 354 issue? Or is it related to the example model only?

lisphilar commented 3 years ago

This is related to #354 and an issue for SIR-F model and parameter etimation. We will move forward to #354.