lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[Discuss] USA weird death cases predictions depending on random execution trials #272

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary

For the USA, the confirmed cases predictions for 24Oct20 are only a little off, by ~250,000, but the death predictions are totally wrong (actual ~229,000, predicted ~455,000, almost double). This is what I executed the first time:

Codes and outputs:

   usa_scenario = cs.Scenario(jhu_data, pop_data, "USA")
   usa_scenario.records().tail()
   _ = usa_scenario.trend()
   usa_scenario.summary()
   usa_scenario.estimate(cs.SIRF)
   usa_scenario.summary()
   _ = usa_scenario.history_rate()
   usa_scenario.clear()
   usa_scenario.add(days=7)
   usa_pred = usa_scenario.simulate()  # observe usa_pred

usa_weird_death_predictions_values

I found a weird way to overcome this though. If I run the trends and the estimator again, without reinitializing the usa_scenario object:

   _ = usa_scenario.trend()
   usa_scenario.summary()
   usa_scenario.estimate(cs.SIRF)
   usa_scenario.summary()
   _ = usa_scenario.history_rate()
   usa_scenario.clear()
   usa_scenario.add(days=7)
   usa_pred = usa_scenario.simulate()  # observe usa_pred

Then the deaths are very accurate (off only by ~5000).

usa_normal_death_predictions_values

Why is this happening and how can this be solved? Please further investigate.

Also, at one point I noticed a third behavior. On some execution trials the weird death predictions followed the same weird triangular pattern (as in the Greece death predictions, see #271, and the Russia/Spain death predictions, see #273). In that case, even if I rerun the analysis as described above without reinitializing the scenario object, the error still persists. It is very weird and frustrating, and seems to depend randomly on the execution trial.

usa_weird_death_predictions_graph

Environment

lisphilar commented 3 years ago

Please help me by trying Scenario.estimate(cs.SIRF, timeout=120), mentioned in #274, to find the cause.

lisphilar commented 3 years ago

Except for the 0th phase (09Feb2020 - 05May2020), I did not find weird results with COVID-19 Data Hub. We need to focus on the 0th phase. https://github.com/lisphilar/covid19-sir/blob/master/example/scenario_analysis_usa.py

Actual data: usa_records

Simulated data: usa_simulate

Summary:

| Phase | Start | End | Rt | theta | kappa | rho | sigma | tau | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | 9-Feb-20 | 5-May-20 | 5 | 0.00102 | 0.001213 | 0.021878 | 0.003155 | 720 | 20.79586 | 1030 | 1 min 0 sec |
| 1st | 6-May-20 | 14-Jun-20 | 2.1 | 0.032899 | 0.000237 | 0.009478 | 0.004134 | 720 | 0.190943 | 1054 | 1 min 0 sec |
| 2nd | 15-Jun-20 | 5-Jul-20 | 3.62 | 0.001569 | 0.000191 | 0.01195 | 0.003101 | 720 | 0.036643 | 1060 | 1 min 0 sec |
| 3rd | 6-Jul-20 | 18-Jul-20 | 4.39 | 0.009637 | 1.43E-05 | 0.015224 | 0.003423 | 720 | 0.01313 | 232 | 0 min 15 sec |
| 4th | 19-Jul-20 | 29-Jul-20 | 2.43 | 0.001015 | 0.00015 | 0.012798 | 0.005119 | 720 | 0.008411 | 229 | 0 min 15 sec |
| 5th | 30-Jul-20 | 12-Aug-20 | 2.11 | 0.016288 | 3.93E-05 | 0.008762 | 0.004049 | 720 | 0.00849 | 531 | 0 min 35 sec |
| 6th | 13-Aug-20 | 27-Aug-20 | 1.8 | 0.020675 | 1.42E-05 | 0.006329 | 0.003422 | 720 | 0.007471 | 226 | 0 min 15 sec |
| 7th | 28-Aug-20 | 13-Sep-20 | 1.62 | 0.003562 | 7.91E-05 | 0.005084 | 0.003047 | 720 | 0.008261 | 147 | 0 min 10 sec |
| 8th | 14-Sep-20 | 29-Sep-20 | 1.66 | 0.003181 | 9.62E-05 | 0.005095 | 0.002954 | 720 | 0.010162 | 312 | 0 min 20 sec |
| 9th | 30-Sep-20 | 12-Oct-20 | 2.34 | 0.001674 | 5.43E-05 | 0.005576 | 0.002324 | 720 | 0.006276 | 156 | 0 min 10 sec |
| 10th | 13-Oct-20 | 22-Oct-20 | 2.39 | 0.004536 | 9.26E-05 | 0.006864 | 0.002772 | 720 | 0.006384 | 159 | 0 min 10 sec |
| 11th | 23-Oct-20 | 1-Nov-20 | 2.57 | 0.000555 | 0.000111 | 0.007181 | 0.00268 | 720 | 0.007266 | 756 | 0 min 30 sec |

Inglezos commented 3 years ago

That's interesting, because if I run the USA analysis not with that script but with the following lines (which I guess do essentially the same thing, since they are taken from the Kaggle notebook analysis), it shows very weird and wrong results (I ran with CovsirPhy v2.9.1-delta.new.290.fix.282.283.296):

import covsirphy as cs

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

def main():
    cs.get_version()
    get_ipython().run_line_magic('matplotlib', 'inline')
    pd.plotting.register_matplotlib_converters()

    np.random.seed(123)
    os.environ["PYTHONHASHSEED"] = "123"
    # Matplotlib
    plt.style.use("seaborn-ticks")
    plt.rcParams["xtick.direction"] = "in"
    plt.rcParams["ytick.direction"] = "in"
    plt.rcParams["font.size"] = 11.0
    plt.rcParams["figure.figsize"] = (9, 6)
    # Pandas
    pd.set_option("display.max_colwidth", 1000)

    data_loader = cs.DataLoader(directory="kaggle/input")
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    usa_scenario = cs.Scenario(jhu_data, population_data, "USA")
    usa_scenario.records().tail()
    _ = usa_scenario.trend()
    usa_scenario.summary()
    usa_scenario.estimate(cs.SIRF)
    usa_scenario.summary()
    _ = usa_scenario.history_rate()
    usa_scenario.clear()
    usa_scenario.add(days=7)
    usa_scenario.simulate().tail(7).style.background_gradient(axis=0)

if __name__ == "__main__":
    main()

usa_custom_weird_analysis1 usa_custom_weird_analysis2 usa_custom_weird_analysis3 usa_custom_weird_analysis4

What is going on? Is this an Estimator class bug, or am I doing something very wrong? Is there something wrong in the code I am running? Can you run the same code yourself?

If I run your code then I get your normal results, except for one time when I got results similar to mine. Why is this happening? You are saving the results to CSVs and figures locally, while I am printing them inline in the Spyder console with matplotlib, and I have more imports; could any of that be the problem? It would be weird though. Could it indeed be random, happening on random execution trials for some unexplained reason?

For example, even this code fails:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

from pathlib import Path
import warnings
import covsirphy as cs

def main():
    # cs.get_version()
    # get_ipython().run_line_magic('matplotlib', 'inline')
    # pd.plotting.register_matplotlib_converters()

    # # Matplotlib
    # plt.style.use("seaborn-ticks")
    # plt.rcParams["xtick.direction"] = "in"
    # plt.rcParams["ytick.direction"] = "in"
    # plt.rcParams["font.size"] = 11.0
    # plt.rcParams["figure.figsize"] = (9, 6)

    warnings.simplefilter("error")
    # Create output directory in example directory
    code_path = Path(__file__)
    input_dir = code_path.parent.with_name("input")
    output_dir = code_path.with_name("output").joinpath(code_path.stem)
    output_dir.mkdir(exist_ok=True, parents=True)
    # Load datasets
    data_loader = cs.DataLoader(input_dir)
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    # Set country name
    country = "USA"
    abbr = "usa"
    figpath = functools.partial(
        filepath, output_dir=output_dir, country=abbr, ext="jpg")
    # Start scenario analysis
    snl = cs.Scenario(jhu_data, population_data, country)
    # Show records
    record_df = snl.records(filename=figpath("records"))
    save_df(record_df, "records", output_dir, abbr, use_index=False)
    # Show S-R trend
    snl.trend(filename=figpath("trend"))
    print(snl.summary())

    # Parameter estimation
    snl.estimate(cs.SIRF)
    snl.history("Rt", filename=figpath("history_rt_past"))
    # Add future phase to main scenario
    snl.add(name="Main", days=7)
    # Simulation of the number of cases
    sim_df = snl.simulate(name="Main", filename=figpath("simulate"))
    save_df(sim_df, "simulate", output_dir, abbr, use_index=False)
    # Parameter history
    for item in ["Rt", "rho", "sigma", "Infected"]:
        snl.history(item, filename=figpath(f"history_{item.lower()}"))
    # Change rate of parameters in main scenario
    snl.history_rate(name="Main", filename=figpath("history_rate"))
    snl.history_rate(
        params=["kappa", "sigma", "rho"],
        name="Main", filename=figpath("history_rate_without_theta"))
    # Save summary as a CSV file
    summary_df = snl.summary()
    save_df(summary_df, "summary", output_dir, abbr)

def filepath(name, output_dir, country, ext="jpg"):
    """
    Return filepath of a figure.
    Args:
        name (str): name of the figure
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        ext (str, optional): Extension of the output file. Defaults to "jpg".
    Returns:
        pathlib.Path: filepath of the output file
    """
    return output_dir.joinpath(f"{country}_{name}.{ext}")

def save_df(df, name, output_dir, country, use_index=True):
    """
    Save the dataframe as a CSV file.
    Args:
        df (pandas.DataFrame): dataframe
        name (str): name of the dataframe
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        use_index (bool): if True, include index
    """
    df.to_csv(output_dir.joinpath(f"{name}.csv"), index=use_index)

if __name__ == "__main__":
    main()

[UPDATE]: After many tests, I conclude that there might be some problem with the imports I use; somehow they cause this error. I don't know why, and it is very frustrating! What do you think? Can you try to run the above code? One definitive test for you to reproduce it is to add these imports into your scenario_analysis_usa.py code, right before main(), instead of your original four:

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
# from IPython.display import display, Markdown
import math
import os
from pprint import pprint
# import matplotlib.pyplot as plt
# import matplotlib.cm as cm
# import matplotlib
# from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

import functools
from pathlib import Path
import warnings
import covsirphy as cs

I did just that to your scenario_analysis_usa.py script and it leads to the same error: usa_simulate

Inglezos commented 3 years ago
lisphilar commented 3 years ago

I tried the scenario_analysis_usa.py code, importing seaborn and sympy, without def main(), 5 times, but I could not reproduce the weird result. Please kindly share the output of Scenario.summary() so I can see the RMSLE scores and runtime.

Inglezos commented 3 years ago

I will post these results too when I get home from work. Are you running this on Windows in an interactive console? I think the root cause is probably in the multiprocessing.

lisphilar commented 3 years ago

I ran this code in

  1. local environment: Windows subsystem for Linux / Ubuntu (8 CPUs)
  2. Google Colaboratory
Inglezos commented 3 years ago

And what plots do you get for USA? Is it possible for you to try it also in Spyder?

lisphilar commented 3 years ago

Please see my codes and outputs with Google Colab (2 CPUs, Python 3.6.9) on https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4

I have not installed or used Spyder, but I will try it tomorrow. I'm using Bash in WSL and VS Code for writing and running code.

Could you try Scenario.estimate(cs.SIRF, n_jobs=1) to stop multiprocessing?

Inglezos commented 3 years ago

Have you tried Spyder? I tried to use n_jobs=1; sometimes it runs smoothly (after a very long time), but usually it has similar results: usa_custom_weird_deaths1 I also ran your single .py file and got this result: usa_custom_weird_deaths2

which is very wrong about the simulated death cases at the end (the quickly rising line):

21-10-20 8401849 224660 4869492 3307697
22-10-20 8474386 225862 4918341 3330183
23-10-20 8547633 227075 4967664 3352894
24-10-20 8621600 228301 5017465 3375834
25-10-20 8696290 229539 5067748 3399003
26-10-20 8704446 225714 5018277 3460455
27-10-20 8785132 249457 5049937 3485738
28-10-20 8866306 273344 5081782 3511180
29-10-20 8947973 297376 5113814 3536783
30-10-20 9030133 321553 5146033 3562547
31-10-20 9112790 345876 5178441 3588473
01-11-20 9195945 370346 5211037 3614562
02-11-20 9279604 394964 5243824 3640816
03-11-20 9363767 419731 5276801 3667235
04-11-20 9448437 444646 5309971 3693820
05-11-20 9533617 469712 5343333 3720572
06-11-20 9619310 494929 5376889 3747492
07-11-20 9705518 520297 5410640 3774581
08-11-20 9792244 545818 5444586 3801840
09-11-20 9879491 571492 5478729 3829270
10-11-20 9967261 597320 5513069 3856872
11-11-20 10055557 623303 5547607 3884647
12-11-20 10144383 649442 5582345 3912596
lisphilar commented 3 years ago

I installed Spyder 4 and Python 3.8 with Anaconda3 on Windows for the first time. Then I installed CovsirPhy version 2.10.0 (the code is the same as 2.9.1-theta) and ran the following code.

import seaborn as sns
import sympy as sym
import covsirphy as cs

cs.__version__

# Load datasets
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu(verbose=False)
population_data = data_loader.population(verbose=False)
# Set country name
country = "United States"
abbr = "usa"
# Start scenario analysis
snl = cs.Scenario(jhu_data, population_data, country)

# Show records
snl.records().tail()

# Show S-R trend
snl.trend().summary()

# Parameter estimation
snl.estimate(cs.SIRF)
snl.summary()

# Add future phase to main scenario
snl.add(name="Main", days=7)
# Simulation of the number of cases
snl.simulate(name="Main").tail()

However, parameter estimation for the phases other than the 11th phase was not completed after > 10 min. image

lisphilar commented 3 years ago

I'm trying a solution from https://stackoverflow.com/questions/48078722/no-multiprocessing-print-outputs-spyder (Run > Configuration per file > Execute in an external system terminal), but the code does not finish.

lisphilar commented 3 years ago

Is it necessary to upload CovsirPhy package to Anaconda?

Inglezos commented 3 years ago

Is it necessary to upload CovsirPhy package to Anaconda?

I don't think so. You can run "pip install covsirphy" inside a conda environment console first and then launch Spyder from there (at least that's what I am doing).

lisphilar commented 3 years ago

I did the pip installation in the conda console and got the error I mentioned above. Are any other settings necessary for multiprocessing in Spyder?

Or, please share the output of Scenario.summary() to discuss runtime and the number of trials.

Inglezos commented 3 years ago

As far as I know, no. Multiprocessing runs automatically, unless specified otherwise. But as I said earlier, it may lead to problematic results inside an interactive Python console, such as Spyder's.
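For context, a minimal sketch (not CovsirPhy code) of why multiprocessing misbehaves in interactive consoles on Windows: the "spawn" start method re-imports the main module in every worker, so pool-creating code that is not behind an `if __name__ == "__main__":` guard, which is effectively the situation in an interactive session, can run recursively, hang, or produce inconsistent results. The standard guard pattern looks like this:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def main():
    # On Windows ("spawn"), each worker re-imports this module; without the
    # __main__ guard below, this Pool would be re-created in every child
    # process, which hangs or raises in consoles such as Spyder's.
    with Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    print(main())  # [1, 4, 9]
```

Comparing a guarded plain-script run against the line-by-line Spyder runs might help isolate whether the console is the variable here.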

However, parameter estimation in phases except for 11th phase was not completed for > 10 min.

How did you determine that? Oh, and I thought you were referring to an error! This displayed message is normal behavior, I guess; only the last phase result is printed to the console! This happens in my executions as well, every time, with multiprocessing. But I don't think this is an error. My result with multiprocessing is:

<SIR-F model: parameter estimation>
Running optimization with 12 CPUs...
11th phase (27Oct2020 - 06Nov2020): finished  630 trials in 1 min  1 sec
Completed optimization. Total: 2 min  9 sec
Scenario.summary() result after having executed the estimator with multiprocessing:

| Phase | Type | Start | End | Population | ODE | Rt | theta | kappa | rho | sigma | tau | 1/alpha2 [day] | 1/gamma [day] | 1/beta [day] | alpha1 [-] | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | Past | 09Feb2020 | 02May2020 | 336218660 | SIR-F | 9.94 | 0.032099047 | 0.000876828 | 0.046035181 | 0.003606238 | 1440 | 1140 | 277 | 21 | 0.032 | 19.78753647 | 360 | 1 min 1 sec |
| 1st | Past | 03May2020 | 15Jun2020 | 336218660 | SIR-F | 2.13 | 0.021942935 | 0.00080148 | 0.019559868 | 0.008189574 | 1440 | 1247 | 122 | 51 | 0.022 | 0.202848849 | 356 | 1 min 1 sec |
| 2nd | Past | 16Jun2020 | 05Jul2020 | 336218660 | SIR-F | 3.56 | 0.003325108 | 0.000327093 | 0.023472848 | 0.006241644 | 1440 | 3057 | 160 | 42 | 0.003 | 0.029449712 | 356 | 1 min 1 sec |
| 3rd | Past | 06Jul2020 | 17Jul2020 | 336218660 | SIR-F | 4.23 | 0.002637798 | 0.000301307 | 0.030498875 | 0.006882155 | 1440 | 3318 | 145 | 32 | 0.003 | 0.009950549 | 169 | 0 min 35 sec |
| 4th | Past | 18Jul2020 | 29Jul2020 | 336218660 | SIR-F | 2.56 | 0.003573687 | 0.000301057 | 0.025239216 | 0.009510954 | 1440 | 3321 | 105 | 39 | 0.004 | 0.010416159 | 173 | 0 min 35 sec |
| 5th | Past | 30Jul2020 | 11Aug2020 | 336218660 | SIR-F | 2.07 | 0.018398421 | 4.06E-05 | 0.016347777 | 0.007706156 | 1440 | 24651 | 129 | 61 | 0.018 | 0.011220036 | 269 | 0 min 51 sec |
| 6th | Past | 12Aug2020 | 27Aug2020 | 336218660 | SIR-F | 2.11 | 0.002041556 | 0.000191442 | 0.013334141 | 0.006106795 | 1440 | 5223 | 163 | 74 | 0.002 | 0.014749519 | 203 | 0 min 41 sec |
| 7th | Past | 28Aug2020 | 14Sep2020 | 336218660 | SIR-F | 1.74 | 0.000299137 | 0.000230083 | 0.010281454 | 0.005672952 | 1440 | 4346 | 176 | 97 | 0 | 0.00709466 | 360 | 1 min 1 sec |
| 8th | Past | 15Sep2020 | 30Sep2020 | 336218660 | SIR-F | 1.8 | 0.017403944 | 4.06E-05 | 0.010370716 | 0.00561613 | 1440 | 24651 | 178 | 96 | 0.017 | 0.007072636 | 265 | 0 min 51 sec |
| 9th | Past | 01Oct2020 | 15Oct2020 | 336218660 | SIR-F | 2.46 | 0.00180556 | 9.35E-05 | 0.013093239 | 0.005216354 | 1440 | 10700 | 191 | 76 | 0.002 | 0.012906504 | 150 | 0 min 30 sec |
| 10th | Past | 16Oct2020 | 26Oct2020 | 336218660 | SIR-F | 2.58 | 0.003859091 | 5.92E-05 | 0.013516793 | 0.005165612 | 1440 | 16887 | 193 | 73 | 0.004 | 0.004996453 | 228 | 0 min 45 sec |
| 11th | Past | 27Oct2020 | 06Nov2020 | 336218660 | SIR-F | 2.94 | 0.014617741 | 3.17E-05 | 0.017082992 | 0.005688771 | 1440 | 31503 | 175 | 58 | 0.015 | 0.010969537 | 630 | 1 min 1 sec |

lisphilar commented 3 years ago

I found that the number of trials (Trials column) is lower in Spyder/Windows than in Google Colaboratory (1st phase: 356 << 1187). https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4

We have two solutions.

  1. Change the timeout of estimation from Scenario.estimate(cs.SIRF, timeout=60) (default: 60 sec) to Scenario.estimate(cs.SIRF, timeout=120) (120 sec)
  2. Improve covsirphy.simulation.Estimator etc. to cut runtime #291

I have been trying the second solution since Feb2020, but I have not found how to improve it yet.

Inglezos commented 3 years ago

Did you find any solution for the weird simulated results? usa_simulate spn_simulate

For the USA what bothers me is the triangular curve and for Spain the totally wrong deaths.

Any news for the Spyder problems? What can I do to have more normal results?

lisphilar commented 3 years ago

I could not reproduce your figures with Spyder. I'm trying to improve the Estimator class in #270 and #291.

Please try to

Inglezos commented 3 years ago

Oh yes, you are right, the small timeout seems to be the root cause of the weird triangular shape, for some reason. I now always use timeout=120. But why is this needed? You don't use 120 sec, right? How come you can run this correctly with 60 sec? Is this platform dependent? Nevertheless, for the USA I have this weird infected spike at the beginning, which you don't have: usa_simulate

And for Spain, despite the fact that the curves seem normal, the simulated cases error is big: ~20% for confirmed and ~6.4% for deaths (compared to today's cases, 12Nov20). Can you tell me what your simulated confirmed and death cases are for 12Nov20?

lisphilar commented 3 years ago

The required runtime of parameter estimation depends on CPU performance. The maximum runtime is set with the timeout argument. With the Estimator class, CovsirPhy continues estimation while the runtime does not exceed timeout and the max simulated number of cases is not yet in the range (max actual number * 0.98, max actual number * 1.02).
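The stopping rule described above can be sketched roughly as follows. This is a paraphrase of the described behavior, not the actual Estimator code; `run_trial` stands in for one optimization step:

```python
import time

def run_until_accepted(run_trial, actual_max, timeout=120):
    """Sketch of the described stopping rule: keep running optimization
    trials until the max simulated number of cases falls within
    (actual_max * 0.98, actual_max * 1.02), or the timeout expires."""
    start = time.monotonic()
    lower, upper = actual_max * 0.98, actual_max * 1.02
    trials, sim_max = 0, None
    while time.monotonic() - start < timeout:
        sim_max = run_trial()  # one trial; returns max simulated cases
        trials += 1
        if lower < sim_max < upper:
            break  # accuracy criterion met before the timeout
    return sim_max, trials
```

Under this rule, a slower CPU completes fewer trials per second, so it may hit the timeout before reaching the acceptance range, which would explain why timeout=120 helps on some machines while 60 sec is enough on others.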

I changed the default value of timeout from 60 sec to 180 sec with issue #291 and pull request #327.

Please find the results for USA and Spain with Google Colab and version 2.10.0-mu as-of 13Nov2020. https://gist.github.com/lisphilar/6d235d2ca3355e54687ba8cd27815022#file-usa_spain_13nov2020-ipynb

Do you have any ideas to improve Estimator and ChangeFinder.run()? Spikes are raised when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.

Inglezos commented 3 years ago

Yes, fortunately I now get the same results as you, at last! The timeout parameter is very significant. I have some questions about the results for Spain. What does this line in the figure mean? image And most importantly, why are the recovered and death curves not strictly monotonically increasing? Should a check be added during the simulation, that the results (deaths, recovered and confirmed) must monotonically increase? image

Spikes are raised when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.

Is this related to the initial phase further subdivision problem? And to tackle both issues, do we have to improve ChangeFinder.run()?

lisphilar commented 3 years ago

What does this line mean in the figure?

The line means that there are many phases in a narrow range. This is because the number of infected cases increased sharply.

why the recovered and deaths curves are not strictly monotonically increasing?

This is because the simulation is performed in each phase separately and then combined. On the change dates of phases, the values of the start dates of the next phases are used. When the accuracy of estimation is low and the number of cases is over-estimated, spikes occur and the monotonic increase is broken.
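The effect described here can be illustrated with toy numbers (made up for illustration, not real data): each phase is simulated independently, and the next phase restarts from the actual value on the change date, so an over-estimated phase tail produces a downward jump when the pieces are joined:

```python
# Phase 1 over-estimates its tail; phase 2 restarts from the actual value
# recorded on the change date, so the joined series jumps downward.
phase1_sim = [100, 120, 150, 200]   # last value over-estimated
phase2_sim = [160, 175, 190]        # starts from the actual value 160

combined = phase1_sim + phase2_sim
is_monotonic = all(a <= b for a, b in zip(combined, combined[1:]))
print(combined)      # [100, 120, 150, 200, 160, 175, 190]
print(is_monotonic)  # False: the 200 -> 160 drop is the "spike"
```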

For Spain, I think the first four phases are accurately split. Estimator should be improved.

Inglezos commented 3 years ago

So, in order to prevent these spikes and ensure monotonically increasing behavior, what should we do? For the simulated values right before and after a phase change date, could we rework the values and adapt them in such a way that the series is monotonically increasing?

lisphilar commented 3 years ago

I do not have concrete ideas, but reworking the values after estimation could be selected, as you mentioned.
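One cheap post-processing rework of the kind discussed would be clipping each cumulative variable to its running maximum. This is only a sketch of one option, not CovsirPhy's implementation, and it hides rather than fixes the underlying estimation error:

```python
import numpy as np

def enforce_monotonic(series):
    # Clip each cumulative variable (e.g. Confirmed, Fatal, Recovered) to
    # its running maximum so the simulated series never decreases.
    return np.maximum.accumulate(np.asarray(series))

print(enforce_monotonic([100, 120, 150, 200, 160, 175, 190]))
# [100 120 150 200 200 200 200]
```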

lisphilar commented 3 years ago

I updated my idea. New method Scenario.trend_estimate(model, timeout=120, max_iteration=3, **kwargs) performs the following steps.

lisphilar commented 3 years ago

The results with version 2.13.1-alpha are in https://gist.github.com/lisphilar/8f64aca428a3eb065c8c39ef827b3fc5

We move forward to #354 (and new issues) to improve the accuracy of parameter estimation.