lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[Discuss] USA weird death cases predictions depending on random execution trials #272

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary

For the USA, the confirmed cases predictions for 24Oct20 are only a little off, by ~250,000, but the death predictions are totally wrong (actual ~229,000, predicted ~455,000, almost double). This is what I executed the first time:

Codes and outputs:

   usa_scenario = cs.Scenario(jhu_data, pop_data, "USA")
   usa_scenario.records().tail()
   _ = usa_scenario.trend()
   usa_scenario.summary()
   usa_scenario.estimate(cs.SIRF)
   usa_scenario.summary()
   _ = usa_scenario.history_rate()
   usa_scenario.clear()
   usa_scenario.add(days=7)
   usa_pred = usa_scenario.simulate()  # observe usa_pred

usa_weird_death_predictions_values

I found a weird way to overcome this though. If I run the trends and the estimator again, without reinitializing the usa_scenario object:

   _ = usa_scenario.trend()
   usa_scenario.summary()
   usa_scenario.estimate(cs.SIRF)
   usa_scenario.summary()
   _ = usa_scenario.history_rate()
   usa_scenario.clear()
   usa_scenario.add(days=7)
   usa_pred = usa_scenario.simulate()  # observe usa_pred

Then the deaths are very accurate (off only by ~5000).

usa_normal_death_predictions_values

Why is this happening and how can this be solved? Please further investigate.

Also, at one point I noticed a third behavior. On some execution trials the weird death predictions followed the same weird triangular pattern (as in the Greece death predictions, see #271, and the Russia/Spain death predictions, see #273). In that case, even if I rerun the analysis as described above without reinitializing the scenario object, the error still persists. It is very weird and frustrating, and seems to depend randomly on the execution trial.

usa_weird_death_predictions_graph

Environment

lisphilar commented 3 years ago

Please help me by trying Scenario.estimate(cs.SIRF, timeout=120), mentioned in #274, to find the cause.

lisphilar commented 3 years ago

Except for the 0th phase (09Feb2020 - 05May2020), I did not find weird results with COVID-19 Data Hub. We need to focus on the 0th phase. https://github.com/lisphilar/covid19-sir/blob/master/example/scenario_analysis_usa.py

Actual data: usa_records

Simulated data: usa_simulate

Summary:

| Phase | Start | End | Rt | theta | kappa | rho | sigma | tau | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | 9-Feb-20 | 5-May-20 | 5 | 0.00102 | 0.001213 | 0.021878 | 0.003155 | 720 | 20.79586 | 1030 | 1 min 0 sec |
| 1st | 6-May-20 | 14-Jun-20 | 2.1 | 0.032899 | 0.000237 | 0.009478 | 0.004134 | 720 | 0.190943 | 1054 | 1 min 0 sec |
| 2nd | 15-Jun-20 | 5-Jul-20 | 3.62 | 0.001569 | 0.000191 | 0.01195 | 0.003101 | 720 | 0.036643 | 1060 | 1 min 0 sec |
| 3rd | 6-Jul-20 | 18-Jul-20 | 4.39 | 0.009637 | 1.43E-05 | 0.015224 | 0.003423 | 720 | 0.01313 | 232 | 0 min 15 sec |
| 4th | 19-Jul-20 | 29-Jul-20 | 2.43 | 0.001015 | 0.00015 | 0.012798 | 0.005119 | 720 | 0.008411 | 229 | 0 min 15 sec |
| 5th | 30-Jul-20 | 12-Aug-20 | 2.11 | 0.016288 | 3.93E-05 | 0.008762 | 0.004049 | 720 | 0.00849 | 531 | 0 min 35 sec |
| 6th | 13-Aug-20 | 27-Aug-20 | 1.8 | 0.020675 | 1.42E-05 | 0.006329 | 0.003422 | 720 | 0.007471 | 226 | 0 min 15 sec |
| 7th | 28-Aug-20 | 13-Sep-20 | 1.62 | 0.003562 | 7.91E-05 | 0.005084 | 0.003047 | 720 | 0.008261 | 147 | 0 min 10 sec |
| 8th | 14-Sep-20 | 29-Sep-20 | 1.66 | 0.003181 | 9.62E-05 | 0.005095 | 0.002954 | 720 | 0.010162 | 312 | 0 min 20 sec |
| 9th | 30-Sep-20 | 12-Oct-20 | 2.34 | 0.001674 | 5.43E-05 | 0.005576 | 0.002324 | 720 | 0.006276 | 156 | 0 min 10 sec |
| 10th | 13-Oct-20 | 22-Oct-20 | 2.39 | 0.004536 | 9.26E-05 | 0.006864 | 0.002772 | 720 | 0.006384 | 159 | 0 min 10 sec |
| 11th | 23-Oct-20 | 1-Nov-20 | 2.57 | 0.000555 | 0.000111 | 0.007181 | 0.00268 | 720 | 0.007266 | 756 | 0 min 30 sec |

Inglezos commented 3 years ago

That's interesting, because if I run the USA analysis not with that script but with the following lines (which I guess do essentially the same thing, since they are taken from the Kaggle notebook analysis), it shows very weird and wrong results (I ran with CovsirPhy v2.9.1-delta.new.290.fix.282.283.296):

import covsirphy as cs

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

def main():
    cs.get_version()
    get_ipython().run_line_magic('matplotlib', 'inline')
    pd.plotting.register_matplotlib_converters()

    np.random.seed(123)
    os.environ["PYTHONHASHSEED"] = "123"
    # Matplotlib
    plt.style.use("seaborn-ticks")
    plt.rcParams["xtick.direction"] = "in"
    plt.rcParams["ytick.direction"] = "in"
    plt.rcParams["font.size"] = 11.0
    plt.rcParams["figure.figsize"] = (9, 6)
    # Pandas
    pd.set_option("display.max_colwidth", 1000)

    data_loader = cs.DataLoader(directory="kaggle/input")
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    usa_scenario = cs.Scenario(jhu_data, population_data, "USA")
    usa_scenario.records().tail()
    _ = usa_scenario.trend()
    usa_scenario.summary()
    usa_scenario.estimate(cs.SIRF)
    usa_scenario.summary()
    _ = usa_scenario.history_rate()
    usa_scenario.clear()
    usa_scenario.add(days=7)
    usa_scenario.simulate().tail(7).style.background_gradient(axis=0)

if __name__ == "__main__":
    main()

usa_custom_weird_analysis1 usa_custom_weird_analysis2 usa_custom_weird_analysis3 usa_custom_weird_analysis4

What is going on? Is this an Estimator class bug, or am I doing something very wrong? Is there something wrong in the code I am running? Can you run the same code yourself?

If I run your code then I get your normal results, except for one time when I got results similar to mine. Why is this happening? You are saving the results to CSVs and figures locally, while I am printing them inline in the Spyder console with matplotlib, and I have more imports; could any of that be the problem? It would be weird though. Could it indeed be random, happening on random execution trials for some unexplained reason?

For example, even this code fails:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

from pathlib import Path
import warnings
import covsirphy as cs

def main():
    # cs.get_version()
    # get_ipython().run_line_magic('matplotlib', 'inline')
    # pd.plotting.register_matplotlib_converters()

    # # Matplotlib
    # plt.style.use("seaborn-ticks")
    # plt.rcParams["xtick.direction"] = "in"
    # plt.rcParams["ytick.direction"] = "in"
    # plt.rcParams["font.size"] = 11.0
    # plt.rcParams["figure.figsize"] = (9, 6)

    warnings.simplefilter("error")
    # Create output directory in example directory
    code_path = Path(__file__)
    input_dir = code_path.parent.with_name("input")
    output_dir = code_path.with_name("output").joinpath(code_path.stem)
    output_dir.mkdir(exist_ok=True, parents=True)
    # Load datasets
    data_loader = cs.DataLoader(input_dir)
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    # Set country name
    country = "USA"
    abbr = "usa"
    figpath = functools.partial(
        filepath, output_dir=output_dir, country=abbr, ext="jpg")
    # Start scenario analysis
    snl = cs.Scenario(jhu_data, population_data, country)
    # Show records
    record_df = snl.records(filename=figpath("records"))
    save_df(record_df, "records", output_dir, abbr, use_index=False)
    # Show S-R trend
    snl.trend(filename=figpath("trend"))
    print(snl.summary())

    # Parameter estimation
    snl.estimate(cs.SIRF)
    snl.history("Rt", filename=figpath("history_rt_past"))
    # Add future phase to main scenario
    snl.add(name="Main", days=7)
    # Simulation of the number of cases
    sim_df = snl.simulate(name="Main", filename=figpath("simulate"))
    save_df(sim_df, "simulate", output_dir, abbr, use_index=False)
    # Parameter history
    for item in ["Rt", "rho", "sigma", "Infected"]:
        snl.history(item, filename=figpath(f"history_{item.lower()}"))
    # Change rate of parameters in main scenario
    snl.history_rate(name="Main", filename=figpath("history_rate"))
    snl.history_rate(
        params=["kappa", "sigma", "rho"],
        name="Main", filename=figpath("history_rate_without_theta"))
    # Save summary as a CSV file
    summary_df = snl.summary()
    save_df(summary_df, "summary", output_dir, abbr)

def filepath(name, output_dir, country, ext="jpg"):
    """
    Return filepath of a figure.
    Args:
        name (str): name of the figure
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        ext (str, optional): Extension of the output file. Defaults to "jpg".
    Returns:
        pathlib.Path: filepath of the output file
    """
    return output_dir.joinpath(f"{country}_{name}.{ext}")

def save_df(df, name, output_dir, country, use_index=True):
    """
    Save the dataframe as a CSV file.
    Args:
        df (pandas.DataFrame): dataframe
        name (str): name of the dataframe
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        use_index (bool): if True, include index
    """
    df.to_csv(output_dir.joinpath(f"{name}.csv"), index=use_index)

if __name__ == "__main__":
    main()

[UPDATE]: After many tests, I conclude that there might be some problem with the imports I use; somehow they cause this error. I don't know why, and it is very frustrating! What do you think? Can you try to run the above code? One definitive test for you to reproduce it is to add these imports into your scenario_analysis_usa.py code, right before main(), instead of your original four:

from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
# from IPython.display import display, Markdown
import math
import os
from pprint import pprint
# import matplotlib.pyplot as plt
# import matplotlib.cm as cm
# import matplotlib
# from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym

import functools
from pathlib import Path
import warnings
import covsirphy as cs

I did just that to your scenario_analysis_usa.py script and it leads to the same error: usa_simulate

Inglezos commented 3 years ago
lisphilar commented 3 years ago

I tried the scenario_analysis_usa.py code, importing seaborn and sympy, without def main(), 5 times, but I could not reproduce the weird result. Please kindly share the output of Scenario.summary() so I can see the RMSLE scores and runtime.

Inglezos commented 3 years ago

I will post these results too when I get home from work. Are you running this on Windows in an interactive console? I think the root cause is probably in the multiprocessing.

lisphilar commented 3 years ago

I ran this code in

  1. local environment: Windows subsystem for Linux / Ubuntu (8 CPUs)
  2. Google Colaboratory
Inglezos commented 3 years ago

And what plots do you get for USA? Is it possible for you to try it also in Spyder?

lisphilar commented 3 years ago

Please see my codes and outputs with Google Colab (2 CPUs, Python 3.6.9) on https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4

I have not installed or used Spyder, but I will try it tomorrow. I'm using Bash in WSL and VS Code for writing and running code.

Could you try Scenario.estimate(cs.SIRF, n_jobs=1) to stop multiprocessing?

Inglezos commented 3 years ago

Have you tried Spyder? I tried to use n_jobs=1; sometimes it runs smoothly (after a very long time), but usually it has similar results: usa_custom_weird_deaths1 I also ran your single .py file and got this result: usa_custom_weird_deaths2

which is very wrong about the simulated death cases at the end (the quickly rising line):

21-10-20 8401849 224660 4869492 3307697
22-10-20 8474386 225862 4918341 3330183
23-10-20 8547633 227075 4967664 3352894
24-10-20 8621600 228301 5017465 3375834
25-10-20 8696290 229539 5067748 3399003
26-10-20 8704446 225714 5018277 3460455
27-10-20 8785132 249457 5049937 3485738
28-10-20 8866306 273344 5081782 3511180
29-10-20 8947973 297376 5113814 3536783
30-10-20 9030133 321553 5146033 3562547
31-10-20 9112790 345876 5178441 3588473
01-11-20 9195945 370346 5211037 3614562
02-11-20 9279604 394964 5243824 3640816
03-11-20 9363767 419731 5276801 3667235
04-11-20 9448437 444646 5309971 3693820
05-11-20 9533617 469712 5343333 3720572
06-11-20 9619310 494929 5376889 3747492
07-11-20 9705518 520297 5410640 3774581
08-11-20 9792244 545818 5444586 3801840
09-11-20 9879491 571492 5478729 3829270
10-11-20 9967261 597320 5513069 3856872
11-11-20 10055557 623303 5547607 3884647
12-11-20 10144383 649442 5582345 3912596
lisphilar commented 3 years ago

I installed Spyder 4 and Python 3.8 with Anaconda3 on Windows for the first time. Then I installed CovsirPhy version 2.10.0 (the code is the same as 2.9.1-theta) and ran the following code.

import seaborn as sns
import sympy as sym
import covsirphy as cs

cs.__version__

# Load datasets
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu(verbose=False)
population_data = data_loader.population(verbose=False)
# Set country name
country = "United States"
abbr = "usa"
# Start scenario analysis
snl = cs.Scenario(jhu_data, population_data, country)

# Show records
snl.records().tail()

# Show S-R trend
snl.trend().summary()

# Parameter estimation
snl.estimate(cs.SIRF)
snl.summary()

# Add future phase to main scenario
snl.add(name="Main", days=7)
# Simulation of the number of cases
snl.simulate(name="Main").tail()

However, parameter estimation for the phases other than the 11th phase was not completed after > 10 min. image

lisphilar commented 3 years ago

I'm trying a solution from https://stackoverflow.com/questions/48078722/no-multiprocessing-print-outputs-spyder (Run > Configuration per file > Execute in an external system terminal), but the code does not finish.

lisphilar commented 3 years ago

Is it necessary to upload CovsirPhy package to Anaconda?

Inglezos commented 3 years ago

Is it necessary to upload CovsirPhy package to Anaconda?

I don't think so. You can run "pip install covsirphy" inside a conda environment console first and then launch Spyder from there (at least that's what I am doing).

lisphilar commented 3 years ago

I did the pip installation in the conda console and got the error I mentioned above. Are any other settings necessary for multiprocessing in Spyder?

Or, please share the output of Scenario.summary() to discuss runtime and the number of trials.

Inglezos commented 3 years ago

As far as I know, no. Multiprocessing runs automatically, unless specified otherwise. But as I said earlier, it may lead to problematic results inside an interactive Python console, such as Spyder's.
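For context, a minimal sketch (not CovsirPhy code) of why multiprocessing misbehaves in interactive consoles on Windows: the "spawn" start method re-imports the main module in every worker, so pool-creating code that is not behind an `if __name__ == "__main__":` guard, which is effectively the situation in an interactive session, can run recursively, hang, or produce inconsistent results. The standard guard pattern looks like this:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def main():
    # On Windows ("spawn"), each worker re-imports this module; without the
    # __main__ guard below, this Pool would be re-created in every child
    # process, which hangs or raises in consoles such as Spyder's.
    with Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    print(main())  # [1, 4, 9]
```

Comparing a guarded plain-script run against the line-by-line Spyder runs might help isolate whether the console is the variable here.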

However, parameter estimation in phases except for 11th phase was not completed for > 10 min.

How did you determine that? Oh, and I thought you were referring to an error! This displayed message is normal behavior, I guess; only the last phase result is printed to the console! This happens in my executions as well, every time, with multiprocessing. But I don't think this is an error. My result with multiprocessing is:

<SIR-F model: parameter estimation>
Running optimization with 12 CPUs...
11th phase (27Oct2020 - 06Nov2020): finished  630 trials in 1 min  1 sec
Completed optimization. Total: 2 min  9 sec
Scenario.summary() result after having executed the estimator with multiprocessing:

| Phase | Type | Start | End | Population | ODE | Rt | theta | kappa | rho | sigma | tau | 1/alpha2 [day] | 1/gamma [day] | 1/beta [day] | alpha1 [-] | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | Past | 09Feb2020 | 02May2020 | 336218660 | SIR-F | 9.94 | 0.032099047 | 0.000876828 | 0.046035181 | 0.003606238 | 1440 | 1140 | 277 | 21 | 0.032 | 19.78753647 | 360 | 1 min 1 sec |
| 1st | Past | 03May2020 | 15Jun2020 | 336218660 | SIR-F | 2.13 | 0.021942935 | 0.00080148 | 0.019559868 | 0.008189574 | 1440 | 1247 | 122 | 51 | 0.022 | 0.202848849 | 356 | 1 min 1 sec |
| 2nd | Past | 16Jun2020 | 05Jul2020 | 336218660 | SIR-F | 3.56 | 0.003325108 | 0.000327093 | 0.023472848 | 0.006241644 | 1440 | 3057 | 160 | 42 | 0.003 | 0.029449712 | 356 | 1 min 1 sec |
| 3rd | Past | 06Jul2020 | 17Jul2020 | 336218660 | SIR-F | 4.23 | 0.002637798 | 0.000301307 | 0.030498875 | 0.006882155 | 1440 | 3318 | 145 | 32 | 0.003 | 0.009950549 | 169 | 0 min 35 sec |
| 4th | Past | 18Jul2020 | 29Jul2020 | 336218660 | SIR-F | 2.56 | 0.003573687 | 0.000301057 | 0.025239216 | 0.009510954 | 1440 | 3321 | 105 | 39 | 0.004 | 0.010416159 | 173 | 0 min 35 sec |
| 5th | Past | 30Jul2020 | 11Aug2020 | 336218660 | SIR-F | 2.07 | 0.018398421 | 4.06E-05 | 0.016347777 | 0.007706156 | 1440 | 24651 | 129 | 61 | 0.018 | 0.011220036 | 269 | 0 min 51 sec |
| 6th | Past | 12Aug2020 | 27Aug2020 | 336218660 | SIR-F | 2.11 | 0.002041556 | 0.000191442 | 0.013334141 | 0.006106795 | 1440 | 5223 | 163 | 74 | 0.002 | 0.014749519 | 203 | 0 min 41 sec |
| 7th | Past | 28Aug2020 | 14Sep2020 | 336218660 | SIR-F | 1.74 | 0.000299137 | 0.000230083 | 0.010281454 | 0.005672952 | 1440 | 4346 | 176 | 97 | 0 | 0.00709466 | 360 | 1 min 1 sec |
| 8th | Past | 15Sep2020 | 30Sep2020 | 336218660 | SIR-F | 1.8 | 0.017403944 | 4.06E-05 | 0.010370716 | 0.00561613 | 1440 | 24651 | 178 | 96 | 0.017 | 0.007072636 | 265 | 0 min 51 sec |
| 9th | Past | 01Oct2020 | 15Oct2020 | 336218660 | SIR-F | 2.46 | 0.00180556 | 9.35E-05 | 0.013093239 | 0.005216354 | 1440 | 10700 | 191 | 76 | 0.002 | 0.012906504 | 150 | 0 min 30 sec |
| 10th | Past | 16Oct2020 | 26Oct2020 | 336218660 | SIR-F | 2.58 | 0.003859091 | 5.92E-05 | 0.013516793 | 0.005165612 | 1440 | 16887 | 193 | 73 | 0.004 | 0.004996453 | 228 | 0 min 45 sec |
| 11th | Past | 27Oct2020 | 06Nov2020 | 336218660 | SIR-F | 2.94 | 0.014617741 | 3.17E-05 | 0.017082992 | 0.005688771 | 1440 | 31503 | 175 | 58 | 0.015 | 0.010969537 | 630 | 1 min 1 sec |

lisphilar commented 3 years ago

I found that the number of trials (Trials column) is lower in Spyder/Windows than in Google Colaboratory (1st phase: 356 << 1187). https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4

We have two solutions.

  1. Change the timeout of estimation from Scenario.estimate(cs.SIRF, timeout=60) (default: 60 sec) to Scenario.estimate(cs.SIRF, timeout=120) (120 sec)
  2. Improve covsirphy.simulation.Estimator etc. to cut runtime #291

I have been trying the second solution since Feb2020, but I have not found how to improve it yet.

Inglezos commented 3 years ago

Did you find any solution for the weird simulated results? usa_simulate spn_simulate

For the USA what bothers me is the triangular curve and for Spain the totally wrong deaths.

Any news for the Spyder problems? What can I do to have more normal results?

lisphilar commented 3 years ago

I could not reproduce your figures with Spyder. I'm trying to improve the Estimator class in #270 and #291.

Please try to

Inglezos commented 3 years ago

Oh yes, you are right, the small timeout seems to be the root cause of the weird triangular shape, for some reason. I now always use timeout=120. But why is this needed? You don't use 120 sec, right? How come you can run this correctly with 60 sec? Is this platform dependent? Nevertheless, for the USA I have this weird infected spike at the beginning, which you don't have: usa_simulate

And for Spain, despite the fact that the curves seem normal, the simulated cases error is big: ~20% for confirmed and ~6.4% for deaths (compared to today's cases, 12Nov20). Can you tell me what your simulated confirmed and death cases are for 12Nov20?

lisphilar commented 3 years ago

The required runtime of parameter estimation depends on CPU performance. The maximum runtime is set with the timeout argument. With the Estimator class, CovsirPhy continues estimation while the runtime does not exceed timeout and the max simulated number of cases is not yet in the range (max actual number * 0.98, max actual number * 1.02).
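The stopping rule described above can be sketched roughly as follows. This is a paraphrase of the described behavior, not the actual Estimator code; `run_trial` stands in for one optimization step:

```python
import time

def run_until_accepted(run_trial, actual_max, timeout=120):
    """Sketch of the described stopping rule: keep running optimization
    trials until the max simulated number of cases falls within
    (actual_max * 0.98, actual_max * 1.02), or the timeout expires."""
    start = time.monotonic()
    lower, upper = actual_max * 0.98, actual_max * 1.02
    trials, sim_max = 0, None
    while time.monotonic() - start < timeout:
        sim_max = run_trial()  # one trial; returns max simulated cases
        trials += 1
        if lower < sim_max < upper:
            break  # accuracy criterion met before the timeout
    return sim_max, trials
```

Under this rule, a slower CPU completes fewer trials per second, so it may hit the timeout before reaching the acceptance range, which would explain why timeout=120 helps on some machines while 60 sec is enough on others.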

I changed the default value of timeout from 60 sec to 180 sec with issue #291 and pull request #327.

Please find the results for USA and Spain with Google Colab and version 2.10.0-mu as-of 13Nov2020. https://gist.github.com/lisphilar/6d235d2ca3355e54687ba8cd27815022#file-usa_spain_13nov2020-ipynb

Do you have any ideas to improve Estimator and ChangeFinder.run()? Spikes are raised when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.

Inglezos commented 3 years ago

Yes, fortunately I now get the same results as you, at last! The timeout parameter is very significant. I have some questions about the results for Spain. What does this line in the figure mean? image And most importantly, why are the recovered and death curves not strictly monotonically increasing? Should a check be added during the simulation, that the results (deaths, recovered and confirmed) must monotonically increase? image

Spikes are raised when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.

Is this related to the initial phase further subdivision problem? And to tackle both issues, do we have to improve ChangeFinder.run()?

lisphilar commented 3 years ago

What does this line mean in the figure?

The line means that there are many phases in a narrow range. This is because the number of infected cases increased sharply.

why the recovered and deaths curves are not strictly monotonically increasing?

This is because the simulation is performed in each phase separately and then combined. On the change dates of phases, the values of the start dates of the next phases are used. When the accuracy of estimation is low and the number of cases is over-estimated, spikes occur and the monotonic increase is broken.
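The effect described here can be illustrated with toy numbers (made up for illustration, not real data): each phase is simulated independently, and the next phase restarts from the actual value on the change date, so an over-estimated phase tail produces a downward jump when the pieces are joined:

```python
# Phase 1 over-estimates its tail; phase 2 restarts from the actual value
# recorded on the change date, so the joined series jumps downward.
phase1_sim = [100, 120, 150, 200]   # last value over-estimated
phase2_sim = [160, 175, 190]        # starts from the actual value 160

combined = phase1_sim + phase2_sim
is_monotonic = all(a <= b for a, b in zip(combined, combined[1:]))
print(combined)      # [100, 120, 150, 200, 160, 175, 190]
print(is_monotonic)  # False: the 200 -> 160 drop is the "spike"
```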

For Spain, I think the first four phases are accurately split. Estimator should be improved.

Inglezos commented 3 years ago

So, in order to prevent these spikes and ensure monotonically increasing behavior, what should we do? For the simulated values right before and after a phase change date, could we rework the values and adapt them in such a way that the series is monotonically increasing?

lisphilar commented 3 years ago

I do not have concrete ideas, but reworking the values after estimation could be selected, as you mentioned.
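One cheap post-processing rework of the kind discussed would be clipping each cumulative variable to its running maximum. This is only a sketch of one option, not CovsirPhy's implementation, and it hides rather than fixes the underlying estimation error:

```python
import numpy as np

def enforce_monotonic(series):
    # Clip each cumulative variable (e.g. Confirmed, Fatal, Recovered) to
    # its running maximum so the simulated series never decreases.
    return np.maximum.accumulate(np.asarray(series))

print(enforce_monotonic([100, 120, 150, 200, 160, 175, 190]))
# [100 120 150 200 200 200 200]
```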

lisphilar commented 3 years ago

I updated my idea. New method Scenario.trend_estimate(model, timeout=120, max_iteration=3, **kwargs) performs the following steps.

lisphilar commented 3 years ago

The results with version 2.13.1-alpha are in https://gist.github.com/lisphilar/8f64aca428a3eb065c8c39ef827b3fc5

We move forward to #354 (and new issues) to improve the accuracy of parameter estimation.