Closed. Inglezos closed this issue 3 years ago.
Please help me by trying `Scenario.estimate(cs.SIRF, timeout=120)`, mentioned in #274, to find the cause.
Except for the 0th phase (09Feb2020 - 05May2020), I did not find weird results with COVID-19 Data Hub. We need to focus on the 0th phase. https://github.com/lisphilar/covid19-sir/blob/master/example/scenario_analysis_usa.py
Actual data:
Simulated data:
Summary:
| Phase | Start | End | Rt | theta | kappa | rho | sigma | tau | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | 9-Feb-20 | 5-May-20 | 5 | 0.00102 | 0.001213 | 0.021878 | 0.003155 | 720 | 20.79586 | 1030 | 1 min 0 sec |
| 1st | 6-May-20 | 14-Jun-20 | 2.1 | 0.032899 | 0.000237 | 0.009478 | 0.004134 | 720 | 0.190943 | 1054 | 1 min 0 sec |
| 2nd | 15-Jun-20 | 5-Jul-20 | 3.62 | 0.001569 | 0.000191 | 0.01195 | 0.003101 | 720 | 0.036643 | 1060 | 1 min 0 sec |
| 3rd | 6-Jul-20 | 18-Jul-20 | 4.39 | 0.009637 | 1.43E-05 | 0.015224 | 0.003423 | 720 | 0.01313 | 232 | 0 min 15 sec |
| 4th | 19-Jul-20 | 29-Jul-20 | 2.43 | 0.001015 | 0.00015 | 0.012798 | 0.005119 | 720 | 0.008411 | 229 | 0 min 15 sec |
| 5th | 30-Jul-20 | 12-Aug-20 | 2.11 | 0.016288 | 3.93E-05 | 0.008762 | 0.004049 | 720 | 0.00849 | 531 | 0 min 35 sec |
| 6th | 13-Aug-20 | 27-Aug-20 | 1.8 | 0.020675 | 1.42E-05 | 0.006329 | 0.003422 | 720 | 0.007471 | 226 | 0 min 15 sec |
| 7th | 28-Aug-20 | 13-Sep-20 | 1.62 | 0.003562 | 7.91E-05 | 0.005084 | 0.003047 | 720 | 0.008261 | 147 | 0 min 10 sec |
| 8th | 14-Sep-20 | 29-Sep-20 | 1.66 | 0.003181 | 9.62E-05 | 0.005095 | 0.002954 | 720 | 0.010162 | 312 | 0 min 20 sec |
| 9th | 30-Sep-20 | 12-Oct-20 | 2.34 | 0.001674 | 5.43E-05 | 0.005576 | 0.002324 | 720 | 0.006276 | 156 | 0 min 10 sec |
| 10th | 13-Oct-20 | 22-Oct-20 | 2.39 | 0.004536 | 9.26E-05 | 0.006864 | 0.002772 | 720 | 0.006384 | 159 | 0 min 10 sec |
| 11th | 23-Oct-20 | 1-Nov-20 | 2.57 | 0.000555 | 0.000111 | 0.007181 | 0.00268 | 720 | 0.007266 | 756 | 0 min 30 sec |
That's interesting, because if I run the USA analysis, not with that script, but using the following lines (which do essentially the same thing, I guess, since they are taken from the Kaggle notebook analysis), it shows very weird and wrong results (I run with CovsirPhy version v2.9.1-delta.new.290.fix.282.283.296):
```python
import covsirphy as cs
from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym


def main():
    cs.get_version()
    get_ipython().run_line_magic('matplotlib', 'inline')
    pd.plotting.register_matplotlib_converters()
    np.random.seed(123)
    os.environ["PYTHONHASHSEED"] = "123"
    # Matplotlib
    plt.style.use("seaborn-ticks")
    plt.rcParams["xtick.direction"] = "in"
    plt.rcParams["ytick.direction"] = "in"
    plt.rcParams["font.size"] = 11.0
    plt.rcParams["figure.figsize"] = (9, 6)
    # Pandas
    pd.set_option("display.max_colwidth", 1000)
    data_loader = cs.DataLoader(directory="kaggle/input")
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    usa_scenario = cs.Scenario(jhu_data, population_data, "USA")
    usa_scenario.records().tail()
    _ = usa_scenario.trend()
    usa_scenario.summary()
    usa_scenario.estimate(cs.SIRF)
    usa_scenario.summary()
    _ = usa_scenario.history_rate()
    usa_scenario.clear()
    usa_scenario.add(days=7)
    usa_scenario.simulate().tail(7).style.background_gradient(axis=0)


if __name__ == "__main__":
    main()
```
What is going on? Is this an `Estimator` class bug? Or am I doing something very wrong? Is there something wrong in the code I am trying? Can you run the same code yourself?
If I run your code I get your normal results, except for one time when I got results similar to mine. Why is this happening? You save the results in CSVs and figures locally, while I print them inline in the Spyder console with matplotlib, and I have more imports. Could any of that be the problem? It would be weird, though. Could it indeed happen randomly on some execution trials for some unexplained reason?
For example, even this code fails:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import functools
from IPython.display import display, Markdown
import math
import os
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym
from pathlib import Path
import warnings
import covsirphy as cs


def main():
    # cs.get_version()
    # get_ipython().run_line_magic('matplotlib', 'inline')
    # pd.plotting.register_matplotlib_converters()
    # # Matplotlib
    # plt.style.use("seaborn-ticks")
    # plt.rcParams["xtick.direction"] = "in"
    # plt.rcParams["ytick.direction"] = "in"
    # plt.rcParams["font.size"] = 11.0
    # plt.rcParams["figure.figsize"] = (9, 6)
    warnings.simplefilter("error")
    # Create output directory in example directory
    code_path = Path(__file__)
    input_dir = code_path.parent.with_name("input")
    output_dir = code_path.with_name("output").joinpath(code_path.stem)
    output_dir.mkdir(exist_ok=True, parents=True)
    # Load datasets
    data_loader = cs.DataLoader(input_dir)
    jhu_data = data_loader.jhu()
    population_data = data_loader.population()
    # Set country name
    country = "USA"
    abbr = "usa"
    figpath = functools.partial(
        filepath, output_dir=output_dir, country=abbr, ext="jpg")
    # Start scenario analysis
    snl = cs.Scenario(jhu_data, population_data, country)
    # Show records
    record_df = snl.records(filename=figpath("records"))
    save_df(record_df, "records", output_dir, abbr, use_index=False)
    # Show S-R trend
    snl.trend(filename=figpath("trend"))
    print(snl.summary())
    # Parameter estimation
    snl.estimate(cs.SIRF)
    snl.history("Rt", filename=figpath("history_rt_past"))
    # Add future phase to main scenario
    snl.add(name="Main", days=7)
    # Simulation of the number of cases
    sim_df = snl.simulate(name="Main", filename=figpath("simulate"))
    save_df(sim_df, "simulate", output_dir, abbr, use_index=False)
    # Parameter history
    for item in ["Rt", "rho", "sigma", "Infected"]:
        snl.history(item, filename=figpath(f"history_{item.lower()}"))
    # Change rate of parameters in main scenario
    snl.history_rate(name="Main", filename=figpath("history_rate"))
    snl.history_rate(
        params=["kappa", "sigma", "rho"],
        name="Main", filename=figpath("history_rate_without_theta"))
    # Save summary as a CSV file
    summary_df = snl.summary()
    save_df(summary_df, "summary", output_dir, abbr)


def filepath(name, output_dir, country, ext="jpg"):
    """
    Return filepath of a figure.

    Args:
        name (str): name of the figure
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        ext (str, optional): Extension of the output file. Defaults to "jpg".

    Returns:
        pathlib.Path: filepath of the output file
    """
    return output_dir.joinpath(f"{country}_{name}.{ext}")


def save_df(df, name, output_dir, country, use_index=True):
    """
    Save the dataframe as a CSV file.

    Args:
        df (pandas.DataFrame): dataframe
        name (str): name of the dataframe
        output_dir (pathlib.Path): path of the directory to save the figure
        country (str): country name or abbr
        use_index (bool): if True, include index
    """
    df.to_csv(output_dir.joinpath(f"{name}.csv"), index=use_index)


if __name__ == "__main__":
    main()
```
[UPDATE]: After many tests, I conclude that there might be some problem with the imports I use; somehow they cause this error. I don't know why, and it is very frustrating! What do you think? Can you try to run the above code?
One definitive test for you to reproduce it is to add to your scenario_analysis_usa.py code, right before main(), these imports instead of your only 4:
```python
from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
# from IPython.display import display, Markdown
import math
import os
from pprint import pprint
# import matplotlib.pyplot as plt
# import matplotlib.cm as cm
# import matplotlib
# from matplotlib.ticker import ScalarFormatter
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import scipy as sci
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import sympy as sym
import functools
from pathlib import Path
import warnings
import covsirphy as cs
```
I did just that to your scenario_analysis_usa.py script and it leads to the same error:

```python
import seaborn as sns
import sympy as sym
```

When both of these, and only these, are imported, some kind of conflict seems to be created. Please investigate further!
Nevertheless, even that way, the simulated cases are very wrong, especially the deaths. Am I doing something wrong? Why does the notebook's code, which I am running, not seem to work properly?
I am using Spyder on Windows. I noticed different results when (i) I run the code line-by-line as a selection and (ii) when I wrap all my code in a `def main():` and then call it with `if __name__ == "__main__": main()`.
Could `MPEstimator`, which uses multi-core processing, cause a conflict on Windows between parallel workers? Is multiprocessing problematic in an interactive interpreter on Windows?
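For context: on Windows, `multiprocessing` starts workers with the "spawn" method, which re-imports the main module in every worker process. Without the `if __name__ == "__main__":` guard this can raise errors, hang, or duplicate work, and interactive consoles (such as Spyder's IPython console) make it worse because the "main module" may not be importable at all. A minimal, generic sketch of the required pattern (not CovsirPhy's actual internals):

```python
import multiprocessing as mp


def square(x):
    """Worker function; must be defined at module top level so that
    spawned worker processes can import it on Windows."""
    return x * x


def run_pool():
    # Creating the pool must only happen under the __main__ guard on
    # Windows, because "spawn" re-imports this module in each worker.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3, 4])


if __name__ == "__main__":
    # Without this guard, spawning workers on Windows (or from some
    # interactive consoles) can fail or behave inconsistently.
    print(run_pool())  # [1, 4, 9, 16]
```

Running such code as a line-by-line selection in an interactive console bypasses the guard entirely, which could explain why (i) and (ii) behave differently.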
I tried the scenario_analysis_usa.py code, importing seaborn and sympy and without `def main()`, 5 times, but I could not reproduce the weird result.
Please kindly share the output of `Scenario.summary()` so we can see the RMSLE scores and runtime.
I will post these results too when I get home from work. Are you running this on Windows and in an interactive console? I think the root cause is probably in the multiprocessing.
I ran this code in
And what plots do you get for USA? Is it possible for you to try it also in Spyder?
Please see my codes and outputs with Google Colab (2 CPUs, Python 3.6.9) on https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4
I have not installed and used Spyder, but I will try it tomorrow. I'm using Bash in WSL and VS code for coding and running codes.
Could you try Scenario.estimate(cs.SIRF, n_jobs=1) to stop multiprocessing?
Have you tried Spyder? I tried using `n_jobs=1`; sometimes it runs smoothly (after a very long time), but usually it has similar results:
I also ran your single .py file and had this result:
which is very wrong for the simulated death cases at the end (the quickly rising line):
| Date | Confirmed | Fatal | Infected | Recovered |
|---|---|---|---|---|
| 21-10-20 | 8401849 | 224660 | 4869492 | 3307697 |
| 22-10-20 | 8474386 | 225862 | 4918341 | 3330183 |
| 23-10-20 | 8547633 | 227075 | 4967664 | 3352894 |
| 24-10-20 | 8621600 | 228301 | 5017465 | 3375834 |
| 25-10-20 | 8696290 | 229539 | 5067748 | 3399003 |
| 26-10-20 | 8704446 | 225714 | 5018277 | 3460455 |
| 27-10-20 | 8785132 | 249457 | 5049937 | 3485738 |
| 28-10-20 | 8866306 | 273344 | 5081782 | 3511180 |
| 29-10-20 | 8947973 | 297376 | 5113814 | 3536783 |
| 30-10-20 | 9030133 | 321553 | 5146033 | 3562547 |
| 31-10-20 | 9112790 | 345876 | 5178441 | 3588473 |
| 01-11-20 | 9195945 | 370346 | 5211037 | 3614562 |
| 02-11-20 | 9279604 | 394964 | 5243824 | 3640816 |
| 03-11-20 | 9363767 | 419731 | 5276801 | 3667235 |
| 04-11-20 | 9448437 | 444646 | 5309971 | 3693820 |
| 05-11-20 | 9533617 | 469712 | 5343333 | 3720572 |
| 06-11-20 | 9619310 | 494929 | 5376889 | 3747492 |
| 07-11-20 | 9705518 | 520297 | 5410640 | 3774581 |
| 08-11-20 | 9792244 | 545818 | 5444586 | 3801840 |
| 09-11-20 | 9879491 | 571492 | 5478729 | 3829270 |
| 10-11-20 | 9967261 | 597320 | 5513069 | 3856872 |
| 11-11-20 | 10055557 | 623303 | 5547607 | 3884647 |
| 12-11-20 | 10144383 | 649442 | 5582345 | 3912596 |
I installed Spyder 4 and Python 3.8 with Anaconda3 on Windows for the first time. Then I installed CovsirPhy version 2.10.0 (the code is the same as 2.9.1-theta) and ran the following:
```python
import seaborn as sns
import sympy as sym
import covsirphy as cs
cs.__version__
# Load datasets
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu(verbose=False)
population_data = data_loader.population(verbose=False)
# Set country name
country = "United States"
abbr = "usa"
# Start scenario analysis
snl = cs.Scenario(jhu_data, population_data, country)
# Show records
snl.records().tail()
# Show S-R trend
snl.trend().summary()
# Parameter estimation
snl.estimate(cs.SIRF)
snl.summary()
# Add future phase to main scenario
snl.add(name="Main", days=7)
# Simulation of the number of cases
snl.simulate(name="Main").tail()
```
However, parameter estimation in all phases except the 11th phase was not completed within 10 min.
I'm trying the solution from https://stackoverflow.com/questions/48078722/no-multiprocessing-print-outputs-spyder (Run > Configuration per file > Execute in an external system terminal), but the code does not finish.
Is it necessary to upload the CovsirPhy package to Anaconda?

> Is it necessary to upload the CovsirPhy package to Anaconda?

I don't think so; you can run `pip install covsirphy` in a conda environment console first and then launch Spyder from there (at least that's what I am doing).
I did the pip installation in the conda console and got the error I mentioned above. Are other settings necessary for multiprocessing in Spyder?
Or, please share the output of `Scenario.summary()` to discuss runtime and the number of trials.
As far as I know, no. The multiprocessing runs automatically unless specified otherwise. But as I said earlier, it may lead to problematic results inside an interactive Python console, such as Spyder's.

> However, parameter estimation in all phases except the 11th phase was not completed within 10 min.

How did you determine that? Oh, I thought you were referring to an error! This displayed message is normal behavior, I guess; only the last phase's result is printed to the console. This happens in my executions as well, all the time, with multiprocessing. But I don't think it is an error. My result with multiprocessing is:

```
<SIR-F model: parameter estimation>
Running optimization with 12 CPUs...
11th phase (27Oct2020 - 06Nov2020): finished 630 trials in 1 min 1 sec
Completed optimization. Total: 2 min 9 sec
```
`Scenario.summary()` result after having executed the estimator with multiprocessing:

| Phase | Type | Start | End | Population | ODE | Rt | theta | kappa | rho | sigma | tau | 1/alpha2 [day] | 1/gamma [day] | 1/beta [day] | alpha1 [-] | RMSLE | Trials | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0th | Past | 09Feb2020 | 02May2020 | 336218660 | SIR-F | 9.94 | 0.032099047 | 0.000876828 | 0.046035181 | 0.003606238 | 1440 | 1140 | 277 | 21 | 0.032 | 19.78753647 | 360 | 1 min 1 sec |
| 1st | Past | 03May2020 | 15Jun2020 | 336218660 | SIR-F | 2.13 | 0.021942935 | 0.00080148 | 0.019559868 | 0.008189574 | 1440 | 1247 | 122 | 51 | 0.022 | 0.202848849 | 356 | 1 min 1 sec |
| 2nd | Past | 16Jun2020 | 05Jul2020 | 336218660 | SIR-F | 3.56 | 0.003325108 | 0.000327093 | 0.023472848 | 0.006241644 | 1440 | 3057 | 160 | 42 | 0.003 | 0.029449712 | 356 | 1 min 1 sec |
| 3rd | Past | 06Jul2020 | 17Jul2020 | 336218660 | SIR-F | 4.23 | 0.002637798 | 0.000301307 | 0.030498875 | 0.006882155 | 1440 | 3318 | 145 | 32 | 0.003 | 0.009950549 | 169 | 0 min 35 sec |
| 4th | Past | 18Jul2020 | 29Jul2020 | 336218660 | SIR-F | 2.56 | 0.003573687 | 0.000301057 | 0.025239216 | 0.009510954 | 1440 | 3321 | 105 | 39 | 0.004 | 0.010416159 | 173 | 0 min 35 sec |
| 5th | Past | 30Jul2020 | 11Aug2020 | 336218660 | SIR-F | 2.07 | 0.018398421 | 4.06E-05 | 0.016347777 | 0.007706156 | 1440 | 24651 | 129 | 61 | 0.018 | 0.011220036 | 269 | 0 min 51 sec |
| 6th | Past | 12Aug2020 | 27Aug2020 | 336218660 | SIR-F | 2.11 | 0.002041556 | 0.000191442 | 0.013334141 | 0.006106795 | 1440 | 5223 | 163 | 74 | 0.002 | 0.014749519 | 203 | 0 min 41 sec |
| 7th | Past | 28Aug2020 | 14Sep2020 | 336218660 | SIR-F | 1.74 | 0.000299137 | 0.000230083 | 0.010281454 | 0.005672952 | 1440 | 4346 | 176 | 97 | 0 | 0.00709466 | 360 | 1 min 1 sec |
| 8th | Past | 15Sep2020 | 30Sep2020 | 336218660 | SIR-F | 1.8 | 0.017403944 | 4.06E-05 | 0.010370716 | 0.00561613 | 1440 | 24651 | 178 | 96 | 0.017 | 0.007072636 | 265 | 0 min 51 sec |
| 9th | Past | 01Oct2020 | 15Oct2020 | 336218660 | SIR-F | 2.46 | 0.00180556 | 9.35E-05 | 0.013093239 | 0.005216354 | 1440 | 10700 | 191 | 76 | 0.002 | 0.012906504 | 150 | 0 min 30 sec |
| 10th | Past | 16Oct2020 | 26Oct2020 | 336218660 | SIR-F | 2.58 | 0.003859091 | 5.92E-05 | 0.013516793 | 0.005165612 | 1440 | 16887 | 193 | 73 | 0.004 | 0.004996453 | 228 | 0 min 45 sec |
| 11th | Past | 27Oct2020 | 06Nov2020 | 336218660 | SIR-F | 2.94 | 0.014617741 | 3.17E-05 | 0.017082992 | 0.005688771 | 1440 | 31503 | 175 | 58 | 0.015 | 0.010969537 | 630 | 1 min 1 sec |
I found that the number of trials (`Trials` column) is lower in Spyder/Windows than in Google Colaboratory (1st phase: 356 << 1187).
https://gist.github.com/lisphilar/4d9dbc5ea39ef808e2139e2fd4b655f4
We have two solutions:

1. Change `Scenario.estimate(cs.SIRF, timeout=60)` (the default, 60 sec) to `Scenario.estimate(cs.SIRF, timeout=120)` (120 sec)
2. Improve `covsirphy.simulation.Estimator` etc. to cut runtime (#291)

I'm trying the second solution (for the phase from Feb2020), but I could not find how to improve it.
Did you find any solution for the weird simulated results? For the USA, what bothers me is the triangular curve, and for Spain, the totally wrong deaths.
Any news on the Spyder problems? What can I do to get more normal results?
I could not reproduce your figures with Spyder.
I'm trying to improve the `Estimator` class in #270 and #291.
Please try the `timeout` argument as I mentioned in the previous comment.

Oh yes, you are right: the small timeout seems to be the root cause of the weird triangular shape, for some reason. I now always use `timeout=120`. But why is this needed? I think you don't use 120 sec, right? How come you can run this correctly with 60 sec? Is this platform dependent?
Nevertheless, for the USA I have this weird infected spike at the beginning, which you don't have:
And for Spain, despite the fact that the curves seem normal, the simulated-cases error is big: ~20% for confirmed and ~6.4% for deaths (compared to today's cases, 12Nov20). Can you tell me your simulated confirmed and death cases for 12Nov20?
The required runtime of parameter estimation depends on CPU performance. The maximum runtime is set by the `timeout` argument. With the `Estimator` class, CovsirPhy continues estimation while the runtime does not exceed `timeout` and the maximum simulated number of cases is not in the range `(max actual number * 0.98, max actual number * 1.02)`.
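As I understand it, the loop is roughly: keep running optimization trials until either the runtime exceeds `timeout` or the simulated peak lands within 2% of the actual peak. A hypothetical sketch of that stopping rule (the function names here are illustrative, not CovsirPhy's API):

```python
import time


def within_range(sim_max, actual_max, rel_tol=0.02):
    """Accept the fit when the simulated peak is strictly inside
    (actual_max * (1 - rel_tol), actual_max * (1 + rel_tol))."""
    return actual_max * (1 - rel_tol) < sim_max < actual_max * (1 + rel_tol)


def estimate(run_trials, simulated_max, actual_max, timeout=120):
    """Keep optimizing until the accuracy criterion is met or timeout is hit.

    run_trials: callable performing one batch of optimization trials
    simulated_max: callable returning the current simulated peak
    actual_max: the actual maximum number of cases
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        run_trials()
        if within_range(simulated_max(), actual_max):
            break  # accuracy criterion met before the timeout
```

Under this rule, a slow CPU runs fewer trials before `timeout`, so on a slow machine the loop can stop with the accuracy criterion unmet, which would explain the platform-dependent triangular results.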
I changed the default value of `timeout` from 60 sec to 180 sec with issue #291 and pull request #327.
Please find the results for USA and Spain with Google Colab and version 2.10.0-mu as-of 13Nov2020. https://gist.github.com/lisphilar/6d235d2ca3355e54687ba8cd27815022#file-usa_spain_13nov2020-ipynb
Do you have any ideas to improve `Estimator` and `ChangeFinder.run()`?
Spikes arise when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.
Yes, fortunately I now get the same results as you, at last! The timeout parameter is very significant. I have some questions about the results for Spain. What does this line in the figure mean? And most importantly, why are the recovered and deaths curves not strictly monotonically increasing? Should a check be added during the simulation that the results (deaths, recovered and confirmed) must monotonically increase?
> Spikes arise when the accuracy of estimation is low and the number of cases is over-estimated. This is normal behavior, because the initial values of past phases are fixed to the initial values of the actual data of those phases.

Is this related to the initial-phase further-subdivision problem? And to tackle both issues, do we have to improve `ChangeFinder.run()`?
> What does this line in the figure mean?

The line means that there are many phases in a narrow range. This is because the number of infected cases increased sharply.
> Why are the recovered and deaths curves not strictly monotonically increasing?

This is because the simulation is performed in each phase and the results are combined. For the change dates of phases, the values at the start dates of the next phases are used. When the accuracy of estimation is low and the number of cases is over-estimated, spikes occur and the monotonic increase is broken.
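A toy illustration of this mechanism (not CovsirPhy code): each phase is simulated independently, and at a change date the value from the start of the next phase is used. If the earlier phase over-estimates, the combined curve dips at the boundary.

```python
# Each list is one phase's simulated cumulative cases, one value per day.
phase1 = [100, 220, 260]   # over-estimated simulation of the first phase
phase2 = [200, 240, 290]   # next phase restarts from the actual start value

# At the change date, the start value of the next phase replaces the
# last simulated value of the previous phase.
combined = phase1[:-1] + phase2
assert combined == [100, 220, 200, 240, 290]

# 220 -> 200: the monotonic increase is broken at the phase boundary,
# because phase1 overshot the actual value that phase2 restarts from.
assert any(b < a for a, b in zip(combined, combined[1:]))
```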
For Spain, I think the first four phases are accurately split. `Estimator` should be improved.
So, in order to overcome these spike occurrences and ensure monotonically increasing behavior, what should we do? For the simulated cases right before and after a phase change date, could we rework the values and adapt them so that the series is monotonically increasing?
I do not have concrete ideas, but a rework after estimation could be selected, as you mentioned.
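One possible rework along those lines (just an idea, not an implemented CovsirPhy feature): after combining the per-phase simulations, force cumulative columns such as Confirmed, Fatal and Recovered to be non-decreasing by replacing each dip with the running maximum, e.g. with pandas' `cummax`:

```python
import pandas as pd


def enforce_monotonic(sim_df, columns=("Confirmed", "Fatal", "Recovered")):
    """Replace dips at phase boundaries with the running maximum so that
    cumulative columns never decrease. Column names are illustrative."""
    df = sim_df.copy()
    for col in columns:
        df[col] = df[col].cummax()
    return df


sim = pd.DataFrame({
    "Confirmed": [100, 220, 200, 240, 290],  # dip at a phase boundary
    "Fatal": [10, 12, 11, 13, 15],
    "Recovered": [50, 60, 58, 70, 80],
})
fixed = enforce_monotonic(sim)
# fixed["Confirmed"] -> 100, 220, 220, 240, 290 (the dip is flattened)
```

Note that this only hides the symptom; the flattened values no longer sum consistently with the model's compartments, so improving the estimation accuracy itself would still be the real fix.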
I updated my idea. A new method `Scenario.trend_estimate(model, timeout=120, max_iteration=3, **kwargs)` performs the following steps.
timeout
The results with version 2.13.1-alpha are at https://gist.github.com/lisphilar/8f64aca428a3eb065c8c39ef827b3fc5
We will move forward to #354 (and new issues) to improve the accuracy of parameter estimation.
Summary
For the USA, the confirmed-cases predictions for 24Oct20 are a little off, by ~250,000, but the deaths predictions are totally wrong (actual ~229,000, predicted ~455,000, almost doubled). The first time I executed:
Codes and outputs:
I found a weird way to overcome this, though. I run the trends and the estimator again without reinitializing the usa_scenario object:

```python
_ = usa_scenario.trend()
usa_scenario.summary()
usa_scenario.estimate(cs.SIRF)
usa_scenario.summary()
_ = usa_scenario.history_rate()
usa_scenario.clear()
usa_scenario.add(days=7)
usa_pred = usa_scenario.simulate()  # observe usa_pred
```
Then the deaths are very accurate (off by only ~5,000).
Why is this happening, and how can it be solved? Please investigate further.
Also, one time I noticed a third behavior: on some execution trials the weird death predictions followed the same weird triangular pattern (as in the Greece death predictions, see #271, and the Russia/Spain death predictions, see #273). In that case, if I rerun the analysis as described above without reinitializing the scenario object, the error still persists. It is very weird and frustrating, and seems to occur randomly depending on the execution trial.
Environment