Scenario.trend() phase visualization problems and runtime error

Inglezos commented 3 years ago

Summary

Scenario.trends() is not working properly for India, USA, Greece and Brazil. It doesn't show colored sections for phases in graphs:
For Germany, upon trends() call, it throws error RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 600. and the trends() functionality doesn't work at all.
For all the other countries I tested, trends() works normally as before.

Codes and outputs:

country_scenario = cs.Scenario(jhu_data, population_data, "country") countryscenario.records().tail() = country_scenario.trend()

Environment

CovsirPhy version: 2.9.1
Python version; 3.8
Installation: Anaconda, Python, pip
OS: Windows

lisphilar commented 3 years ago

I fixed the runtime error with #285 with https://stackoverflow.com/questions/15831763/scipy-curvefit-runtimeerroroptimal-parameters-not-found-number-of-calls-to-fun

Please confirm it with development version 2.9.1-alpha.fix.282

pip install --upgrade "git+https://github.com/lisphilar/covid19-sir.git#egg=covsirphy"

Scenario.trends() is not working properly for India, USA, Greece and Brazil. It doesn't show colored sections for phases in graphs

It's in the specifications #121 Please change Scenario.trend(max_rmsle=20) to Scenario.trend(max_rmsle=100) etc.

However, we may have a room to improve it. Could you help with me to find the root cause? Scenario.trend() depends on ChangeFinder class. https://github.com/lisphilar/covid19-sir/blob/master/covsirphy/phase/sr_change.py

Inglezos commented 3 years ago

Yes the runtime error (for Germany) is fixed now in version 2.9.1-alpha.fix.282. The underlying problem is why RMSLE scores for the phases recognition is that big. We would expect the trends to fit well the actual data curve, right? As they do well for the other countries.

Scenario.trends() is not working properly for India, USA, Greece and Brazil. It doesn't show colored sections for phases in graphs

It's in the specifications #121 Please change Scenario.trend(max_rmsle=20) to Scenario.trend(max_rmsle=100) etc.

I don't think that the max_rmsle value is the problem. For example for India, if I call trends() with default max_rmsle, then the RMSLE scores are normal and not above the default, which would explain why they are not shown in color, but that's not the case:

On the other hand, if I do what you propose and set the max_rmsle threshold to 100, then this happens upon trends() call, while the RMSLE scores are exactly the same as before:

The same still happens for USA, Greece and Brazil (as far as I tested today again), while their cases "predictions" (for USA, Brazil) are way off compared to the actual today's data. (By "predictions" I mean the simulated cases of the model based on the estimated parameters of the latest phase, I learned the difference!) I don't understand why all this happens, do I run something wrong? Or is there something problematic with the trends analysis? The relevant code section we need to rework/improve/debug is located at Scenario.trend() and the ChangeFinder class you said?

lisphilar commented 3 years ago

Sorry for not making it clear enough, but RMSLE score of Scenario.trend() is not the same as that of Scenario.estimate(). The steps of calculating RMSLE of trend analysis is as follows.

Perform curve fitting, x is Recovered, y is Susceptible
With parameter values of the curve, Susceptible will be calculated
Calculate RMSLE score of actual Susceptible and Susceptible in 2.

lisphilar commented 3 years ago

It appears that curve fitting was failed for Indea records etc. We should revise ChangeFinder class.

FYI, Call tree is as follows.

cs.analysis.scenario.Scenario.trend() calls cs.phase.phase_series.PhaseSeries.trend() 2.PhaseSeries.trend() calls cs.phase.sr_change.ChangeFinder.run(): find change points
PhaseSeries.trend() calls cs.phase.sr_change.ChangeFinder.show()
ChangeFinder.show() calls ChangeFinder._curve_fitting()
ChangeFinder._curve_fitting() calls cs.phase.trend.Trend.run(): Perform curve fitting
ChangeFinder._curve_fitting() calls cs.phase.trend.Trend.rmsle(): Calculate RMSLE score, if over max_rmsle, remove colored line from the figure.

Inglezos commented 3 years ago

I am studying the code in order to find any weak spots for improvement/debug. For the time being I have some ideas to test: i. in sr_change.py -> run(), keep more samples? ii. at rpt.Pelt(), change jump size? iii. in algorithm.fit_predict() call, change pen? iv. change min_size from 7 to something smaller v. in Trend class -> _fitting() change maxfev vi. Another cost function instead of rbf for ruptures. Other good choices could be normal, mahalanobis or rank. See also http://dev.ipol.im/~truong/ruptures-docs/build/html/costs/index.html Or create a new custom cost function (would be difficult)? See also: http://dev.ipol.im/~truong/ruptures-docs/build/html/costs/costcustom.html#sec-custom-cost, http://dev.ipol.im/~truong/ruptures-docs/build/html/general-info.html#sec-new-cost
vii. A more general question: why do we study phases at the S-R plane specifically and not for example at the C(confirmed) - D(deaths) plane? Could we use another curve more effectively to apply trends analysis?
So, in order to have the colored line for any RMSLE, I just have to comment out the lines: if trend.rmsle() > self.max_rmsle: `df[f"{self.S}{self.P}"] = None` Because I tried this, but it did not work. Maybe the curve fitting fails completely? How can we see the trends() RMSLE scores?

lisphilar commented 3 years ago

i. in sr_change.py -> run(), keep more samples?

Yes, incleasing sample size is a nice solution. I used the following codes to cut runtime, but I will try it.

# Sampling to reduce run-time of Ruptures
samples = np.linspace(0, series.index.max(), len(self.sr_df), dtype=np.int64)
series = series[samples]

ii. at rpt.Pelt(), change jump size?

Documentation of Rutures says "jump: reduce the set of possible change point indexes; predicted change points can only be a multiple of jump."

I think this should be 1 as it is because we do not have information about how many change points are there.

iii. in algorithm.fit_predict() call, change pen?

Possible.

iv. change min_size from 7 to something smaller

Default value can be changed.

v. in Trend class -> _fitting() change maxfev

This only fixes RuntimeError as discussed.

vi. Another cost function instead of rbf for ruptures. Other good choices could be normal, mahalanobis or rank.

Tried, but failed.

vii. A more general question: why do we study phases at the S-R plane specifically and not for example at the C(confirmed) - D(deaths) plane? Could we use another curve more effectively to apply trends analysis?

This is because (x, y) = (R, log(S)) are on a line for SIR, SIR-D, SIR-F, SIR-FV, SEWIR-F. Please ensure to read

How can we see the trends() RMSLE scores?

As follows.

# Setting
import covsirphy as cs
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
# Find change points
population = population_data.value(country="Japan")
sr_df = jhu_data.to_sr(country="Japan", population=population)
finder = cs.ChangeFinder(sr_df, min_size=7, max_rmsle=20.0)
finder.run()
# Get the start date and end date of a phase (e.g. 1st phase)
[
    cs.Trend(
        jhu_data.to_sr(
            country="Japan", population=population, start_date=start_date, end_date=end_date
        )
    ).rmsle()
    for (start_date, end_date) in zip(*finder.date_range())
]
# -> [0.00017165017964160256, 0.00011803137359578386, 4.346220556783464e-06, 
# 5.094880108913458e-06, 2.124030851469172e-05, 4.769860659337155e-06, 
# 2.410881918990526e-06, 2.4281797106340264e-06, 2.0745364412988465e-06, 
# 1.8926126283957956e-06, 1.5375406407613923e-05]

Inglezos commented 3 years ago

I am debugging the code and I added some print(RMSLEs) to get these scores. I am working on some fixes and I believe I have found a good solution for the trends issue we have. During debugging I saw that for some phases, upon analysis for example of India, the simulated cases were negative large numbers. I suspected that maybe this happens only for countries with very large populations -> Susceptibles and maybe some overflows upon numerical calculations during curve fitting occur. So, okay, the S-R fitting is negative exponential, but this leads as we see to problematic curve fitting situations, especially in case of large numbers. Generally I think curve fitting works better with smaller numbers. What if we transformed S -> log(S), as is done in trend.run() though? Then we have to handle smaller numbers with linear fitting. That was my intuition, I changed the code a little bit and ran tests for many countries and I saw that now the trends are much much better. Every phase has minimal fitting error. Also, I changed the default min_size to 5 and kept maxfev to 5000. Also, in case the linear error is significant (larger for i.e. 0.5), then the negative exponential function will be used for fitting.

I have created a pull request with my fixes. Please check it out, test my changes and let me know what you think.

lisphilar commented 3 years ago

I agree with your idea to use linear regression and thank you for your pull request! I added some coments on it. Test codes will be in tests/test_change_finder.py (pytest) in this repository.

Inglezos commented 3 years ago

Okay so after merging my changes/fixes, if your tests also confirm that now the trends analysis runs smoothly for various countries, I think we can close this issue. But first please inform me, what should I do in tests/test_change_finder.py file? What should I add?

lisphilar commented 3 years ago

Thank you for your pull request and I merged it. I created tests and I changed some codes to perform tests efficiently with #303 .

Trend.run() was renamed to Trend._run()
New Trend.run() was created and this returns itself and this select the best function from "linear" and "negative exponential"

Sorry for update after merging your pull request.

lisphilar commented 3 years ago

In tests/test_change_finder.py, We check if RMSLE scores of all phases are lower than set max value (20.0, this means colored line will be shown) in the listed countries. TestChangeFinder.test_find() was updated. Italy, japan, Indea, United States, Greece, Russia, Brazil, France, Spain, UK, New Zealand, Germany.

lisphilar commented 3 years ago

Dear @Inglezos , Very sorry for bother you again. I will merge #303 when all tests in tests/test_change_finder.py, tests/phase_serues.py and tests/scenario.py will passed with my local environment. If we need some changes in Trend, ChangeFindeer and test codes, please create pull requests or issues. Thank you.

Inglezos commented 3 years ago

Sorry, I am busy these days at work. Yes sure, if I find any issue I'll let you know.

lisphilar commented 3 years ago

Thank you for your discussion and pull request. I will close this issue.

lisphilar / covid19-sir