[Discuss] Regression tests - multiple countries analysis

Inglezos commented 4 years ago

Summary of new feature

Each time a new feature is developed or mostly when a new stable version is about to be released, I think that it would be very useful some regression tests to run every time, in order to assure that the system operates without any substantial analyses errors.

Solution

This could be done by automated regression tests. They could run analyses and simulations for a certain number of predefined countries (for example China, Japan, Italy, India, USA, Greece, Russia, Spain, France, Brazil, Germany, UK, Netherlands, etc) and would measure the validity of the predictions. For example, let's say if for the previous day the predictions were inside 5% error margin for confirmed and death cases for every country, then the tests are considered successful. If not, then the user will be notified to investigate further accordingly. Something like that, I hope you can visualize the general concept that I mean.

lisphilar commented 4 years ago

I think, with actual data, we are unable to validate the accuracy of models and parameter estimation tools. This is because the parameter values are changed every day with many many reasons, including government responses and behavior of people.

How do you think about validation with theoritical data? We know the parameter values in advance. Codes are in

https://github.com/lisphilar/covid19-sir/blob/master/example/sirf_model.py
SIR to SIRF chapter > SIR-F model section > Hyperparameter optimization subsection in the Kaggle notebook

Inglezos commented 4 years ago

How would such theoretical data be generated? What do you mean? Are we saying the same thing?
What I am suggesting is this: we have the estimated parameters, therefore we can make some predictions about confirmed/deaths/recovered cases. By theoretical data do you mean that predictions?
Wouldn't it suffice to simply compare the predicted cases with only the previous day's actual cases? I mean if the predictions are totally off, which major case we want to cover in these tests, then this simple comparison would easily alert us for any substantial errors. At first, we don't need to make sophisticated regression tests, I think something like this would be enough. After that, we can improve this mechanism and add more robust tests. The important thing is to test for many countries.

lisphilar commented 4 years ago

How would such theoretical data be generated?

As I mentioned in the previous comment, please refer to https://github.com/lisphilar/covid19-sir/blob/master/example/sirf_model.py

What I am suggesting is this:...The important thing is to test for many countries.

Ok, I got it. I agree with you that, as the first test, the test you mentioned is enogth.

Yes, we need to perform parameter estimation for many countries so that we can monitor the results (figures and output of Scenario.summary()). Until we solve #225 , we will use https://lisphilar.github.io/covid19-sir/ Please create a pull request to add .ipynb to ./example directory. CI tool will run the notebook every day and the notebook will be included in the site.

Inglezos commented 4 years ago

Sorry, but I have to ask in advance just to make sure I understood correctly. This notebook what would contain exactly? Do you want me to prepare a working code that will make simple analyses for a handful of countries one after the other consecutively? So whatever .ipynb I add in the example folder will be added to the https://lisphilar.github.io/covid19-sir/ by the CI tool and will be executed regularly and automatically ?

lisphilar commented 4 years ago

Sorry, but I have to ask in advance just to make sure I understood correctly. This notebook what would contain exactly? Do you want me to prepare a working code that will make simple analyses for a handful of countries one after the other consecutively?

Please discard this request if this is not your point. I have misunderstood. Is this a duplication of #276 ?

So whatever .ipynb I add in the example folder will be added to the https://lisphilar.github.io/covid19-sir/ by the CI tool and will be executed regularly and automatically?

Yes, but semi-automated. I click the start button of the CI tool every day.

Inglezos commented 4 years ago

No it is not a duplicate, but related though to that issue, since both need the accuracy metric for evaluation. My point was to run every time analyses for multiple countries, in order to see the model's behavior, a kind of regression testing. Did you understand something else? What notebook did you mean I should create? This multiple countries analysis / regression tests can be used in combination with #290.

lisphilar commented 4 years ago

What codes do you need to document?

Inglezos commented 4 years ago

I don't need to document something. I mean the following point: Every time we need to test the complete system for robustness or a release is imminent, we need to run a regression test. Regression test is a test that runs in order to assure that at least the previous features and implementations run effectively as expected and that everything is okay. In our case, that test could be simply the following: Run country analysis for a handful of countries (i.e. Italy, Japan, India, USA, Greece, Russia, Brazil, France, Spain, UK, Netherlands, Germany). For each country the test shall compare the yesterday's estimated confirmed/deaths cases with the actual ones and if their deviations are inside a specific error threshold (i.e. 5% error), then the test for that country is passed. Similarly, all the tests for all the countries shall pass. Then we can say that the regression test passed successfully and the system runs correctly. At first, this simple test is enough I think. This test could be a combination of all the pre-existing example country analyses *.py scripts you have under the examples/ folder and all these lines of code, as well as for the rest countries, could be put into a single notebook which will run constantly with the CI tool.

lisphilar commented 4 years ago

Actually, test codes with pytest are in tests/ directory of this direcotory. CI tool tests them for all commits and pull requests with the deault branch. How about testing the following codes to monitor RMSLE scores of last phases?

snl = Scenario(jhu_data, population_data, country)
snl.trend()
snl.estimate(cs.SIRF, phases=["last"])
df = snl.summary()
assert df.loc[df.index[-1], "RMSLE"] < 0.1

RMSLE score is better than difference of the number of cases because we may have a new phase tomorrow.

Inglezos commented 3 years ago

RMSLE score is better than difference of the number of cases because we may have a new phase tomorrow.

Perhaps RMSLE score may be technically more representative because of that, of the phase change, but I think the substantial regression test would be to check the difference of the number of cases. Because if this comparison fails, then for sure the simulation is wrong. While on the opposite, if the RMSLE comparison is okay, this does not guarantee that the simulated cases are close enough, and thus "accurate", to the actual ones. For me, that threshold would be for the large scale cases (more than 1 million for example) to be 3%, while for the medium and small scale (up to few hundreds of thousands) to be 5%, for example. The cases comparison would be only for the previous day (or a few days), so no worries about phase change, since the simulated cases have already considered the different phases and they have calculated these simulated cases based on that fact.

lisphilar commented 3 years ago

I think we discussed in another issue, but the difference of actual values and simulated values with fixed parameters does not mean directly failure of parameter estimation because parameters will be changed by government responses and activities of persons. Only when today and tomorrow are included in the same phase, the difference means directly failure of parameter estimation.

In fact, parameters were sharply changed two times in Japan (from the figure of S-R trend analysis).

lisphilar commented 3 years ago

And that we need to focus on the trends, not exact number of cases, because we know the diffculty to get exact values with high accuracy.

lisphilar commented 3 years ago

Your ideas will be the best solution when we will create machine learning models to find the relationship of OxCGRT index and estimated parameter values. But, not for evaluation of parameter estimation.

Inglezos commented 3 years ago

Yes I can understand all these for future phases, but why is this comparison wrong for the previous phases, which are all known? Our trends analysis is correct and can be trusted if the phase RMSLE is relatively small, <<0.1, right? If this is correct, then we have successfully split the timeseries into correct phase divisions without significant error. Thus, the past phases are fixed and predetermined and based on them we calculate the estimated cases for these past phases we know about and we have made sure they are identified correctly. Could you give me a specific country example perhaps where this solution would be wrong?

lisphilar commented 3 years ago

Yes I can understand all these for future phases, but why is this comparison wrong for the previous phases, which are all known?

I thought this was for future dates, including tomorrow. I got it. We will talk about past phases.

Our trends analysis is correct and can be trusted if the phase RMSLE is relatively small, <<0.1, right? If this is correct, then we have successfully split the timeseries into correct phase divisions without significant error.

Absolutely.

Thus, the past phases are fixed and predetermined and based on them we calculate the estimated cases for these past phases we know about and we have made sure they are identified correctly.

Do you mean we need to evaluate all phases as one time series data? RMSLE score is calculated for each phase.

Is the following script meaningful? Outputs are in https://gist.github.com/lisphilar/da89ad0ecac8b15548d4034073c32a73

snl = cs.Scenario(...)
# Actual data
rec_df = snl.records()
# S-R trend analysis and parameter estimation
snl.trend()
snl.estimate(cs.SIRF)
# Simulation with past phases
sim_df = snl.simulate()
# The difference of actual/simulated number of cases
diff_df = (rec_df.set_index("Date") - sim_df.set_index("Date")).dropna()
# abs(variable) / (variable_actual + 1)
df = (diff_df.abs() / (rec_df.set_index("Date") + 1)).dropna()
comp_df = df.loc[:, ["Infected", "Recovered", "Fatal"]]
# Evaluation
comp_df.median()

Inglezos commented 3 years ago

Yes! I meant exactly that!! And the comparison only for the past week is enough or the past few phases. This can be further modified and reworked but the basic idea is this yes :)

lisphilar commented 3 years ago

OK, how do want to modify the previous script?

We will only use the last two phases not to take long time in testing. Tests in tests direcrectory takes about 30 mininutes at this time and it is flustrating when development.

Inglezos commented 3 years ago

The complete regression test shall run for various countries; my current testing countries list is Italy, Japan, India, USA, Greece, Russia, Brazil, France, Spain, UK, Netherlands, Germany.
Then for each country, a total diff value must be calculated, for the last two phases. This could be for example the root mean square error of (simulated cases - actual cases) over these two past phases and per day, for confirmed, deaths and recovered cases.
Finally we shall compare that total diff value to a specific threshold.
That threshold, for the large scale confirmed cases (more than 1 million for example) could be 3%, while for the medium and small scale (up to few hundreds of thousands) could be 5%, for example.
In order to avoid unnecessary double executions of phases and to cut running time, we could have this total diff value as an internal class variable and calculate this every time the estimator runs. As a result, the regression test (for each country) will be a simple comparison to the threshold and will be a quick operation.

lisphilar commented 3 years ago

Then for each country, a total diff value must be calculated, for the last two phases. This could be for example the root mean square error of (simulated cases - actual cases) over these two past phases and per day, for confirmed, deaths and recovered cases.

Infected/Fatal/Recovered will be used. Confirmed is the total value of them.

That threshold, for the large scale confirmed cases (more than 1 million for example) could be 3%, while for the medium and small scale (up to few hundreds of thousands) could be 5%, for example.

How do you implement this system with Python script?

In order to avoid unnecessary double executions of phases and to cut running time, we could have this total diff value as an internal class variable and calculate this every time the estimator runs. As a result, the regression test (for each country) will be a simple comparison to the threshold and will be a quick operation.

I could not grasp the whole picture. please shere the codes.

Inglezos commented 3 years ago

For example a very simple and basic script could do something like this:

import covsirphy as cs
import numpy as np

data_loader = cs.DataLoader(directory="kaggle/input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()

ERROR_THRESHOLD_HIGH_CASES = 3
ERROR_THRESHOLD_MED_LOW_CASES = 5
CONFIRMED_CASES_THRESHOLD = 1000000

for country in ["Italy", "Japan", "India", "USA", "Greece", "Brazil", "Russia", "Spain", "France", "UK", "Germany"]:
    snl = cs.Scenario(jhu_data, population_data, country)

    # Actual data
    rec_df = snl.records()

    # S-R trend analysis and parameter estimation
    snl.trend()
    snl.estimate(cs.SIRF)

    # Simulation with past phases
    sim_df = snl.simulate()

    # The difference of actual/simulated number of cases
    diff_df = (rec_df.set_index("Date") - sim_df.set_index("Date")).dropna()
    df = (diff_df.abs() / (rec_df.set_index("Date") + 1)).dropna()
    comp_df = df.loc[:, ["Confirmed",  "Fatal", "Recovered"]]
    comp_df_pow = comp_df.pow(2)
    comp_df_pow_sum=comp_df_pow.sum(axis=1)
    comp_df_pow_sum_sqrt=np.sqrt(comp_df_pow_sum)
    comp_df_pow_sum_sqrt_normalized=comp_df_pow_sum_sqrt/3
    comp_df_pow_sum_sqrt_percentage = 100*comp_df_pow_sum_sqrt_normalized

    if rec_df.tail(1)["Confirmed"].iloc[0] >= CONFIRMED_CASES_THRESHOLD and \
       comp_df_pow_sum_sqrt_percentage[-2] >= ERROR_THRESHOLD_HIGH_CASES:
           print("Accuracy error for " + country)
           print("error is " + str(comp_df_pow_sum_sqrt_percentage[-2]) +"%")
    elif rec_df.tail(1)["Confirmed"].iloc[0] < CONFIRMED_CASES_THRESHOLD and \
       comp_df_pow_sum_sqrt_percentage[-2] >= ERROR_THRESHOLD_MED_LOW_CASES:
           print("Accuracy error for " + country)
           print("error is " + str(comp_df_pow_sum_sqrt_percentage[-2]) +"%")
    else:
        print("Accuracy test for " + country + " is OK!")

In the above simple example code, I only consider the yesterday's values. As we agreed this will have to be extended to the last two phases instead, of course.

It is a quick operation as you see. The comparison could be integrated inside the simulate(), as a verification post-process step that shall run automatically every time for each country upon snl.simulate() call and inform us properly. (the internal class variable I mentioned earlier is not needed after all)

lisphilar commented 3 years ago

Thank you for sharing the codes! We have the two following steps.

Create Scenario.score() method. This will be discussed in #319. The difference of actual and simulated number of cases will be evaluated with metrics, including "MAE", "MSE", "MSLE", "RMSE" and "RMSLE". I will create and merge a pull request so that we can use the method soon. Scenario.simulate() was not updated because the main usage is to simulate the number of cases in the future phases with fixed or predicted paramer values. Simulation in past phases is not the main usage.
Add tests you mentioned. Tests will be in tests/test_scenario.py. We will edit TestScenario.test_score() method. We need to decide metrics and threshould carefully and not to prolong runtime of tests (30 min at this time).

Additional information: To evaluate the last two phases, we will use the following.

snl.trend()
all_phases = snl.summary().index.tolist()
snl.disable(phases=all_phases[:-2])
snl.estimate(SIRF)
snl.score()

lisphilar commented 3 years ago

This will be disucussed in forum #435.

lisphilar / covid19-sir

[Discuss] Regression tests - multiple countries analysis #279

Summary of new feature

Solution