[New/Revise] deprecation of SEWIR-F parameter estimation because Rt explodes, despite RMSLE score is low

andrybicio commented 3 years ago

Summary of question

Dealing with Italy dataset and using S-R trend analysis to compute the phases, we obtain some values that are not realistic when using SEWIR compartmental model to fit the curve. Rt estimation may even explode (even to 1000!). However, RMSLE is quite low.

What tried

I haven't set any further parameter for the trend estimation, it is the "very basic and default" one.

Code is presented here in the following URL, I am running locally on my laptop the rest of parameters estimations.

https://github.com/andrybicio/LDE_Project/blob/main/CovSIRphy_Analysis_v3.ipynb

For estimating the model I used:

National_scenario = cs.Scenario(jhu_data, population_data, country="Italy", province= None, tau=1440)
# Fix the first date of records
National_scenario.first_date = "24Mar2020"
# Fix the last date of records
National_scenario.last_date = "08Jan2021"
National_scenario.trend()
National_scenario.estimate(cs.SEWIRF, timeout = 360)

maybe the timeout is too low? Needs to be increased further since it does not reach convergence?

lisphilar commented 3 years ago

Using SEWIR-F model for parameter estimation is not recommended because we have only four observed variables (S, I, R, F) and this model needs 6 variables.

Please use SIR/SIR-D/SIR-F model for parameter estimation, or
run parameter estimation with fixed some parameters. (this is not for the above root problem, but may improve the situation.)

Example for the second solution:

Pre-estimate some of the SEWIR-F parameters: kappa/sigma/theta are also included in SIR-F
Estimate all parameter values with some fixed (pre-estimated) parameters

https://gist.github.com/lisphilar/b474c58dd4ae785fbee3d43287aa193f For quick note, timeout is set to 10 sec.

andrybicio commented 3 years ago

I will take care and try. I didn't understand this could be done. Thank you

andrybicio commented 3 years ago

As you can see from your notebook (last cell executed), timeseries of newly fitted parameters "rho1", "rho2" and "rho3" is constant and does not change wrt phase. I'm not sure this is correct.

It happened to mine as well. https://github.com/andrybicio/LDE_Project/blob/main/LDE_Project_Raspberry-Def.ipynb

Inglezos commented 3 years ago

we have only four observed variables (S, I, R, F)

One secondary question on that, what are the actual/observed variables? I am aware that each country for sure announces daily the confirmed, the deaths and the recovered as well. That doesn't give us at most three observed variables? The fourth, the infected is calculated from the other three. Is this correct? Or some countries announce the active cases/infected additionally? Because if so, I think that we don't use this at the moment.

andrybicio commented 3 years ago

The susceptible S are extracted from the population dataset for each country/province. But (C, R, F) are indeed announced as you correctly said from the government, and you compute the actual infected (I) by subtracting the number of recovered (R) and fatalities (F) from the confirmed cases (C).

Sometimes it happen, by the way, that also active cases (people in hospital, ICU, quarantined) are announced.

lisphilar commented 3 years ago

Consistent rho1/rho2/rho3: Thank you for your feedback, I will check it later.

Observed variables: Yes, the correct variables are "Population" and C/F/R. Because we have ~~one-to-one correspondence~~ conversion rule (S, I) = (Population - C, C - F - R), pseudo observed variables are S/I/F/R.

The number of patients in ICU etc. are reported as mentioned and it will be great to add dataloader for them (but may need auto complement method for some countries..). They may be added to the next ODE model discussed in #379.

Because Exposed and Waiting are not observed, parameter estimation with SEWIR-F model could be deprecated. I expected we can calculate some of rho1/rho2/rho3 using linelist data, but seems not.

andrybicio commented 3 years ago

Regarding the parameter estimation for serial intervals and so forth using the linelist data, I would suggest the following notebook which I am taking inspiration from too. Some URLs are out-dated however.

https://github.com/micheletizzoni/covid-19/blob/master/Realtime%20R%20-%20MCMC.ipynb

This is the Python version. In the beginning of the notebook there is the link to the original R version.

[UPDATE] The developer of this library told me there is a most update version for the notebook above, which can be found at the following link: https://github.com/rtcovidlive/rtlive-global

lisphilar commented 3 years ago

Today I don't have much time for coding, but I agree with the idea

to calculate the period from onset date to confirmation date with linelist data, and
to estimate the number of onset (=infected without confirmation and waiting for PCR test, W) cases with the calculated period.

This seems like full complement of recovery data in some countries with recovery period.

[Update] If this solution will be used, we need to add "Onset_date" column to LinelistData.cleaned(). The data source of LinelistData is the same as the notebook mentioned.

andrybicio commented 3 years ago

Yes, indeed I tried to take a look at the df of LinelistData.cleaned() and column was not present, therefore I downloaded it manually. Moreover the distribution of time between onset and confirmed times has too many values present at dt=0, if you run it with values at present times. This does not convince me so much, honestly, but could be due to the fast response of healthcare system.

lisphilar commented 3 years ago

I calculated elapsed time (confirmed date - onset date) and get many negative values. As reported by media, many cases shows symptoms after PCR confirmation. I wrote "onset = infected without confirmation and waiting for PCR test", but we may need to revise this definition. https://gist.github.com/lisphilar/6ded6490b4f902cb16c3dcd30ec776d2

andrybicio commented 3 years ago

I would suggest there is an error in methodology (or in the dataset?) since frequence for negative values is way too large...

I'm seeing in the notebook I linked before (*) that it explicitely keeps such "positive" records, I think we should keep only that subset as well:

# Only keep records where confirmed > onset
patients = patients[patients.Confirmed >= patients.Onset]

On the contrary, there are too many values with large negative value (how can a person show symptoms 100-200 AFTER it has been tested positive?). I mean anything could happen in Nature, but they should appear as outliers, which is not our case.

(*) there is an updated one, with newer methodology, please see above. I'm taking a look into it at the moment)

Inglezos commented 3 years ago

Please have a look manually in the original csv file in excel. If you confirm these weird dates, then this would probably be an issue in the dataset. This seems very possible. Then contact the source team or open a github issue in their repo. If they from their side collect correct data indeed, then I am afraid the Healthcare records themselves are wrong in the first place as they are filled by the hospitals and published, and in that case we cannot further process these, we will have to exclude them.

andrybicio commented 3 years ago

If they from their side collect correct data indeed, then I am afraid the Healthcare records themselves are wrong in the first place as they are filled by the hospitals and published, and in that case we cannot further process these, we will have to exclude them.

yes my issue refers mainly to the quality of data. Sometimes hospitals wait for some time to record the number of recovered people, therefore you see unpredictable spikes (and non derivable points). In Italy I know this occurs quite often. Unfortunately this data has to be "corrected", "filtered" and/or "completed" somehow. Moreover it happened also that the daily number of cases was negative, due to some corrections applied on the following day, therefore these imprecisions must be accounted for somehow. When daily number of cases was negative, I used to set it to zero, but this was my choice.

Fun fact: Lombardy (you know the italian city "Milan") has been reporting wrong data since November, and that is one of the largest/richest region of our country. This led to more and stricter restrictions, when maybe this wouldn't have been needed. Just to tell you how data can be biased already at the source.

lisphilar commented 3 years ago

With some reasons (bias or statiscally-unexpected method in data collection), outliers could be removed from our analysis. However, I can say the histograms indicate random distribution, not normal distribution and so on. Stratification may helps us, but it will be difficult to get representative values from them.

lisphilar commented 3 years ago

As discussed in #573, SIR-F model is the final model for parameter estimation at this time. Parameter estimation with SEWIR-F will raise NotImplementedError: SEWIR-F cannot be used for parameter estimation because we do not have records of Exposed and Waiting. Please use SIR-F model with covsirphy.SIRF class.

lisphilar / covid19-sir

[New/Revise] deprecation of SEWIR-F parameter estimation because Rt explodes, despite RMSLE score is low #569

Summary of question

What tried