Closed andrybicio closed 3 years ago
Using SEWIR-F model for parameter estimation is not recommended because we have only four observed variables (S, I, R, F) and this model needs 6 variables.
Example for the second solution:
https://gist.github.com/lisphilar/b474c58dd4ae785fbee3d43287aa193f For quick note, timeout is set to 10 sec.
I will take care and try. I didn't understand this could be done. Thank you
As you can see from your notebook (last cell executed), timeseries of newly fitted parameters "rho1", "rho2" and "rho3" is constant and does not change wrt phase. I'm not sure this is correct.
It happened to mine as well. https://github.com/andrybicio/LDE_Project/blob/main/LDE_Project_Raspberry-Def.ipynb
we have only four observed variables (S, I, R, F)
One secondary question on that, what are the actual/observed variables? I am aware that each country for sure announces daily the confirmed, the deaths and the recovered as well. That doesn't give us at most three observed variables? The fourth, the infected is calculated from the other three. Is this correct? Or some countries announce the active cases/infected additionally? Because if so, I think that we don't use this at the moment.
The susceptible S are extracted from the population dataset for each country/province. But (C, R, F) are indeed announced as you correctly said from the government, and you compute the actual infected (I) by subtracting the number of recovered (R) and fatalities (F) from the confirmed cases (C).
Sometimes it happen, by the way, that also active cases (people in hospital, ICU, quarantined) are announced.
Consistent rho1/rho2/rho3: Thank you for your feedback, I will check it later.
Observed variables:
Yes, the correct variables are "Population" and C/F/R. Because we have one-to-one correspondence conversion rule (S, I) = (Population - C, C - F - R), pseudo observed variables are S/I/F/R.
The number of patients in ICU etc. are reported as mentioned and it will be great to add dataloader for them (but may need auto complement method for some countries..). They may be added to the next ODE model discussed in #379.
Because Exposed and Waiting are not observed, parameter estimation with SEWIR-F model could be deprecated. I expected we can calculate some of rho1/rho2/rho3 using linelist data, but seems not.
Regarding the parameter estimation for serial intervals and so forth using the linelist data, I would suggest the following notebook which I am taking inspiration from too. Some URLs are out-dated however.
https://github.com/micheletizzoni/covid-19/blob/master/Realtime%20R%20-%20MCMC.ipynb
This is the Python version. In the beginning of the notebook there is the link to the original R version.
[UPDATE] The developer of this library told me there is a most update version for the notebook above, which can be found at the following link: https://github.com/rtcovidlive/rtlive-global
Today I don't have much time for coding, but I agree with the idea
This seems like full complement of recovery data in some countries with recovery period.
[Update]
If this solution will be used, we need to add "Onset_date" column to LinelistData.cleaned()
. The data source of LinelistData
is the same as the notebook mentioned.
Yes, indeed I tried to take a look at the df of LinelistData.cleaned() and column was not present, therefore I downloaded it manually. Moreover the distribution of time between onset and confirmed times has too many values present at dt=0, if you run it with values at present times. This does not convince me so much, honestly, but could be due to the fast response of healthcare system.
I calculated elapsed time (confirmed date - onset date) and get many negative values. As reported by media, many cases shows symptoms after PCR confirmation. I wrote "onset = infected without confirmation and waiting for PCR test", but we may need to revise this definition. https://gist.github.com/lisphilar/6ded6490b4f902cb16c3dcd30ec776d2
I would suggest there is an error in methodology (or in the dataset?) since frequence for negative values is way too large...
I'm seeing in the notebook I linked before (*) that it explicitely keeps such "positive" records, I think we should keep only that subset as well:
# Only keep records where confirmed > onset
patients = patients[patients.Confirmed >= patients.Onset]
On the contrary, there are too many values with large negative value (how can a person show symptoms 100-200 AFTER it has been tested positive?). I mean anything could happen in Nature, but they should appear as outliers, which is not our case.
(*) there is an updated one, with newer methodology, please see above. I'm taking a look into it at the moment)
Please have a look manually in the original csv file in excel. If you confirm these weird dates, then this would probably be an issue in the dataset. This seems very possible. Then contact the source team or open a github issue in their repo. If they from their side collect correct data indeed, then I am afraid the Healthcare records themselves are wrong in the first place as they are filled by the hospitals and published, and in that case we cannot further process these, we will have to exclude them.
If they from their side collect correct data indeed, then I am afraid the Healthcare records themselves are wrong in the first place as they are filled by the hospitals and published, and in that case we cannot further process these, we will have to exclude them.
yes my issue refers mainly to the quality of data. Sometimes hospitals wait for some time to record the number of recovered people, therefore you see unpredictable spikes (and non derivable points). In Italy I know this occurs quite often. Unfortunately this data has to be "corrected", "filtered" and/or "completed" somehow. Moreover it happened also that the daily number of cases was negative, due to some corrections applied on the following day, therefore these imprecisions must be accounted for somehow. When daily number of cases was negative, I used to set it to zero, but this was my choice.
Fun fact: Lombardy (you know the italian city "Milan") has been reporting wrong data since November, and that is one of the largest/richest region of our country. This led to more and stricter restrictions, when maybe this wouldn't have been needed. Just to tell you how data can be biased already at the source.
With some reasons (bias or statiscally-unexpected method in data collection), outliers could be removed from our analysis. However, I can say the histograms indicate random distribution, not normal distribution and so on. Stratification may helps us, but it will be difficult to get representative values from them.
As discussed in #573, SIR-F model is the final model for parameter estimation at this time. Parameter estimation with SEWIR-F will raise NotImplementedError: SEWIR-F cannot be used for parameter estimation because we do not have records of Exposed and Waiting. Please use SIR-F model with covsirphy.SIRF class.
Summary of question
Dealing with Italy dataset and using S-R trend analysis to compute the phases, we obtain some values that are not realistic when using SEWIR compartmental model to fit the curve. Rt estimation may even explode (even to 1000!). However, RMSLE is quite low.
What tried
I haven't set any further parameter for the trend estimation, it is the "very basic and default" one.
Code is presented here in the following URL, I am running locally on my laptop the rest of parameters estimations.
https://github.com/andrybicio/LDE_Project/blob/main/CovSIRphy_Analysis_v3.ipynb
For estimating the model I used:
maybe the timeout is too low? Needs to be increased further since it does not reach convergence?