globaldothealth / monkeypox

Mpox 2022 repository

The number of recovered cases is lower than expected #177

Closed lisphilar closed 1 year ago

lisphilar commented 2 years ago

The number of recovered cases is only 5 in total now. This is lower than expected because there were around 30,000 confirmed/suspected cases on 01Aug2022.

Notebook for calculation: https://gist.github.com/lisphilar/ae24c369d21cfeb89a673de1f6edb2b9

The recovery period is estimated at 2-4 weeks and the case fatality rate at 3-6%. By a simple calculation, this means nearly all of the roughly 30,000 cases could have recovered as of 01Sep2022.

Monkeypox is usually a self-limited disease with the symptoms lasting from 2 to 4 weeks. Severe cases can occur. In recent times, the case fatality ratio has been around 3–6%. https://www.who.int/news-room/fact-sheets/detail/monkeypox
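The simple calculation can be spelled out as a short sketch. It assumes that every case confirmed by 01Aug2022 has either recovered or died by 01Sep2022 (consistent with the 2 to 4 week duration) and applies the 3-6% fatality range quoted above; the variable names are illustrative only.

```python
# Rough bounds on recovered cases as of 01Sep2022, using figures quoted in this issue.
confirmed = 30_000                 # approximate confirmed/suspected cases on 01Aug2022
cfr_low, cfr_high = 0.03, 0.06     # case fatality ratio range from the WHO fact sheet

# Assumption: every non-fatal case has recovered within 4 weeks.
recovered_upper = round(confirmed * (1 - cfr_low))
recovered_lower = round(confirmed * (1 - cfr_high))

print(f"Expected recovered: {recovered_lower:,} to {recovered_upper:,}")
```

Either bound is several orders of magnitude above the 5 recovered cases currently recorded.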

@aimeehan1 indicated as follows on https://github.com/globaldothealth/monkeypox/issues/127#issue-1303723676.

"Active" versus "recovered/inactive" case status (no longer have the clinical symptoms of monkeypox, they have recovered from acute illness). Example, Italy, Andalusia cases have been reported as active case totals, but we are tracking cumulative totals. Reminder to curators to check cumulative counts (active + inactive). Due to limited metadata, we are not currently able to update individual case status to "recovered/inactive." https://www.rtvsol.es/noticias/andalucia/salud-y-familias-informa-de-que-actualmente-en-andalucia-hay-193-casos-activos-de-viruela-del-mono

We may need to revise the data dictionary or the curation system.

lisphilar commented 1 year ago

Is it possible to retrieve recovered data from the primary sources? I know the line list data was deprecated; only the number of cases (cumulative or daily new) is necessary.

ksewalk commented 1 year ago

WHO does not provide this information, so it is not possible. cc: @jim-sheldon @Mougk

jim-sheldon commented 1 year ago

WHO data:

COUNTRY | ISO3 | WHO_REGION | WHO_REGION_SHORTNAME | CasesAll | CasesLast24Hours | CasesLast7Days | DeathsAll | DeathsLast24Hours | DeathsLast7Days | Latitude | Longitude | MPX_DataAvailable | LASTREPDATE

For my own edification, are there techniques to estimate the number of recovered cases, or is that a bad idea?

lisphilar commented 1 year ago

We can estimate the cumulative number of recovered cases if the "recovery period" [days] (the time between case confirmation and recovery) is available.

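recovery_period = 10  # days; for COVID-19, 7 - 21 days in my analysis
# df must have a DatetimeIndex: freq="D" shifts the dates, and the assignment realigns on them
df["Recovered"] = (df["Confirmed"] - df["Fatal"]).shift(periods=recovery_period, freq="D")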

https://github.com/lisphilar/covid19-sir/blob/2ae6e194884475b02fe418334b287d813d7f6550/covsirphy/engineering/_complement.py#L296 https://lisphilar.github.io/covid19-sir/02_data_engineering.html#5.2-Details-of-data-complement
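As a self-contained illustration of the shift approach, here is a minimal sketch on made-up numbers. The function name `estimate_recovered` and the toy data are hypothetical and not part of covsirphy or this repository. Filling the first `recovery_period` days with 0 is also an assumption of this sketch.

```python
import pandas as pd


def estimate_recovered(df: pd.DataFrame, recovery_period: int) -> pd.DataFrame:
    """Estimate cumulative recovered cases by shifting (Confirmed - Fatal) forward in time.

    Hypothetical helper: expects a DatetimeIndex and cumulative "Confirmed"/"Fatal" columns.
    """
    shifted = (df["Confirmed"] - df["Fatal"]).shift(periods=recovery_period, freq="D")
    out = df.copy()
    # Realign on the original dates; early dates have no estimate, so use 0 (assumption).
    out["Recovered"] = shifted.reindex(df.index).fillna(0).astype(int)
    return out


# Toy cumulative counts, invented for illustration only.
dates = pd.date_range("2022-08-01", periods=6, freq="D")
raw = pd.DataFrame(
    {"Confirmed": [10, 20, 30, 40, 50, 60], "Fatal": [0, 1, 1, 2, 2, 3]},
    index=dates,
)
result = estimate_recovered(raw, recovery_period=2)
print(result)
```

With `recovery_period=2`, each day's estimate equals confirmed minus fatal cases from two days earlier, so the output is only as good as the chosen period.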

Regarding Monkeypox, we may need to find papers later to select a specific value, but recovery_period could be 14 - 28 days.

Monkeypox is usually a self-limited disease with the symptoms lasting from 2 to 4 weeks.

https://www.who.int/news-room/fact-sheets/detail/monkeypox

Is this not a role of a database?

jim-sheldon commented 1 year ago

Ah, cool, thanks for that!

Could you clarify "is this not a role of a database"? Happy to add to our db if you want :)

lisphilar commented 1 year ago

Thank you for your positive comment!

One weak point of this method is that it depends heavily on the accuracy of the recovery period estimate. Because only five cases are listed in the (deprecated) line list, it is very difficult to decide the exact value of the recovery period. Users may regard the recovery data as raw data mistakenly.

Alternatives:

jim-sheldon commented 1 year ago

"Users may regard the recovery data as raw data mistakenly" In general this is a trap we try to avoid, while still providing as much data as possible. We also find it frustrating that this leads to incomplete data sets.

For the purposes of estimating numbers of recovered cases, we could make a script. Feel free to make a ticket with requirements and assign it to me. Of course, I want to be careful about how we label and share any forecasts and/or estimates. If we added this to our website we would need to clearly indicate to users what is data and what is probability.

lisphilar commented 1 year ago

If we added this to our website we would need to clearly indicate to users what is data and what is probability.

How about the following changes to the directory tree of this repository? (I may not understand the current tree; please correct the following.)

Currently:

(Where is the time series data regarding the number of cases?)

One idea:

For the purposes of estimating numbers of recovered cases, we could make a script. Feel free to make a ticket with requirements and assign it to me.

This project has many tasks, and I would be happy to make a pull request, including a new Python script that creates ./timeseries_estimated.csv from ./timeseries.csv. Just let me know.

Of course, I want to be careful about how we label and share any forecasts and/or estimates. If we added this to our website we would need to clearly indicate to users what is data and what is probability.

For example, ./analysis/timeseries_estimated.csv would be created with recovery_period=21 [days]. Then the URL of the Python script and a warning (no information is available on the exact value of the recovery period at this time) would be documented in the README file.

jim-sheldon commented 1 year ago

Unless it is marked as deprecated, everything in the repo is in use. I intend to remove deprecated files soon; users can access them in the archives on S3, or look through the git commit history. I also need to update the data dictionary.

s3_ui is for accessing archived data. agency_ingestion takes CDC and WHO data and puts it in our database and S3 bucket, and puts ECDC data in our S3 bucket. gh_data_update takes CDC and WHO data and turns it into G.h data and stores it in the database and S3. map_timeseries creates a timeseries data file that our map visualization reads from S3.

latest.csv is the most up to date G.h data, using the most up to date WHO and CDC data.

Regarding forecasting and estimation, unfortunately I don't think documentation will suffice. I am interested in doing this work, but also want to take every possible precaution to ensure viewers of our outputs do not take it as data or prediction. The right balance might be doing that work in a private repo, only sharing it with trusted people, and any publicly-shared results would be carefully labelled (or limited to something like a timeseries graph with confidence intervals). @aimeehan1 @ksewalk @abhidg What do y'all think?

abhidg commented 1 year ago

@lisphilar @jim-sheldon We should not put estimated data in this repository. Furthermore, using a single recovery_period will give incomplete estimates without knowing the standard deviation of the underlying distribution. The distribution of recovery times is necessary to produce a timeseries with confidence intervals. Modelling epidemics is usually done with variations of the SIR model (https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIR_model_2), which are compartmental models (deterministic or stochastic) for estimating the number of infectious (and recovered) people.
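To make the point about distributions concrete, here is a minimal Monte Carlo sketch. It assumes recovery times follow a gamma distribution with a mean of 21 days and a standard deviation of 4 days; both values, the daily case counts, and the function name `simulate_recovered` are invented for illustration and are not estimated from any data.

```python
import numpy as np


def simulate_recovered(new_cases, mean_days, sd_days, n_draws=500, seed=0):
    """Monte Carlo estimate of cumulative recovered cases with a 95% interval.

    Hypothetical helper: new_cases holds daily non-fatal case counts.
    """
    rng = np.random.default_rng(seed)
    n_days = len(new_cases)
    # Gamma distribution parameterised by its mean and standard deviation.
    shape = (mean_days / sd_days) ** 2
    scale = sd_days ** 2 / mean_days
    onset_day = np.repeat(np.arange(n_days), new_cases)  # one entry per case
    curves = np.empty((n_draws, n_days))
    for i in range(n_draws):
        delays = rng.gamma(shape, scale, size=onset_day.size)
        recovery_day = onset_day + np.ceil(delays).astype(int)
        # Count recoveries per day inside the window, then accumulate.
        daily = np.bincount(recovery_day[recovery_day < n_days], minlength=n_days)
        curves[i] = np.cumsum(daily)
    low, median, high = np.percentile(curves, [2.5, 50, 97.5], axis=0)
    return low, median, high


# Made-up daily non-fatal case counts over 60 days.
new_cases = np.array([5, 8, 12, 20, 30] + [40] * 55)
low, median, high = simulate_recovered(new_cases, mean_days=21, sd_days=4)
print(f"Day 60 recovered: {median[-1]:.0f} (95% interval {low[-1]:.0f} to {high[-1]:.0f})")
```

The interval reflects only the sampling of recovery delays under the assumed distribution. Uncertainty about the distribution's own parameters is not captured, and that is precisely the information missing here.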

lisphilar commented 1 year ago

@jim-sheldon @abhidg Yes, thank you for the discussion; we can mark this issue as not planned. The method depends heavily on the recovery period value, and we do not have the data to calculate a representative value or its standard deviation. My idea was to add a script (and CSV file(s)) with an example value in a separate directory (./analysis) from the raw data (cumulative number of confirmed/fatal cases), as I mentioned in the bullet list of the previous comment, but I agree with your comments that estimated data should be excluded from this repository entirely to avoid confusing users. This should be handled in software instead, because libraries (packages) can provide an interface for selecting the recovery period, like df = complement_recovered(raw, recovery_period=21).

(Additionally, in latest.csv, only the "_id", "Case_status" and "Location" columns have non-NA values. This means we can calculate only the cumulative number of cases. The cumulative number of fatal cases is unavailable at this time.)

(To simulate the number of cases with SIR-like models, ODE parameter values, including beta and sigma, are necessary. To estimate them, we require a line list or the set of cumulative numbers of confirmed/recovered/fatal cases. Regarding COVID-19, the UK does not provide recovery data. The method, with a recovery period estimated from other countries' data, is very helpful for analysing UK data with SIR-like models using the same workflow as for other countries, with due caution.)
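For readers unfamiliar with SIR-like models, a minimal deterministic sketch follows. It treats beta as the transmission rate and sigma as the recovery rate (the reciprocal of the recovery period), which is an assumption about the naming used above. The parameter values, population size, and function name `simulate_sir` are illustrative only and are not fitted to Mpox data.

```python
def simulate_sir(population, infected0, beta, sigma, days, steps_per_day=10):
    """Integrate a basic SIR model with forward Euler; return daily (S, I, R) tuples.

    Hypothetical helper: beta = transmission rate [1/day], sigma = recovery rate [1/day].
    """
    s, i, r = float(population - infected0), float(infected0), 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = sigma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative values only: sigma = 1/21 corresponds to a 21-day recovery period.
history = simulate_sir(population=1_000_000, infected0=10, beta=0.2, sigma=1 / 21, days=120)
s_end, i_end, r_end = history[-1]
print(f"Day 120: S={s_end:.0f}, I={i_end:.0f}, R={r_end:.0f}")
```

In practice beta and sigma would be estimated from observed confirmed/recovered/fatal counts, which is why missing recovery data makes the fit difficult.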

@jim-sheldon Thank you for clarifying the directory tree and I'm sorry for my misunderstanding.

jim-sheldon commented 1 year ago

@lisphilar all good, no need to apologize; thank you for the background and explanation!