lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0
109 stars 44 forks source link

[New] Extending support for County level COVID-19 Impact Assessment #851

Closed AnujTiwari closed 3 years ago

AnujTiwari commented 3 years ago

Hi All, Please help me to utilize the CovsirPhy for US County Level COVID-19 Impact Assessment. My aim is to index all 3142 US counties as per the impact of COVID-19.

I can request the local health department for clinical datasets. Please help me to understand the following points:

  1. What are the minimum parameters required to perform the county-level COVID-19 impact assessment using CovsirPhy.
  2. If I can collect county-level vaccination and variant details, is it possible to utilize them with the current model.
  3. What is the best output parameter (like R0) we can use for comparative evaluation of COIVD-19 impact and indexing counties.

I am very much new to the domain of compartmental modeling so, please help me to understand if I am leaving some important details required to extend the model for the county level.

Thanks

lisphilar commented 3 years ago

Thank you for creating issues!

My aim is to index all 3142 US counties as per the impact of COVID-19. I can request the local health department for clinical datasets.

"COVID-19 Hub" (our main data source for JHUData etc.) provides state-level records for US. You can confirm this as follows.

import covsirphy as cs
loader = cs.DataLoader()
jhu_data = loader.jhu()
jhu_data.layer(country="US").Province.unique()

If you have the county-level dataset as a CSV file in your local environment, you can create JHUData instance via CountryData class. Please refer to Use a local CSV file which has the number of cases and regard "county" as "province" when you are using covsirphy library.

  1. What are the minimum parameters required to perform the county-level COVID-19 impact assessment using CovsirPhy.

If I understand this question correctly, you are asking about the required columns of the county-level dataset. You need the following data.

  1. If I can collect county-level vaccination and variant details, is it possible to utilize them with the current model.

At the latest version 2.21.0, it is not implemented. We need some updates of classes.

  1. What is the best output parameter (like R0) we can use for comparative evaluation of COIVD-19 impact and indexing counties.

Yes, we can use R0 as the index, but R0 is calculated with parameter values of ODE models. Please refer to Usage: SIR-derived models and I recommend to use the parameter values directly as the index.

AnujTiwari commented 3 years ago

Thank you lisphilar for clearing this up for me. Regarding vaccination and variant details, whether you are working on the updates? If I can include at least the vaccination dataset that will be a great success for me.

lisphilar commented 3 years ago

Could you provide the details of the dataset you have? I'd like to write a new class as the handler, testing it with the dataset, if available.

I plan to create new class LocalDataLoader tentative name) to handle datasets saved in local environment. Usage may be similar to DataLoaderclass.

lisphilar commented 3 years ago

Note for updating codes: When we create LocalDataLoader class, we should add date_format argument because dates are sometimes parsed incorrectly as discussed in #856.

AnujTiwari commented 3 years ago

Could you provide the details of the dataset you have? I'd like to write a new class as the handler, testing it with the dataset, if available.

  • column names
  • some rows
  • file extension (CSV?)
  • if opened, URL

I plan to create new class LocalDataLoader tentative name) to handle datasets saved in local environment. Usage may be similar to DataLoaderclass.

Hi lisphilar I am working on collecting all the required datasets. Soon update them here for further processing.. Thanks.

AnujTiwari commented 3 years ago

Hi lisphilar, I am in the process of getting the dataset from local health departments. I apologize for the long delay. Meanwhile, please let me know how to export the intermediate maps (snl.trend().summary()) in high resolution .tif image format. Thank you

lisphilar commented 3 years ago

@AnujTiwari , Thank you for letting me know. With some pull requests linked to this issue, I'm preparing to add new features to DataLoader class (LocalDataLoader will not be created).

Meanwhile, please let me know how to export the intermediate maps (snl.trend().summary()) in high resolution .tif image format. Thank you

When we use CovsirPhy with Jupyter Notebook, you can export TIFF files as follows.

snl.interactive = False
snl.trend(filename="image_filename.tiff", dpi=300)

When we use it with Python scripts, snl.interactive = False is un-necessary.

snl.trend() (and all methods which create figures) calls matplotlib.pyplot.savefig() internally. Keyword arguments of matplotlib.pyplot.savefig(), including dpi=300, can be used as the keyword arguments of snl.trend(). If not, that may be a bug.

AnujTiwari commented 3 years ago

Hi lisphilar, As per your suggestions, I have implemented cook county datasets (confirmed, fatal, population and recovered=0). Everything went smooth and I was capable of generating results for cook county with S-R trend analysis phases. But I have some doubts:

  1. My dataset is starting from 24Jan2020 but my S-R trend first phases was starting from 15Mar2020?
  2. When I am trying to replace S-R trend analysis phases with some new change points following instructions at usage:phases, I am getting an error: ValueError: @start must be the same as/over 15Mar2020, but 27Feb2020 was applied. Am I missing something here?

Again, thanks a lot for your help. I am still waiting for getting health data from local health departments. Keep you updated with the developments, But definitely, I am getting vaccination information (%population received 1st and 2nd dose) and looking forward to incorporating vaccination in the reproduction estimation.

Dataset: date | province | confirmed | fatal | recovered 1/24/2020 | Cook | 1 | 0 | 0 1/25/2020 | Cook | 1 | 0 | 0 . . 2/18/2021 | Cook | 467986 | 9775 | 0 2/19/2021 | Cook | 468690 | 9790 | 0 2/20/2021 | Cook | 469379 | 9800 | 0

Code: country_data = cs.CountryData("cook county.csv", country="Cook") country_data.set_variables( date="date", confirmed="confirmed", recovered="recovered", fatal="fatal")

country_data.register_total() country_data.cleaned()

jhu_data_cook = cs.JHUData.from_dataframe(country_data.cleaned()) population_data_cook = cs.PopulationData() population_data_cook.update(5150233, country="Cook")

snl = cs.Scenario(country="Cook") snl.register(jhu_data_cook, population_data_cook) snl.records()

snl.trend(algo="Pelt-rbf").summary() snl.estimate(cs.SIRF)

S-R Phases: 0th phase (15Mar2020 - 24Apr2020): finished 281 trials in 0 min 9 sec 1st phase (25Apr2020 - 18May2020): finished 320 trials in 0 min 10 sec 2nd phase (19May2020 - 08Jul2020): finished 385 trials in 0 min 13 sec 3rd phase (09Jul2020 - 17Aug2020): finished 437 trials in 0 min 15 sec 4th phase (18Aug2020 - 23Sep2020): finished 321 trials in 0 min 10 sec 5th phase (24Sep2020 - 20Oct2020): finished 180 trials in 0 min 5 sec 6th phase (21Oct2020 - 04Nov2020): finished 239 trials in 0 min 7 sec 7th phase (05Nov2020 - 14Nov2020): finished 270 trials in 0 min 8 sec 8th phase (15Nov2020 - 24Nov2020): finished 389 trials in 0 min 13 sec 9th phase (25Nov2020 - 05Dec2020): finished 322 trials in 0 min 10 sec 12th phase (05Jan2021 - 20Jan2021): finished 181 trials in 0 min 5 sec 10th phase (06Dec2020 - 17Dec2020): finished 564 trials in 0 min 22 sec 11th phase (18Dec2020 - 04Jan2021): finished 348 trials in 0 min 11 sec 13th phase (21Jan2021 - 20Feb2021): finished 831 trials in 0 min 28 sec

For New Change Points: snl.clear(include_past=True).summary() snl.add(start_date="28Apr2020")

Error: ValueError: @start must be the same as/over 15Mar2020, but 27Feb2020 was applied.

lisphilar commented 3 years ago
  1. My dataset is starting from 24Jan2020 but my S-R trend first phases was starting from 15Mar2020?

Yes. We cannot apply S-R trend analysis to records with R = 0. This is because we cannot calculate parameter "a" of log(S(R)) = -a*R + logN when R is always 0 e.g. (R, S) = (0, 1000), (0, 999), (0, 998)...(0, 980).

So, all records with R = 0 should be ignored when S-R trend analysis. However, zeros in your dataset are not actual values. Because you did not have actual values of R, zeros were registered as I proposed.

When R values appears not to be actual values, covsirphy.Scenario complements R values via covsirphy.JHUDataComplementHandler class internally. We can check complemented R values with snl.records(variables="R"). This returns a dataframe with dates and complemented R values except for zeros. I guess, minimum date of the returned dataframe was 15Mar2020.

  1. When I am trying to replace S-R trend analysis phases with some new change points following instructions at usage:phases, I am getting an error: ValueError: @start must be the same as/over 15Mar2020, but 27Feb2020 was applied. Am I missing something here?

Please replace snl.add(start_date="28Apr2020") with snl.add(end_date="28Apr2020") to specify the end date, not start date, of the new phase.

AnujTiwari commented 3 years ago
  1. Yes - Fatal dataset is actually starting from 17th Mar 2020. Thanks for clearing out this to me.
  2. I tried snl.add(end_date="28Apr2020") also but I am receiving the same error.

code: country_data = cs.CountryData("cook county.csv", country="Cook") country_data.set_variables( date="date", confirmed="confirmed", recovered="recovered", fatal="fatal")

country_data.register_total() country_data.cleaned()

jhu_data_cook = cs.JHUData.from_dataframe(country_data.cleaned()) population_data_cook = cs.PopulationData() population_data_cook.update(5150233, country="Cook")

snl = cs.Scenario(country="Cook") snl.register(jhu_data_cook, population_data_cook) snl.records()

snl.trend(algo="Pelt-rbf").summary() snl.estimate(cs.SIRF)

snl.clear(include_past=True).summary() snl.add(end_date="28Apr2020") ValueError: @start must be the same as/over 15Mar2020, but 27Feb2020 was applied.

lisphilar commented 3 years ago

Hmm...This seems an internal error. The first date might not be changed correctly when zeros were removed.

  1. What is the output of snl.first_date?
  2. Does snl.timepoint(first_date="15Mar2020"); snl.add(end_date="28Apr2020") solve the problem?
AnujTiwari commented 3 years ago
  1. Output of snl.first_date - 27Feb2020 - I am confused with this date as confirmed cases data is starting from 24th Jan 2020 and fatal cases data is starting from 17th Mar 2020..

  2. snl.timepoints(first_date="15Mar2020"); snl.add(end_date="28Apr2020") is showing the error: ValueError: @start must be the same as/over 01Apr2020, but 15Mar2020 was applied.

AnujTiwari commented 3 years ago

Cook County Data.zip Please find the data for your reference.

lisphilar commented 3 years ago

Thank you for providing the file and I will try to find the cause.

lisphilar commented 3 years ago

With pull request #881, I updated codes to fix the problem. Could you retry your script with the latest development version 2.21.0-kappa?

Please replace the current stable version with the development version in your environent or try development version in new environment.

pip install --upgrade "git+https://github.com/lisphilar/covid19-sir.git#egg=covsirphy"
lisphilar commented 3 years ago

At the latest development verion 2.21.0-kappa-fu2, we can use local datasets with DataLoader class as follows. "recovered" column will not be required in the CSV file if you do not know the actual number of recovered cases.

import covsirphy as cs
# Create DataLoader instance
# Set update_interval=None to skip downloading datasets from remote servers 
loader = cs.DataLoader(update_interval=None)
# Read local CSV files: use pandas.read_csv() internally
loader.read_csv("input/cook county.csv", parse_dates=["date"], dayfirst=False)
# Check the dataframe
print(loader.local)
# Set necessary columns: country name and population
# with pandas.DataFrame.assign() intenally
loader.assign(country="US", population=5150233)
# Check the dataframe
print(loader.local)
# Lock data by specifing column names to use
loader.lock(date="date", country="country", province="province", confirmed="confirmed", fatal="fatal", population="population")
# Check locked data
print(loader.locked)
# Create JHUData instance
jhu_data = loader.jhu()
# Start scenario analysis
snl = cs.Scenario(country="US", province="Cook")
snl.register(jhu_data)
snl.records()

Kindly let me know if you need to read multiple CSV files and they have vaccination data.

AnujTiwari commented 3 years ago

Hi lisphilar, I am still getting the same error when I am trying to add a new change point using snl.add(end_date="28Apr2020") Dataset (an instance of dataset showing initial values): image

Code and Errror: error

I also checked with the standard code and it is working fine (below) but with the extract CSV file (local dataset) it is showing the same error. withouterror

As far as I can understand it is the problem in the 0th phase. Please chk

AnujTiwari commented 3 years ago

I request you to look into this strange thing happening in the code: strange

lisphilar commented 3 years ago

Thank you for your notice and I also confirmed there is something error with first date setting after full complement of the number of recovered cases. https://gist.github.com/lisphilar/1063a31147d50dff252e682673b1e600

I will investigate it and try to fix the problem.

lisphilar commented 3 years ago

By msitake, full-complement was done recurrently when recovered = 0 and timepoint settings were updated. With #884, I tried to fix this problem. The correct first date was 10Feb2020, not 15Mar2020. https://gist.github.com/lisphilar/8f05eee0249571a5a27b2af64b5b2ce2

Please use the latest development version 2.21.0-kappa-fu3.

AnujTiwari commented 3 years ago

That's great - Thanks lisphilar for the immediate fix and all the hard work.

AnujTiwari commented 3 years ago

Hi lisphilar, Is it possible to predict COVID-19 prevalence (symptomatic patients + asymptomatic patients) using the current model? Also, Today I computed R0 for Vietnam and when I compared my results (below) with the epiforecasts Reproduction number [https://epiforecasts.io/covid/posts/national/vietnam/] I found a significant difference between the outcomes of these models. May be you can better differentiate the modeling approaches here.
vietnam

lisphilar commented 3 years ago

They will be discussed with #887 and #888.

AnujTiwari commented 3 years ago

Sure, thanks

AnujTiwari commented 3 years ago

Oh Sorry, by mistake I clicked on the closed issue.

AnujTiwari commented 3 years ago

Hi Lisphilar,

Currently, I am using loader.read_csv("Cook_County.csv", parse_dates=["date"], dayfirst=False) for reading the individual csv files. As, I am fetching cases, fatal, recovered, and population data records from a master CSV file that has the entry for hundreds of counties I would like to replace Pandas dataframe with the CSV file above.

My DataFrame is same as the earlier CSV files I discussed above (cook county.csv). Thanks

lisphilar commented 3 years ago

No worries, thank you for reopening this.

lisphilar commented 3 years ago

Could you try the followings and share the output of loader.local?

loader.read_csv("cook_county.csv")
loader.read_csv("master CSV filename", how_combine="merge", how="left", on="province")
print(loader.local)

This merges dataframes with pandas.merge()` internally.

lisphilar commented 3 years ago

Oops, I might misunderstood the situation. Please share column names of the master file. The master file has county/date/confirmed/fatal/recovered/population for all counties and the indivisual files has country/date/confirmed? Is the difference between cook.csv and the master file whether populaton data is included or not?

(Updated)

AnujTiwari commented 3 years ago

No there is no difference between the master file and data frame. Data frame also has same attributes (county/date/confirmed/fatal/recovered/population).

lisphilar commented 3 years ago

If column name of population data is "population" and recovered data is "recovered", we can do as follows. We can skip loader.assign(population=5150233). Please replace "master file name" with the master filename,

import covsirphy as cs
loader = cs.DataLoader(update_interval=None)
loader.read_csv("master file name", parse_dates=["date"], dayfirst=False)
loader.assign(country="US")
loader.lock(date="date", country="country", province="province", confirmed="confirmed", fatal="fatal", recovered="recovered", population="population")
AnujTiwari commented 3 years ago

Sorry for the miscommunication..I thought you wanted to know about the structure of the DataFrame.

No, actually I am using multiple CSVs to fetch these parameters into the data frame. so I would like to just replace this read_CSV function with something like readDataFrame?

lisphilar commented 3 years ago

Thank you. OK, we have a large pandas.DataFrame which includes many counties' data.

We can call loader.read_csv(filename, how_combine="concat") multiple times to concat many CSV file data, but loader.read_dataframe() will be more useful as you metioned. Please wait for a while.

lisphilar commented 3 years ago

After merging #893, please update covsirphy to version 2.21.0-lambda-fu1 and please relace loader.read_csv("master file name", parse_dates=["date"], dayfirst=False) with loader.read_dataframe(dataframe, parse_dates=["date"], dayfirst=False) if the pandas.DataFrame is assigned to dataframe variable.

AnujTiwari commented 3 years ago

Great - I think it is one of the important updates that will be helpful for the generalisation of the library. Thanks, Lisphilar.. It is pleasure working with you ..

AnujTiwari commented 3 years ago

Hi Lisphilar,

Which trend analysis technique S-R trend is using? Is it possible to access the trend type, p-value, z-value, slope, and other parameters related to the trend results? Also, is it possible to collect the resultant parameters for homogeneity analysis?

lisphilar commented 3 years ago

This will be discussed in #894.

lisphilar commented 3 years ago

@AnujTiwari , Thank you for your continuous supports and discussion. I'm writing documentation of these new features (DataLoader.read_csv(), DataLoader.read_dataframe() and so on) at docs/markdown/INSTALLATION.md of this repository. "Data loading (To Be Released)" chapter will be used. After deployment, this will be shown at https://lisphilar.github.io/covid19-sir/markdown/INSTALLATION.html

Noe that "Dataset preparation" chapter will be integrated to the new chapter after the next stable release 2.22.0.

Please let me know (or create pull requests to edit it) if you have some advice.

AnujTiwari commented 3 years ago

thank you Lisphilar.. Sure I ll look into the Chapter and provide my inputs.

AnujTiwari commented 3 years ago

Hi Lisphilar,

Today I was executing the model for Barbour County and I found a "-" value in the output R0. So far I have implemented the model for less than 10 counties and this is the first time getting such strange reproduction no value.

Errror

Please find the Dataset: Barbour.zip

While if I run snl.trend() rather than manually adding the change points everything works fine.. Errror1

lisphilar commented 3 years ago

Please share the out put of snl.estimate_accuracy("1st"). This is a normal behavior when the records in the phase fit to SIR-F model in-sufficiently.

AnujTiwari commented 3 years ago

while executing snl.estimate_accuracy("1st") I am receiving error message: Please execute ODEHandler.add() in advance. I think I am missing Somthing.

lisphilar commented 3 years ago

Was it called after snl.add() and snl.estimate()?

AnujTiwari commented 3 years ago

Yes - It was called after snl.add() and snl.estimate()

lisphilar commented 3 years ago

Could you share the all script via gist or GitHub repository?

lisphilar commented 3 years ago

I tried codes with records as-of 18Jul2021. https://gist.github.com/lisphilar/a4396dc91a25af30ccda0155be2c0214

None value of Rt: The records in the phases fit to SIR-F model in-sufficiently and kappa was 0 in the two phases and sigma was 0 in the 1st phase. Because Rt is equal to "rho * (1 - theta) / (kappa + sigma)", None was returned. None was converted to "-" with snl.summary() and snl.get().

Error with "snl.estimate_accuracy(): I also confirmed the error when we run it withoutsnl.trend()`. When some parameter values are 0, this error may be raised. Could you create a new issue for this with "Request fixing a bug" template?

AnujTiwari commented 3 years ago

In case when the records in the phase fit to SIR-F model in-sufficient and R0 is None '-', what do you suggest for getting some significant R0 value for the county - whether I should increase the time period and chk for 21 days instead of 14 days or something else ?

Also I am a little concerned about the very high R0 values like 13.59 (for Barbour county) after adding a new record (19th Day records in case of Barbour county). It seems like a small increase in COVID-19 cases/fatal (cases increased by count 2 in Barbour county) is causing a major shift in R0 value.

lisphilar commented 3 years ago

Because the phase does not fit to the model, I recommend to shorten the phase e.g. 7 days.

However, actually I need the outputs of snl.estimate_accuracy() for confirmation and deciding the number of days. This method compares simulated number with estimated parameter values and actual number of cases.

Thank you for creating #895 and I will investigate it with high priority after work on 21Jul2021 JST.

AnujTiwari commented 3 years ago

thank you for the suggestion. Yes- it will be great if it is possible to compute 'minimum number of the days' phase (from the last day) for which records in the phase sufficiently fit to SIR-F model and provide a significant R0 value.. Thanks

AnujTiwari commented 3 years ago

there are many counties where the number of cases and deaths is comparatively less model throws an error. I request you to please test the model with the Wilcox county dataset. Similar to many other counties, I am not able to get the outcome for this county.

import pandas as pd loader = cs.DataLoader(update_interval=None) loader.read_csv("Wilcox1.csv", parse_dates=["date"], dayfirst=False)

print(loader.local)

loader.assign(country="US", province="Wilcox", population=10373)

print(loader.local) loader.lock(date="date", country="country", province="province", confirmed="confirmed", fatal="fatal", population="population") print(loader.locked)

jhu_data = loader.jhu()

snl = cs.Scenario(country="US", province="Wilcox") snl.register(jhu_data) snl.records()

snl.clear(include_past=True) snl.add(end_date="12Jul2021") snl.add(end_date=snl.today) snl.summary()

snl.estimate(cs.SIRF) snl.summary()

ValueError: When the targets have multiple columns, we cannot select RMSLE.

Wilcox1.zip