lisphilar / covid19-sir

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
https://lisphilar.github.io/covid19-sir/
Apache License 2.0

[Fix] recovered cases in China - _subset_by_area() records selection #484

Closed Inglezos closed 3 years ago

Inglezos commented 3 years ago

Summary

For some reason, in the covid19dh.csv file, recovered cases for China exist only in the province-level records; they are not accumulated in the "China, -" (country-level) records. The _subset_by_area() method selects only the "China, -" records when no province is specified. As a result, recovered cases for China appear to be zero and full complement is applied, even though the province records do hold the recovered-case information.

Codes and outputs:

import covsirphy as cs
# Dataset preparation
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
# Scenario analysis
chn_scenario = cs.Scenario(jhu_data, population_data, "China")

Environment

lisphilar commented 3 years ago

I tried.

df = jhu_data.cleaned()
sum_df = df.loc[(df["Country"] == "China") & (df["Province"] != "-")].groupby("Date").sum()
sum_df.tail()
cs.line_plot(sum_df, title="Total value of provinces in China", y_integer=True)
Date Confirmed Infected Fatal Recovered
2020/12/30 95876 1282 4781 89813
2020/12/31 95963 1258 4782 89923
2021/1/1 96023 1210 4782 90031
2021/1/2 96086 1203 4784 90099
2021/1/3 96086 1203 4784 90099

Figure_1

chn_scenario = cs.Scenario(jhu_data, population_data, "China")
chn_scenario.records(variables=["Confirmed", "Infected", "Fatal", "Recovered"]).tail()
Date Confirmed Infected Fatal Recovered
2020/12/30 96592 1614 4784 90194
2020/12/31 96673 1579 4788 90306
2021/1/1 96762 1567 4789 90406
2021/1/2 96829 1524 4790 90515
2021/1/3 96829 1428 4790 90611

Figure_1

lisphilar commented 3 years ago

Given the results above, I think we can use the total of the provinces in China for recovered data in JHUData._cleaning(). Because the confirmed/fatal values are not identical between the first table and the second, I recommend applying the values of the first table (sum of provinces) as the China country-level data.
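
A minimal sketch of that idea, using plain pandas on made-up numbers. The helper name replace_with_province_total is hypothetical and this is not the code in JHUData._cleaning(); it only illustrates rebuilding the "China, -" rows from the province rows, with Infected taken as Confirmed - Fatal - Recovered (which matches the tables above).

```python
import pandas as pd

# Toy records; all values are invented for illustration only.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-01-01"] * 3 + ["2021-01-02"] * 3),
        "Country": ["China"] * 6,
        "Province": ["-", "Hubei", "Beijing"] * 2,
        "Confirmed": [110, 80, 20, 115, 82, 22],
        "Fatal": [6, 4, 1, 6, 4, 1],
        "Recovered": [0, 70, 15, 0, 72, 17],  # country-level rows hold 0
    }
)
df["Infected"] = df["Confirmed"] - df["Fatal"] - df["Recovered"]


def replace_with_province_total(df, country):
    """Hypothetical helper: rebuild country-level rows from province sums."""
    is_country = df["Country"] == country
    is_total_row = is_country & (df["Province"] == "-")
    provinces = df.loc[is_country & (df["Province"] != "-")]
    # Sum the cumulative counts of all provinces for each date
    total = provinces.groupby("Date", as_index=False)[
        ["Confirmed", "Fatal", "Recovered"]
    ].sum()
    total["Country"] = country
    total["Province"] = "-"
    total["Infected"] = total["Confirmed"] - total["Fatal"] - total["Recovered"]
    # Drop the original country-level rows and append the rebuilt ones
    return pd.concat([df.loc[~is_total_row], total], ignore_index=True)


result = replace_with_province_total(df, "China")
print(result.loc[result["Province"] == "-"].sort_values("Date"))
```

With this toy data the rebuilt country-level rows carry non-zero Recovered values, so a later check for missing recovery data would see the province information.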

Inglezos commented 3 years ago

Yes, I agree. The province data seem more correct and hold all the recovered-case information we need.

lisphilar commented 3 years ago

I created pull request #491. Please review it. However, full complement of recovery data is still performed with the China dataset. This may be a separate issue, but we may need to investigate it. (Could we divide up this work?)

Full complement is performed for many countries as follows.

import covsirphy as cs
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
df = jhu_data.show_complement()
print(df.loc[df["Full_recovered"]].Country.tolist())

['Andorra', 'United Arab Emirates', 'American Samoa', 'Antigua and Barbuda', 'Burundi', 'Benin', 'Bahrain', 'Belarus', 'Bermuda', 'Barbados', 'Brunei', 'Bhutan', 'Chile', "Cote d'Ivoire", 'Cameroon', 'Democratic Republic of the Congo', 'Colombia', 'Comoros', 'Cape Verde', 'Cuba', 'Germany', 'Djibouti', 'Dominica', 'Ecuador', 'Egypt', 'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia', 'Ghana', 'Gambia', 'Guinea-Bissau', 'Equatorial Guinea', 'Grand Princess', 'Grenada', 'Guam', 'Croatia', 'Iran', 'Iceland', 'Jordan', 'Kyrgyzstan', 'Cambodia', 'Saint Kitts and Nevis', 'Laos', 'Liechtenstein', 'Madagascar', 'Marshall Islands', 'Malta', 'Montenegro', 'Northern Mariana Islands', 'Mauritania', 'MS Zaandam', 'Mauritius', 'Malaysia', 'Namibia', 'Niger', 'Nicaragua', 'Netherlands', 'Norway', 'New Zealand', 'Pakistan', 'Peru', 'Papua New Guinea', 'Puerto Rico', 'Qatar', 'Saudi Arabia', 'Senegal', 'Singapore', 'Solomon Islands', 'San Marino', 'Serbia', 'South Sudan', 'Sao Tome and Principe', 'Suriname', 'Slovenia', 'Sweden', 'Swaziland', 'Seychelles', 'Chad', 'Togo', 'Thailand', 'Timor-Leste', 'Turkey', 'Taiwan', 'Uzbekistan', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa', 'Yemen', 'Zambia', 'Zimbabwe', 'China']

Inglezos commented 3 years ago

Sure, I will look into this too. I don't think France has full complement though, only partial (it just caught my eye).

lisphilar commented 3 years ago

Do you have the "COVID-19 Data Hub" dataset as of 31Dec2020 (or before)? This appears to be caused by irregular records in the raw dataset from Jan2021, and I found a related issue. https://github.com/covid19datahub/COVID19/issues/145

Inglezos commented 3 years ago

Yes I just realized the same problem with the actual dataset.

lisphilar commented 3 years ago

I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data. (We need not create a GitHub issue for this problem.)

lisphilar commented 3 years ago

I do not think the Singapore recovered data needs full complement. What do you think? Can we create a new issue for this problem? (Singapore, China)

country = "Singapore"
cs.line_plot(jhu_data.subset(country).set_index("Date"), f"Subset for {country} without complement")

Figure_1

Inglezos commented 3 years ago

No, we need to revise the conditions. The problem is the 99% threshold and identifying when it stops applying.

Inglezos commented 3 years ago

> I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data. (We need not create a GitHub issue for this problem.)

The France issue unfortunately remains: image

Inglezos commented 3 years ago

I notified the covid19datahub team about this in covid19datahub/COVID19#145.

lisphilar commented 3 years ago

Thank you for notifying the team. This is also discussed in the original dataset repository. https://github.com/opencovid19-fr/data/issues/564

Inglezos commented 3 years ago

Yes, it seems to depend on when we download the dataset. If the covid19datahub team has already applied preprocessing, we are okay. Preferably, this should be handled by the original source, opencovid19-fr.

lisphilar commented 3 years ago

Shall we create a new issue for the threshold of full complement? While debugging with the China data, it was difficult to select a specific value as the threshold. Around June, Recovered is close to Confirmed - Fatal because the outbreak ended very quickly according to the dataset.
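
To illustrate why a single cut-off is hard to pick, here is a rough sketch on invented numbers. The recovered_ratio helper is hypothetical, and the way the 0.99 value is applied here is an assumption based only on the "99% threshold" mentioned in this thread, not the actual CovsirPhy condition.

```python
import pandas as pd


def recovered_ratio(df):
    """Hypothetical helper: Recovered / (Confirmed - Fatal) for each row."""
    denominator = (df["Confirmed"] - df["Fatal"]).astype(float)
    # Rows with a non-positive denominator become NaN instead of raising
    return df["Recovered"] / denominator.where(denominator > 0)


# Invented cumulative values: the outbreak almost ends around June,
# then cases rise again later.
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-03-01", "2020-06-01", "2021-01-03"]),
        "Confirmed": [1000, 1000, 1200],
        "Fatal": [50, 50, 60],
        "Recovered": [400, 945, 1000],
    }
)

THRESHOLD = 0.99  # assumed reading of the "99% threshold" in this thread
records["Ratio"] = recovered_ratio(records)
records["Above_threshold"] = records["Ratio"] >= THRESHOLD
print(records)
```

In this toy series only the June row exceeds the threshold, although its data is plausible: the ratio legitimately approaches 1 when an outbreak ends quickly, so a fixed value alone cannot separate that case from broken recovery records.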

Inglezos commented 3 years ago

Yes we should. If you have some time please create a new issue, otherwise I will do that later.

lisphilar commented 3 years ago
Inglezos commented 3 years ago

Yes, we will continue in #514. I will close this issue.