[Fix] recovered cases in China - _subset_by_area() records selection

Inglezos commented 3 years ago

Summary

For some reason, in the covid19dh.csv file, the recovered cases for China exist only in the province-level records; they are not accumulated in the "China, -" (country-level) records. The _subset_by_area() method selects only the "China, -" records when no province is specified. As a result, the recovered count for China is wrongly reported as zero and full complement is applied, even though the province records do hold the recovered cases.

Codes and outputs:

import covsirphy as cs
# Dataset preparation
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
# Scenario analysis
chn_scenario = cs.Scenario(jhu_data, population_data, "China")

Environment

CovsirPhy version: 2.13.3-iota
Python version: 3.8
Installation: Anaconda/pipenv
System: Windows

lisphilar commented 3 years ago

I tried.

df = jhu_data.cleaned()
sum_df = df.loc[(df["Country"] == "China") & (df["Province"] != "-")].groupby("Date").sum()
sum_df.tail()
cs.line_plot(sum_df, title="Total value of provinces in China", y_integer=True)

Date	Confirmed	Infected	Fatal	Recovered
2020/12/30	95876	1282	4781	89813
2020/12/31	95963	1258	4782	89923
2021/1/1	96023	1210	4782	90031
2021/1/2	96086	1203	4784	90099
2021/1/3	96086	1203	4784	90099

Figure_1

chn_scenario = cs.Scenario(jhu_data, population_data, "China")
chn_scenario.records(variables=["Confirmed", "Infected", "Fatal", "Recovered"]).tail()

Date	Confirmed	Infected	Fatal	Recovered
2020/12/30	96592	1614	4784	90194
2020/12/31	96673	1579	4788	90306
2021/1/1	96762	1567	4789	90406
2021/1/2	96829	1524	4790	90515
2021/1/3	96829	1428	4790	90611

Figure_1

lisphilar commented 3 years ago

Given the results above, I think we can use the total of the provinces in China for the recovered data in JHUData._cleaning(). Because the confirmed/fatal values are not identical between the first table and the second table, I recommend applying the values of the first table (sum of provinces) as the China country-level data.
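The idea can be sketched with plain pandas on toy data. This is only an illustration, not the code in JHUData._cleaning() or in the pull request: the helper name replace_country_with_province_total and the toy numbers are made up, and the column names are assumed to match those of jhu_data.cleaned() shown above. Infected is recomputed as Confirmed - Fatal - Recovered, which is consistent with the tables above.

```python
import pandas as pd


def replace_country_with_province_total(df, country):
    """Hypothetical helper: rebuild the country-level rows (Province == "-")
    of one country from the sum of its province-level rows."""
    is_country = df["Country"] == country
    provinces = df.loc[is_country & (df["Province"] != "-")]
    if provinces.empty:
        # No province-level records, so there is nothing to aggregate.
        return df.copy()
    # Sum the cumulative counts of all provinces for each date.
    total = provinces.groupby("Date", as_index=False)[
        ["Confirmed", "Fatal", "Recovered"]
    ].sum()
    total["Country"] = country
    total["Province"] = "-"
    total["Infected"] = total["Confirmed"] - total["Fatal"] - total["Recovered"]
    # Drop the original country-level rows and append the rebuilt ones.
    others = df.loc[~(is_country & (df["Province"] == "-"))]
    return pd.concat([others, total], ignore_index=True, sort=False)


# Toy data (made-up numbers): the country-level row has Recovered == 0.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01"] * 3),
    "Country": ["China"] * 3,
    "Province": ["-", "Hubei", "Beijing"],
    "Confirmed": [100, 60, 30],
    "Infected": [95, 5, 3],
    "Fatal": [5, 5, 1],
    "Recovered": [0, 50, 26],
})
fixed = replace_country_with_province_total(toy, "China")
country_row = fixed.loc[fixed["Province"] == "-"].iloc[0]
print(country_row[["Confirmed", "Infected", "Fatal", "Recovered"]])
```

With this approach the country-level confirmed/fatal values also come from the provinces, so all four variables stay consistent with each other.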

Inglezos commented 3 years ago

Yes I agree, the province data seem more correct and hold all the recovered cases information we need.

lisphilar commented 3 years ago

I created pull request #491. Please review it. However, full complement of recovery data is still performed with the China dataset. This may be a separate issue, but we may need to investigate it. (Could we divide up this work?)

Full complement is performed for many countries as follows.

import covsirphy as cs
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
df = jhu_data.show_complement()
print(df.loc[df["Full_recovered"]].Country.tolist())

['Andorra', 'United Arab Emirates', 'American Samoa', 'Antigua and Barbuda', 'Burundi', 'Benin', 'Bahrain', 'Belarus', 'Bermuda', 'Barbados', 'Brunei', 'Bhutan', 'Chile', "Cote d'Ivoire", 'Cameroon', 'Democratic Republic of the Congo', 'Colombia', 'Comoros', 'Cape Verde', 'Cuba', 'Germany', 'Djibouti', 'Dominica', 'Ecuador', 'Egypt', 'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia', 'Ghana', 'Gambia', 'Guinea-Bissau', 'Equatorial Guinea', 'Grand Princess', 'Grenada', 'Guam', 'Croatia', 'Iran', 'Iceland', 'Jordan', 'Kyrgyzstan', 'Cambodia', 'Saint Kitts and Nevis', 'Laos', 'Liechtenstein', 'Madagascar', 'Marshall Islands', 'Malta', 'Montenegro', 'Northern Mariana Islands', 'Mauritania', 'MS Zaandam', 'Mauritius', 'Malaysia', 'Namibia', 'Niger', 'Nicaragua', 'Netherlands', 'Norway', 'New Zealand', 'Pakistan', 'Peru', 'Papua New Guinea', 'Puerto Rico', 'Qatar', 'Saudi Arabia', 'Senegal', 'Singapore', 'Solomon Islands', 'San Marino', 'Serbia', 'South Sudan', 'Sao Tome and Principe', 'Suriname', 'Slovenia', 'Sweden', 'Swaziland', 'Seychelles', 'Chad', 'Togo', 'Thailand', 'Timor-Leste', 'Turkey', 'Taiwan', 'Uzbekistan', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa', 'Yemen', 'Zambia', 'Zimbabwe', 'China']

Inglezos commented 3 years ago

Sure, I will look into this too. I don't think France has full complement though, only partial (it just caught my eye).

lisphilar commented 3 years ago

Do you have the "COVID-19 Data Hub" data as of 31Dec2020 (or before)? This appears to be caused by irregular records in the raw dataset from Jan2021, and I found a related issue. https://github.com/covid19datahub/COVID19/issues/145

Inglezos commented 3 years ago

Yes I just realized the same problem with the actual dataset.

lisphilar commented 3 years ago

I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data. (We need not create a GitHub issue for this problem.)

lisphilar commented 3 years ago

I do not think the Singapore recovered data needs full complement. What do you think? Can we create a new issue for this problem? (Singapore, China)

country = "Singapore"
cs.line_plot(jhu_data.subset(country).set_index("Date"), f"Subset for {country} without complement")

Figure_1

Inglezos commented 3 years ago

No no, we need to revise the conditions. The problem is the 99% threshold and to identify when it is stopping

Inglezos commented 3 years ago

> I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data. (We need not create a GitHub issue for this problem.)

The France issue unfortunately remains:

Inglezos commented 3 years ago

I notified the covid19datahub team about this in covid19datahub/COVID19#145.

lisphilar commented 3 years ago

Thank you for notifying the team. This is also discussed in the original dataset repository: https://github.com/opencovid19-fr/data/issues/564

Inglezos commented 3 years ago

Yes, it seems that it depends on when we download the dataset. If the covid19datahub team has applied preprocessing first, then we are okay. Preferably, this should be handled by the original source, opencovid19-fr.

lisphilar commented 3 years ago

Shall we create a new issue for the threshold of full complement? While debugging with the China data, it was difficult to select a specific value as the threshold. Around June, Recovered is close to Confirmed - Fatal because the outbreak ended very quickly according to the dataset.
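To make the difficulty concrete, here is a rough sketch of a ratio-based check. It is not the actual condition used by CovsirPhy: the helper names recovered_ratio and exceeds_threshold are hypothetical, and the rule "Recovered / (Confirmed - Fatal) >= threshold" is only an assumption for illustration. The two rows are taken from the "sum of provinces" table earlier in this thread.

```python
import pandas as pd


def recovered_ratio(df):
    """Hypothetical helper: Recovered divided by (Confirmed - Fatal) per row.
    Rows with a non-positive denominator yield NaN."""
    denominator = df["Confirmed"] - df["Fatal"]
    denominator = denominator.where(denominator > 0)
    return df["Recovered"] / denominator


def exceeds_threshold(df, threshold=0.99):
    """Hypothetical helper: True where the ratio reaches the threshold.
    The 0.99 default mirrors the 99% value mentioned in this thread."""
    return recovered_ratio(df) >= threshold


# Two rows from the "sum of provinces" table above.
records = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-30", "2021-01-03"]),
    "Confirmed": [95876, 96086],
    "Fatal": [4781, 4784],
    "Recovered": [89813, 90099],
}).set_index("Date")

print(recovered_ratio(records).round(4))
print(exceeds_threshold(records, threshold=0.99))
print(exceeds_threshold(records, threshold=0.98))
```

For these rows the ratio falls between 0.98 and 0.99, so the outcome flips depending on which of the two thresholds is chosen. A legitimate dataset where the outbreak ended quickly can sit this close to any fixed cut-off.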

Inglezos commented 3 years ago

Yes we should. If you have some time please create a new issue, otherwise I will do that later.

lisphilar commented 3 years ago

We will move to #514 regarding the problem with full complement.
We will keep an eye on the France data in this issue (or create a new issue before release 2.15.0).

Inglezos commented 3 years ago

Yes, we will continue in #514. I will close this issue.

lisphilar / covid19-sir