[Fix] un-expected full complement of JHU data (e.g. China)

lisphilar commented 3 years ago

Summary

An unexpected full complement is applied to subsets of JHU data (e.g. China). The condition that triggers full complement needs to be revised.

Codes and outputs

Subset without complement

import covsirphy as cs
# Dataset preparation
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
# Without full complement (full complement seems unnecessary here)
cs.line_plot(
    jhu_data.subset("China").set_index("Date"), title="Subset for China without complement", y_integer=True)

Figure_1

Full complement is nevertheless applied to the recovered data of the China subset

# Applied complement
jhu_data.show_complement("China").iloc[:, 2:]

Monotonic_confirmed	Monotonic_fatal	Monotonic_recovered	Full_recovered	Partial_recovered
TRUE	FALSE	TRUE	TRUE	TRUE

Environment

CovsirPhy version: 2.14.0-delta
Python version: 3.8.5
Installation: poetry
System: WSL (Ubuntu)

lisphilar commented 3 years ago

At this time, full complement of recovered data is applied to the datasets of the following countries.

df = jhu_data.show_complement()
print(df.loc[df["Full_recovered"]].Country.tolist())

['Andorra', 'United Arab Emirates', 'American Samoa', 'Antigua and Barbuda', 'Burundi', 'Benin', 'Bahrain', 'Belarus', 'Bermuda', 'Barbados', 'Brunei', 'Bhutan', 'Central African Republic', 'Chile', "Cote d'Ivoire", 'Cameroon', 'Democratic Republic of the Congo', 'Colombia', 'Comoros', 'Cape Verde', 'Cuba', 'Germany', 'Djibouti', 'Dominica', 'Ecuador', 'Egypt', 'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia', 'Ghana', 'Gambia', 'Guinea-Bissau', 'Equatorial Guinea', 'Grand Princess', 'Grenada', 'Guam', 'Croatia', 'Iran', 'Iceland', 'Jordan', 'Kyrgyzstan', 'Cambodia', 'Saint Kitts and Nevis', 'Laos', 'Liechtenstein', 'Madagascar', 'Marshall Islands', 'Malta', 'Montenegro', 'Northern Mariana Islands', 'Mauritania', 'MS Zaandam', 'Mauritius', 'Malaysia', 'Namibia', 'Niger', 'Nicaragua', 'Netherlands', 'Norway', 'New Zealand', 'Pakistan', 'Peru', 'Papua New Guinea', 'Puerto Rico', 'Qatar', 'Saudi Arabia', 'Senegal', 'Singapore', 'Solomon Islands', 'San Marino', 'Serbia', 'South Sudan', 'Sao Tome and Principe', 'Suriname', 'Slovenia', 'Sweden', 'Swaziland', 'Seychelles', 'Chad', 'Togo', 'Thailand', 'Timor-Leste', 'Turkey', 'Taiwan', 'Uzbekistan', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa', 'Yemen', 'Zambia', 'Zimbabwe', 'China']

Inglezos commented 3 years ago

Created pull request #523.

Reworked the full complement conditions by adding a _validate_recovery_period() method to the JHUDataComplementHandler class. It determines whether the raw recovered data for a specific country is valid, and therefore whether full complement should be applied. A partial recovery period (elapsed interval in days) is considered invalid if it falls outside the range [7, 90] days, and the raw data is treated as invalid when more than 50% of these elapsed intervals are invalid.
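The rule described above can be sketched as follows. This is only an illustration of the stated condition, not the code from the pull request: the function name, its parameters and the handling of an empty input are assumptions made for the example.

```python
from typing import Sequence


def recovery_periods_look_valid(
    elapsed_days: Sequence[float],
    lower: int = 7,
    upper: int = 90,
    max_invalid_ratio: float = 0.5,
) -> bool:
    """Hypothetical helper: False when more than `max_invalid_ratio` of the
    elapsed intervals fall outside [lower, upper] days."""
    if not elapsed_days:
        # Assumption for this sketch: no intervals means nothing to trust.
        return False
    invalid = sum(1 for d in elapsed_days if not lower <= d <= upper)
    return invalid / len(elapsed_days) <= max_invalid_ratio


print(recovery_periods_look_valid([14, 17, 21, 3]))   # 1 of 4 outside the range
print(recovery_periods_look_valid([1, 2, 120, 17]))   # 3 of 4 outside the range
```

Under this reading, a country whose raw recovered data fails the check would be a candidate for full complement.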

Now the countries with full complement are only 24: ['American Samoa', 'Belgium', 'Germany', 'Dominica', 'Fiji', 'France', 'United Kingdom', 'Grand Princess', 'Guam', 'Saint Kitts and Nevis', 'Laos', 'Marshall Islands', 'Northern Mariana Islands', 'MS Zaandam', 'Netherlands', 'Puerto Rico', 'Solomon Islands', 'Sweden', 'Timor-Leste', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa'].

To sum up: China and Singapore are no longer fully complemented. The UK, Germany, the Netherlands and Sweden keep their full complement. France and Belgium will now be fully complemented, which I consider more correct: for example, the estimated Rt for France was very large (20-30) and now drops to normal values, and both countries have unexpectedly low raw recovered data compared to total cases.

This solution removes the 0.99 upper-bound condition, so Recovered can now approach Confirmed - Fatal without an upper limit (as the pandemic winds down), provided that this results in valid partial recovery periods.

Inglezos commented 3 years ago

@lisphilar So perhaps it would be good to include the same validity check for partial elapsed intervals in _calculate_recovery_period_country()? Otherwise the invalid ones are included in the final recovery period calculated in calculate_recovery_period(), which is not useful.

lisphilar commented 3 years ago

Thank you very much for your pull request! Yes, JHUData.calculate_recovery_period() should remove complemented records. How about replacing .filter(lambda x: x[self.R].sum() != 0) in line 421 with some code ~~using .show_complement() results~~? (Removed to avoid circular calls.)

I found one issue: the status output for full complement seems wrong. status does not appear to be synchronized with self.complement_dict. Do you have any ideas?

`jhu_data.show_complement("China")` returns

Country	Province	Monotonic_confirmed	Monotonic_fatal	Monotonic_recovered	Full_recovered	Partial_recovered
China	-	TRUE	FALSE	TRUE	FALSE	TRUE

However, jhu_data.records("China")[1] returns

'monotonic increasing complemented confirmed data and \nfully complemented recovered data'

Inglezos commented 3 years ago

Yes, this happens because the df passed from _recovered_full() to _validate_recovery_period() is modified inside _validate_recovery_period(): the diff column is added, and thus after_df.equals(before_df) is False. That's why I wanted to work on a copy inside _validate_recovery_period(), to avoid such inconsistencies. I will update this.
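A minimal, standalone illustration of this effect with pandas is shown below. The function names and the toy columns are hypothetical and are not taken from JHUDataComplementHandler; the point is only that adding a column inside a function changes the caller's DataFrame unless the function works on a copy.

```python
import pandas as pd


def add_diff_in_place(df: pd.DataFrame) -> pd.DataFrame:
    # Mutates the caller's object: the new column is visible outside.
    df["diff"] = df["Confirmed"] - df["Recovered"]
    return df


def add_diff_on_copy(df: pd.DataFrame) -> pd.DataFrame:
    # Works on a copy, so the caller's DataFrame is left untouched.
    df = df.copy()
    df["diff"] = df["Confirmed"] - df["Recovered"]
    return df


before_df = pd.DataFrame({"Confirmed": [10, 20], "Recovered": [1, 5]})

after_df = before_df.copy()
add_diff_on_copy(after_df)
print(after_df.equals(before_df))  # no extra column was added to after_df

add_diff_in_place(after_df)
print(after_df.equals(before_df))  # the "diff" column now makes them differ
```

Copying at the top of the helper keeps a before/after comparison with equals() meaningful, at the cost of one extra copy of the data.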

Inglezos commented 3 years ago

Regarding the calculate_recovery_period() issue, I thought of the following solution: in _calculate_recovery_period_country() we could validate the recovery period and, if it is invalid, return NaN or -1. Then in _calculate_recovery_period() we keep only the valid (or positive) values and apply int(pd.Series(periods).median()) to them alone.
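A rough sketch of that filtering step, using only the standard library, could look like this. The helper name and the choice to raise an error when nothing valid remains are assumptions for the example, not the behavior of the merged code.

```python
import math
import statistics
from typing import Iterable


def median_of_valid_periods(periods: Iterable[float]) -> int:
    """Hypothetical helper: median of per-country recovery periods,
    skipping NaN and non-positive sentinels such as -1."""
    valid = [p for p in periods if not math.isnan(p) and p > 0]
    if not valid:
        # Assumption: the caller decides on a fallback value.
        raise ValueError("no valid recovery period is available")
    return int(statistics.median(valid))


# Two countries were flagged as invalid (NaN and -1) and are ignored.
periods = [17, math.nan, 21, -1, 14]
print(median_of_valid_periods(periods))
```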

Inglezos commented 3 years ago

Implemented all the above with pull request #528.

Inglezos commented 3 years ago

@lisphilar Could you remind me of the logic behind _calculate_recovery_period_country()? Why does projecting R to match the future C-F lead to a recovery period calculation? Is it like saying that C-F people go only into R?

lisphilar commented 3 years ago

Thank you, I merged your pull request. Just for future reference, why df was inherited beyond method scope?

We are using the JHU dataset, which has only C/F/R, and we assume that all confirmed cases will reach an outcome (R or F) after the recovery period or fatality period has elapsed.

This means, if

F is always 0 (to simplify),
no new cases will be confirmed,
(C, R) = (100, 0) on 1st day, and
recovery period is 17 days,

(C, R) will be (C, R) = (0, 100) on 18th day.

We assume vice versa. If

F is always 0 (to simplify),
no new cases will be confirmed,
(C, R) = (100, 0) on 1st day, and
(C, R) = (0, 100) on 18th day,

recovery period will be calculated as 17 days.

Inglezos commented 3 years ago

Just for future reference, why df was inherited beyond method scope?

See here:
https://stackoverflow.com/questions/51391438/pandas-dataframe-as-an-argument-to-a-function-python
https://stackoverflow.com/questions/38895768/python-pandas-dataframe-is-it-pass-by-value-or-pass-by-reference

Regarding the recovery period logic: C, F and R constantly change. What we currently do is calculate C-F and try to match this value to some R in the future. So if, for example, on day 20 we have C=1000, F=100 and R=10, then C-F = 900. Let's say that R = 900 on day 40. What our current implementation implies is that those C-F=900 people of day 20 are the same as the R=900 people of day 40. In that case we can say that the recovery period of these people is indeed 40-20 = 20 days. But these R=900 people could originate from many different days, since C constantly increases and some of the cases die. For example, the R=900 people on day 40 could consist of 300 people from C-F on day 20, another 400 people from day 30 and another 200 people from day 35. How does this work, then? Is the trick to freeze the populations of C, F and R and study them as static, unchanging sets rather than as dynamic ones? Can this be proved mathematically with general equations?
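The matching idea described in this comment can be sketched in plain Python. This is a simplified illustration and not the covsirphy implementation: the function name is hypothetical, and the toy series are constructed only to reproduce the day 20 / day 40 numbers from the example.

```python
from typing import Optional, Sequence


def days_until_recovered_matches(
    confirmed: Sequence[int],
    fatal: Sequence[int],
    recovered: Sequence[int],
    day: int,
) -> Optional[int]:
    """Days from `day` until cumulative R first reaches C - F seen on `day`.

    Sequences are indexed by day (index 0 is day 0). Returns None when
    R never reaches the target within the data.
    """
    target = confirmed[day] - fatal[day]
    for later in range(day + 1, len(recovered)):
        if recovered[later] >= target:
            return later - day
    return None


# Toy series: on day 20, C=1000, F=100, R=10, and R reaches 900 on day 40.
days = range(41)
confirmed = [50 * d for d in days]                   # C = 1000 on day 20
fatal = [5 * d for d in days]                        # F = 100 on day 20
recovered = [10 if d < 40 else 900 for d in days]    # R jumps to 900 on day 40

print(days_until_recovered_matches(confirmed, fatal, recovered, day=20))
```

The sketch treats the C-F value of one day as a single cohort, which is exactly the simplification questioned here: it ignores that the recovered cases on day 40 may come from several different confirmation days.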

lisphilar commented 3 years ago

Thank you for the references for immutable arguments.

Regarding recovery period: Yes, the current implementation is not an exact one. "Another 400 people from day 30 and another 200 people from day 35" are ignored, and we use the average value to get a statistically correct result. We cannot get information on the dynamics from four-variable observations, and this is a "better" solution at this time. We tried to use linelist data to get a precise value of the recovery period (the average of recovery date minus confirmation date over cases), but the number of records was too small.

lisphilar commented 3 years ago

Dear @Inglezos, as a new issue, could you summarize the outline of complement and recovery period and document them?

example/usage_dataset.ipynb, https://lisphilar.github.io/covid19-sir/usage_dataset.html#The-number-of-cases-(JHU-style)
docs/markdown/TERM.md, https://lisphilar.github.io/covid19-sir/TERM.html

Inglezos commented 3 years ago

So it is correct to assume temporarily invariable compartments for the analysis and to match C-F to future R to extract an approximation of the recovery period, right? Yes, I will summarize them, but some time during the week if this is not urgent :)

lisphilar commented 3 years ago

Yes, we assume that as you commented. Thank you, we will move forward to #531. This is not an urgent issue, but it would be great if we can close the issue before the next release (planned on 17Jan2021).

lisphilar / covid19-sir