Closed lisphilar closed 3 years ago
At this time, full complement of recovered data is applied to the datasets of the following countries.
df = jhu_data.show_complement()
print(df.loc[df["Full_recovered"]].Country.tolist())
['Andorra', 'United Arab Emirates', 'American Samoa', 'Antigua and Barbuda', 'Burundi', 'Benin', 'Bahrain', 'Belarus', 'Bermuda', 'Barbados', 'Brunei', 'Bhutan', 'Central African Republic', 'Chile', "Cote d'Ivoire", 'Cameroon', 'Democratic Republic of the Congo', 'Colombia', 'Comoros', 'Cape Verde', 'Cuba', 'Germany', 'Djibouti', 'Dominica', 'Ecuador', 'Egypt', 'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia', 'Ghana', 'Gambia', 'Guinea-Bissau', 'Equatorial Guinea', 'Grand Princess', 'Grenada', 'Guam', 'Croatia', 'Iran', 'Iceland', 'Jordan', 'Kyrgyzstan', 'Cambodia', 'Saint Kitts and Nevis', 'Laos', 'Liechtenstein', 'Madagascar', 'Marshall Islands', 'Malta', 'Montenegro', 'Northern Mariana Islands', 'Mauritania', 'MS Zaandam', 'Mauritius', 'Malaysia', 'Namibia', 'Niger', 'Nicaragua', 'Netherlands', 'Norway', 'New Zealand', 'Pakistan', 'Peru', 'Papua New Guinea', 'Puerto Rico', 'Qatar', 'Saudi Arabia', 'Senegal', 'Singapore', 'Solomon Islands', 'San Marino', 'Serbia', 'South Sudan', 'Sao Tome and Principe', 'Suriname', 'Slovenia', 'Sweden', 'Swaziland', 'Seychelles', 'Chad', 'Togo', 'Thailand', 'Timor-Leste', 'Turkey', 'Taiwan', 'Uzbekistan', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa', 'Yemen', 'Zambia', 'Zimbabwe', 'China']
Created pull request #523.
Reworked the full complement conditions by adding a _validate_recovery_period() method to the JHUDataComplementHandler class. It determines whether the raw recovered data for a specific country is valid, and therefore whether full complement should be applied. A partial recovery period (elapsed interval in days) is treated as invalid if it falls outside the range [7, 90] days, and the raw data is considered invalid when more than 50% of these elapsed intervals are invalid.
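The rule described above can be sketched as follows. This is a minimal illustration, not the code from the pull request: the function name `is_recovery_period_valid`, its parameters, and the handling of an empty input are all assumptions made for the example.

```python
def is_recovery_period_valid(intervals, lower=7, upper=90, max_invalid_ratio=0.5):
    """Return True if the elapsed intervals (in days) look like plausible recovery periods.

    Hypothetical helper: an interval is invalid when it lies outside [lower, upper],
    and the whole set is rejected when more than max_invalid_ratio of them are invalid.
    """
    if not intervals:
        # Assumption for this sketch: no intervals means nothing can be validated.
        return False
    invalid = [days for days in intervals if days < lower or days > upper]
    return len(invalid) / len(intervals) <= max_invalid_ratio


mostly_valid = is_recovery_period_valid([10, 14, 21, 3])       # 1 of 4 outside the range
mostly_invalid = is_recovery_period_valid([1, 2, 3, 120, 15])  # 4 of 5 outside the range
print(mostly_valid, mostly_invalid)
```

With this reading of the rule, a dataset where exactly half of the intervals are out of range still counts as valid, because the condition is "more than 50%".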
Now the countries with full complement are only 24: ['American Samoa', 'Belgium', 'Germany', 'Dominica', 'Fiji', 'France', 'United Kingdom', 'Grand Princess', 'Guam', 'Saint Kitts and Nevis', 'Laos', 'Marshall Islands', 'Northern Mariana Islands', 'MS Zaandam', 'Netherlands', 'Puerto Rico', 'Solomon Islands', 'Sweden', 'Timor-Leste', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa'].
To sum up: China and Singapore are no longer fully complemented. The UK, Germany, the Netherlands and Sweden keep their full complement. France and Belgium will now be fully complemented, which I consider more correct: for example, the estimated Rt for France was very large (20-30) and now drops to normal values, and both countries have unexpectedly low raw recovered data compared to total cases.
This solution removes the 0.99 upper-bound condition, so Recovered can now approach Confirmed - Fatal without an upper limit (when the pandemic is ending), provided that this results in valid partial recovery periods.
@lisphilar So perhaps it would now be good to include the same validity check for partial elapsed intervals in _calculate_recovery_period_country()? Otherwise there is no point in including the invalid ones in the final recovery period calculated in calculate_recovery_period().
Thank you very much for your pull request!
Yes, JHUData.calculate_recovery_period() should remove complemented records.
How about replacing .filter(lambda x: x[self.R].sum() != 0) in line 421 with some code using .show_complement() results? (Removed to avoid circular calling.)
I found one issue: the status output for full complement seems to be wrong. status does not appear to be synchronized with self.complement_dict. Do you have any ideas?
jhu_data.show_complement("China") returns

Country | Province | Monotonic_confirmed | Monotonic_fatal | Monotonic_recovered | Full_recovered | Partial_recovered
---|---|---|---|---|---|---
China | - | TRUE | FALSE | TRUE | FALSE | TRUE
However, jhu_data.records("China")[1] returns 'monotonic increasing complemented confirmed data and \nfully complemented recovered data'
Yes, this happens because the df passed from _recovered_full() to _validate_recovery_period() is modified inside _validate_recovery_period(): the diff column is added to it, and thus after_df.equals(before_df) is False. That's why I wanted to work on a copy inside _validate_recovery_period(), to avoid such inconsistencies. I will update this.
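The side effect described above can be reproduced in isolation. The sketch below is only an illustration with made-up data: the helper names `add_diff_in_place` and `add_diff_on_copy` and the column name `Recovered` are assumptions, not the methods of JHUDataComplementHandler.

```python
import pandas as pd


def add_diff_in_place(df):
    # Adds a column to the object the caller passed in, so the change leaks out.
    df["diff"] = df["Recovered"].diff()
    return df


def add_diff_on_copy(df):
    # Rebinds the local name to a copy first, so the caller's DataFrame is untouched.
    df = df.copy()
    df["diff"] = df["Recovered"].diff()
    return df


before_df = pd.DataFrame({"Recovered": [0, 5, 12]})
after_df = before_df.copy()

add_diff_on_copy(after_df)
unchanged_with_copy = after_df.equals(before_df)

add_diff_in_place(after_df)
unchanged_in_place = after_df.equals(before_df)

print(unchanged_with_copy, unchanged_in_place)
```

Copying at the top of the helper keeps a before/after comparison with equals() meaningful, at the cost of one extra copy of the data.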
Regarding the calculate_recovery_period() issue, I thought of the following solution: in _calculate_recovery_period_country() we could validate the recovery period and, if it is not valid, return NaN or -1. Then in _calculate_recovery_period() we keep only the valid (positive) values and apply int(pd.Series(periods).median()) only to them.
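A minimal sketch of that filtering step, assuming invalid countries are marked with NaN or -1 as proposed. The function name `median_valid_period` is hypothetical, the standard library's `statistics.median` stands in for `pd.Series(periods).median()`, and returning None when nothing is valid is an assumption of this example.

```python
import math
from statistics import median


def median_valid_period(periods):
    """Median of the per-country recovery periods, ignoring NaN and non-positive markers."""
    valid = [p for p in periods if not math.isnan(p) and p > 0]
    if not valid:
        # Assumption for this sketch: signal "no usable value" with None.
        return None
    return int(median(valid))


# Example values only: two countries are flagged invalid (NaN and -1).
periods = [17, float("nan"), -1, 21, 14, 20]
result = median_valid_period(periods)
print(result)
```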
Implemented all the above with pull request #528.
@lisphilar Could you remind me of the logic behind _calculate_recovery_period_country()? Why does projecting R to match the future C-F lead to a recovery period calculation? It is as if you are saying that the C-F people can only go into R.
Thank you, I merged your pull request. Just for future reference, why was df inherited beyond the method scope?
We are using the JHU dataset, which has only C/F/R, and we assume that all confirmed cases will reach an outcome (R/F) after the recovery period or fatality period has elapsed.
This means that if (C, R) = (100, 0) on the 1st day, then (C, R) will be (0, 100) on the 18th day.
We assume the converse as well: if (C, R) = (100, 0) on the 1st day and (C, R) = (0, 100) on the 18th day, the recovery period is calculated as 17 days.
Just for future reference, why was df inherited beyond the method scope?
See here:
https://stackoverflow.com/questions/51391438/pandas-dataframe-as-an-argument-to-a-function-python
https://stackoverflow.com/questions/38895768/python-pandas-dataframe-is-it-pass-by-value-or-pass-by-reference
Regarding the recovery period logic: C, F and R change constantly. What we currently do is calculate C-F and try to match this value to some R in the future. For example, if on day 20 we have C=1000, F=100 and R=10, then C-F = 900. Let's say that R = 900 on day 40. Our current implementation then implies that the C-F=900 people of day 20 are the same as the R=900 people of day 40, in which case the recovery period of these people is indeed 40-20 = 20 days. However, these R=900 people could originate from many different days, since C constantly increases and part of them die. For example, the R=900 people on day 40 could consist of 300 people from the C-F of day 20, another 400 people from day 30 and another 200 people from day 35. How does this work then? Is the trick to freeze the populations of C, F and R and study them as static, unchanging sets rather than dynamic ones? Can this be proved mathematically with general equations?
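The matching idea discussed here can be written down as a small sketch. It is a simplified illustration with invented numbers, not the body of _calculate_recovery_period_country(): the function name `elapsed_days_to_match` and the rule "first later day on which R reaches C-F" are assumptions made for the example.

```python
def elapsed_days_to_match(confirmed, fatal, recovered):
    """For each day, count the days until cumulative R first reaches that day's C - F."""
    elapsed = []
    for day, (c, f) in enumerate(zip(confirmed, fatal)):
        target = c - f
        if target <= 0:
            continue  # nothing to match on this day
        for later in range(day, len(recovered)):
            if recovered[later] >= target:
                elapsed.append(later - day)
                break  # days whose target is never reached are simply skipped
    return elapsed


# Invented cumulative series for seven days.
confirmed = [10, 21, 32, 32, 32, 32, 32]
fatal = [0, 1, 2, 2, 2, 2, 2]
recovered = [0, 0, 0, 10, 20, 30, 30]

intervals = elapsed_days_to_match(confirmed, fatal, recovered)
print(intervals)
```

In this toy series the early days all give 3, while the last days give shrinking values once C stops growing, which is the kind of short interval a validity range on elapsed days would filter out.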
Thank you for the references on how DataFrame arguments are passed.
Regarding the recovery period: yes, the current implementation is not an exact one. "Another 400 people from day 30 and another 200 people from day 35" are ignored, and we use the average value to get a statistically reasonable result. We cannot get information on the dynamics from observations of four variables, and this is the "better" solution at this time. We tried to use linelist data to get a precise value of the recovery period (the average of recovery date minus confirmation date over cases), but the number of records was too small.
Dear @Inglezos, as a new issue, could you summarize the concepts of complement and recovery period and document them?
So it is correct to assume temporarily invariable compartments for the analysis and to match C-F to a future R to extract an approximation of the recovery period, right? Yes, I will summarize them, but some time during the week if this is not urgent :)
Yes, we assume that, as you commented. Thank you, we will move forward to #531. This is not an urgent issue, but it would be great if we could close the issue before the next release (planned for 17Jan2021).
Summary
Unexpected full complement is applied to subsets of JHU data (e.g. China). The condition for full complement needs to be revised.
Codes and outputs
Subset without complement
Full complement will be done for subset recovery data in China
Environment