ZellW opened this issue 3 years ago (status: Open)
Hi Cliff,
Many thanks for reporting this! I believe this is a case of misreported data. Unfortunately, this is not the first and, I'm afraid, not the last case of misreported data. Take a look at the command attached below: it slices the cases where "Active" cases are reported as negative values, which of course should never happen! You will notice several entries, including the "Recovered, US" one you found. Now look at the actual data as reported by JHU/CSSEGIS (line #3621):
I have seen different variations of this type of issue, which is why I added a few functions to run integrity/consistency checks (data.checks, consistency.check, integrity.check). I decided not to "tamper" with the data, but instead to let the user know and make them aware of the problem. Please let me know if there is anything else you find. I can think of alternatives to mitigate this, such as applying a "floor" to values that should never be negative (e.g. cumulative quantities), controlled by an optional argument, for instance.
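To make the "floor" idea concrete, here is a minimal base R sketch. It assumes the floor is an opt-in step applied to named columns; `floor_columns`, `floor.value` and `apply.floor` are hypothetical names chosen for illustration and are not part of covid19.analytics. The toy values are copied from three of the rows reported in this thread.

```r
# Toy data frame standing in for a slice of the aggregated data.
toy <- data.frame(
  Combined_Key = c("Cantabria, Spain", "Recovered, US", "Unknown, Chile"),
  Confirmed    = c(2364, 0, 0),
  Active       = c(-139, -847695, -109),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT a covid19.analytics function): clamp the chosen
# columns at `floor.value`, but only when the caller explicitly opts in.
floor_columns <- function(df, cols, floor.value = 0, apply.floor = FALSE) {
  if (!apply.floor) return(df)                 # default: data left untouched
  for (col in cols) {
    df[[col]] <- pmax(df[[col]], floor.value)  # NA entries remain NA
  }
  df
}

floored <- floor_columns(toy, cols = "Active", apply.floor = TRUE)
print(floored)
```

Keeping the floor off by default matches the stance above of not tampering with the data unless the user asks for it.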
      FIPS     Admin2 Province_State Country_Region         Last_Update     Lat
NA      NA       <NA>           <NA>           <NA>                <NA>      NA
2797 90018 Unassigned        Indiana             US 2020-07-05 04:33:46      NA
NA.1    NA       <NA>           <NA>           <NA>                <NA>      NA
2801 90022 Unassigned      Louisiana             US 2020-07-05 04:33:46      NA
2803 90024 Unassigned       Maryland             US 2020-07-05 04:33:46      NA
2833 90056 Unassigned        Wyoming             US 2020-07-05 04:33:46      NA
3196    NA                 Cantabria          Spain 2020-07-05 04:33:46 43.1828
3207    NA                     Ceuta          Spain 2020-07-05 04:33:46 35.8894
NA.2    NA       <NA>           <NA>           <NA>                <NA>      NA
3246    NA               Extremadura          Spain 2020-07-05 04:33:46 39.4937
3256    NA                   Galicia          Spain 2020-07-05 04:33:46 42.5751
NA.3    NA       <NA>           <NA>           <NA>                <NA>      NA
3407    NA                    Murcia          Spain 2020-07-05 04:33:46 37.9922
3455    NA                Pais Vasco          Spain 2020-07-05 04:33:46 42.9896
3486    NA                 Recovered             US 2020-07-05 04:33:46      NA
3584    NA                   Unknown           Peru 2020-07-05 04:33:46      NA
3798    NA                   Unknown          Chile 2020-07-03 15:33:50      NA
       Long_ Confirmed Deaths Recovered  Active              Combined_Key
NA        NA        NA     NA        NA      NA                      <NA>
2797      NA         0    193         0    -193   Unassigned, Indiana, US
NA.1      NA        NA     NA        NA      NA                      <NA>
2801      NA         7    108         0    -101 Unassigned, Louisiana, US
2803      NA         0     23         0     -23  Unassigned, Maryland, US
2833      NA         0     19         0     -19   Unassigned, Wyoming, US
3196 -3.9878      2364    216      2287    -139          Cantabria, Spain
3207 -5.3213       163      4       163      -4              Ceuta, Spain
NA.2      NA        NA     NA        NA      NA                      <NA>
3246 -6.0679      3047    519      2652    -124        Extremadura, Spain
3256 -8.1339      9251    619      9204    -572            Galicia, Spain
NA.3      NA        NA     NA        NA      NA                      <NA>
3407 -1.1307      1697    148      2180    -631             Murcia, Spain
3455 -2.6189     13809   1561     16160   -3912         Pais Vasco, Spain
3486      NA         0      0    894325 -847695             Recovered, US
3584      NA         0      0    189621 -189621             Unknown, Peru
3798      NA         0      0       109    -109            Unknown, Chile
     Incidence_Rate Case.Fatality_Ratio
NA               NA                  NA
2797             NA                  NA
NA.1             NA                  NA
2801             NA         1542.857143
2803             NA                  NA
2833             NA                  NA
3196       406.4363            9.137056
3207       192.1513            2.453988
NA.2             NA                  NA
3246       285.9894           17.033147
3256       342.5737            6.691169
NA.3             NA                  NA
3407       114.0715            8.721273
3455       634.0570           11.304222
3486             NA                  NA
3584             NA                  NA
3798             NA                  NA
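The all-NA rows in the output above (NA, NA.1, NA.2, NA.3) are what base R produces when a data frame is indexed with a logical vector that contains NA. That suggests, though the thread does not show the command, that the slice was a plain `Active < 0` comparison. The sketch below reproduces the effect on a toy data frame and shows how `which()` avoids it. It also checks the relation Active = Confirmed - Deaths - Recovered, which every listed row satisfies except "Recovered, US". The rows labelled "Made-up" are invented for illustration; the other values come from the output above.

```r
toy <- data.frame(
  Combined_Key = c("Unassigned, Indiana, US", "Made-up row, Active missing",
                   "Recovered, US", "Made-up consistent row"),
  Confirmed = c(0, 10, 0, 100),
  Deaths    = c(193, 1, 0, 5),
  Recovered = c(0, 2, 894325, 20),
  Active    = c(-193, NA, -847695, 75),
  stringsAsFactors = FALSE
)

# Logical indexing: an NA in the condition yields an all-NA row.
with_na_rows <- toy[toy$Active < 0, ]

# which() keeps only the positions where the condition is TRUE.
negative_only <- toy[which(toy$Active < 0), ]

# Rows where Active disagrees with Confirmed - Deaths - Recovered.
expected     <- toy$Confirmed - toy$Deaths - toy$Recovered
inconsistent <- toy[which(toy$Active != expected), ]

print(with_na_rows)
print(negative_only)
print(inconsistent)
```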
Best, Marcelo
Added new features to the integrity.check and consistency.check functions, including a new argument "disclose" which will make the functions report the entries where problems have been detected; e.g. negative cumulative quantities or negative values in reported number of cases.
Added a new function nullify.data() which will remove the "spurious" cases.
So far these changes are available on the development version.
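For readers who want to see the general idea without installing the development version, the following base R sketch illustrates reporting and then removing rows with negative counts. It is not the package's implementation: `has_negative`, `check_negatives` and `drop_negatives` are hypothetical helpers, the `disclose` argument only mirrors the name mentioned above, and "remove" is assumed here to mean dropping the whole row. The "Made-up clean row" is invented; the other two rows come from the output earlier in this thread.

```r
toy <- data.frame(
  Combined_Key = c("Galicia, Spain", "Unknown, Peru", "Made-up clean row"),
  Confirmed = c(9251, 0, 500),
  Recovered = c(9204, 189621, 100),
  Active    = c(-572, -189621, 380),
  stringsAsFactors = FALSE
)

# TRUE for every row holding a negative value in any of the checked columns.
has_negative <- function(df, cols) {
  rowSums(df[, cols, drop = FALSE] < 0, na.rm = TRUE) > 0
}

# Report how many rows are affected; with disclose = TRUE also print them.
check_negatives <- function(df, cols, disclose = FALSE) {
  bad <- has_negative(df, cols)
  if (any(bad)) {
    message(sum(bad), " row(s) with negative values detected")
    if (disclose) print(df[bad, , drop = FALSE])
  }
  invisible(bad)
}

# Return a copy of the data without the affected rows.
drop_negatives <- function(df, cols) {
  df[!has_negative(df, cols), , drop = FALSE]
}

count_cols <- c("Confirmed", "Recovered", "Active")
bad   <- check_negatives(toy, count_cols, disclose = TRUE)
clean <- drop_negatives(toy, count_cols)
```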
Running:
covid_data_agg_wide <- covid19.data(case = "aggregated", local.data = FALSE, debrief = FALSE)
results in a mistake, illustrated by this record. The US Recovered data is returned: FIPS: NA, Province_State: Recovered, Country_Region: US, Last_Update: 2020-08-10 04:34:55, Recovered: 1656864, Active: -2458499, Combined_Key: Recovered, US