mponce0 / covid19.analytics

R package to obtain and analyze live data from the nCOVID19 coronavirus
https://mponce0.github.io/covid19.analytics/
GNU General Public License v2.0

covid19.data case = "aggregated" data issue #6

Open ZellW opened 3 years ago

ZellW commented 3 years ago

Running covid_data_agg_wide <- covid19.data(case = "aggregated", local.data = FALSE, debrief = FALSE) returns erroneous data, as illustrated by the record below. A US "Recovered" row comes back with a negative Active count:

FIPS: NA Province_State: Recovered Country_Region: US Last_Update: 2020-08-10 04:34:55 Recovered: 1656864 Active: -2458499 Combined_Key: Recovered, US

mponce0 commented 3 years ago

Hi Cliff,

Many thanks for reporting this! I believe this is a case of misreported data. Unfortunately this is not the first and, I'm afraid, not the last case of this kind. Take a look at the output attached below: it slices the cases where "Active" cases are reported as negative values, which of course should not happen! You will notice several entries, including the "Recovered, US" one you found. Now look at the actual data as reported by JHU/CSSEGIS (line #3621):

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/08-09-2020.csv#L3621

I have seen different variations of this type of issue, which is why I decided to add a few functions that run integrity/consistency checks (data.checks, consistency.check, integrity.check). I decided not to "tamper" with the data, and instead just let the user know and make them aware of the problem. Please let me know if there is anything else you find. I can think of alternatives to mitigate this, such as a "floor" for values that should never be negative (e.g. cumulative quantities), controlled by an optional argument, for instance.
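As a rough illustration of the "floor" idea, here is a minimal base-R sketch. The helper name floor_values and its arguments (cols, floor, apply.floor) are hypothetical and are not part of covid19.analytics. The toy values are copied from the Active column reported in this thread.

```r
# Hypothetical helper, NOT part of covid19.analytics: clamp the selected
# numeric columns to a lower bound, leaving NA values untouched.
floor_values <- function(df, cols, floor = 0, apply.floor = TRUE) {
  if (!apply.floor) return(df)           # opt-out: return the data unchanged
  for (col in cols) {
    df[[col]] <- pmax(df[[col]], floor)  # pmax() keeps NA as NA by default
  }
  df
}

# Toy rows using Active values reported in this thread
toy <- data.frame(
  Combined_Key = c("Cantabria, Spain", "Ceuta, Spain", "Unknown, Chile"),
  Active       = c(-139, -4, -109),
  stringsAsFactors = FALSE
)

floored   <- floor_values(toy, cols = "Active")
untouched <- floor_values(toy, cols = "Active", apply.floor = FALSE)
```

Keeping the flooring behind an explicit argument means the raw values stay available by default, in line with the stated preference not to alter the data silently.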

      FIPS     Admin2 Province_State Country_Region         Last_Update     Lat
NA      NA       <NA>           <NA>           <NA>                <NA>      NA
2797 90018 Unassigned        Indiana             US 2020-07-05 04:33:46      NA
NA.1    NA       <NA>           <NA>           <NA>                <NA>      NA
2801 90022 Unassigned      Louisiana             US 2020-07-05 04:33:46      NA
2803 90024 Unassigned       Maryland             US 2020-07-05 04:33:46      NA
2833 90056 Unassigned        Wyoming             US 2020-07-05 04:33:46      NA
3196    NA                 Cantabria          Spain 2020-07-05 04:33:46 43.1828
3207    NA                     Ceuta          Spain 2020-07-05 04:33:46 35.8894
NA.2    NA       <NA>           <NA>           <NA>                <NA>      NA
3246    NA               Extremadura          Spain 2020-07-05 04:33:46 39.4937
3256    NA                   Galicia          Spain 2020-07-05 04:33:46 42.5751
NA.3    NA       <NA>           <NA>           <NA>                <NA>      NA
3407    NA                    Murcia          Spain 2020-07-05 04:33:46 37.9922
3455    NA                Pais Vasco          Spain 2020-07-05 04:33:46 42.9896
3486    NA                 Recovered             US 2020-07-05 04:33:46      NA
3584    NA                   Unknown           Peru 2020-07-05 04:33:46      NA
3798    NA                   Unknown          Chile 2020-07-03 15:33:50      NA
       Long_ Confirmed Deaths Recovered  Active              Combined_Key
NA        NA        NA     NA        NA      NA                      <NA>
2797      NA         0    193         0    -193   Unassigned, Indiana, US
NA.1      NA        NA     NA        NA      NA                      <NA>
2801      NA         7    108         0    -101 Unassigned, Louisiana, US
2803      NA         0     23         0     -23  Unassigned, Maryland, US
2833      NA         0     19         0     -19   Unassigned, Wyoming, US
3196 -3.9878      2364    216      2287    -139          Cantabria, Spain
3207 -5.3213       163      4       163      -4              Ceuta, Spain
NA.2      NA        NA     NA        NA      NA                      <NA>
3246 -6.0679      3047    519      2652    -124        Extremadura, Spain
3256 -8.1339      9251    619      9204    -572            Galicia, Spain
NA.3      NA        NA     NA        NA      NA                      <NA>
3407 -1.1307      1697    148      2180    -631             Murcia, Spain
3455 -2.6189     13809   1561     16160   -3912         Pais Vasco, Spain
3486      NA         0      0    894325 -847695             Recovered, US
3584      NA         0      0    189621 -189621             Unknown, Peru
3798      NA         0      0       109    -109            Unknown, Chile
     Incidence_Rate Case.Fatality_Ratio
NA               NA                  NA
2797             NA                  NA
NA.1             NA                  NA
2801             NA         1542.857143
2803             NA                  NA
2833             NA                  NA
3196       406.4363            9.137056
3207       192.1513            2.453988
NA.2             NA                  NA
3246       285.9894           17.033147
3256       342.5737            6.691169
NA.3             NA                  NA
3407       114.0715            8.721273
3455       634.0570           11.304222
3486             NA                  NA
3584             NA                  NA
3798             NA                  NA
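The all-NA rows in the output above (NA, NA.1, ...) are what base R produces when a data frame is subset with a logical vector that contains NA. The sketch below reproduces that on a toy data frame and shows how which() avoids it. It also recomputes Active as Confirmed - Deaths - Recovered, a relation most rows above satisfy, to flag rows that disagree. Three rows are copied from the output above; the "Made-up, Example" row is invented for illustration only.

```r
# Toy data: three rows from the output above plus one invented row
# whose Active value is missing (NA)
toy <- data.frame(
  Combined_Key = c("Unassigned, Indiana, US", "Galicia, Spain",
                   "Recovered, US", "Made-up, Example"),
  Confirmed    = c(0, 9251, 0, 10),
  Deaths       = c(193, 619, 0, 1),
  Recovered    = c(0, 9204, 894325, 2),
  Active       = c(-193, -572, -847695, NA),
  stringsAsFactors = FALSE
)

# Logical indexing: each NA in the condition yields an all-NA row
with_na_rows <- toy[toy$Active < 0, ]

# which() keeps only the positions where the condition is TRUE
negative_only <- toy[which(toy$Active < 0), ]

# Recompute Active and keep the rows where the reported value disagrees
negative_only$Recomputed <- with(negative_only, Confirmed - Deaths - Recovered)
mismatched <- negative_only[negative_only$Active != negative_only$Recomputed, ]
```

In this toy example only the "Recovered, US" row ends up in mismatched: its reported Active value is not even consistent with the other columns of the same row.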

Best, Marcelo

mponce0 commented 3 years ago

Added new features to the integrity.check and consistency.check functions, including a new argument "disclose" which will make the functions report the entries where problems have been detected; e.g. negative cumulative quantities or negative values in reported number of cases.

Added a new function nullify.data() which will remove the "spurious" cases.
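To make the "report, then optionally remove" workflow concrete, here is a minimal base-R sketch. The functions check_negatives and drop_flagged are hypothetical stand-ins written for this illustration. They do not reproduce the implementation or signatures of integrity.check, consistency.check or nullify.data. The first two toy rows come from the output earlier in this thread; the third row is invented.

```r
# Hypothetical stand-ins, NOT the covid19.analytics implementation.

# Return the positions of rows with a negative value in any of `cols`;
# when disclose = TRUE, also print the offending rows.
check_negatives <- function(df, cols, disclose = FALSE) {
  flags <- lapply(cols, function(col) !is.na(df[[col]]) & df[[col]] < 0)
  bad   <- Reduce(`|`, flags)
  if (disclose && any(bad)) print(df[bad, , drop = FALSE])
  invisible(which(bad))
}

# Remove the flagged rows; an empty index leaves the data unchanged.
drop_flagged <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[-idx, , drop = FALSE]
}

toy <- data.frame(
  Combined_Key = c("Murcia, Spain", "Pais Vasco, Spain", "Made-up, Example"),
  Confirmed    = c(1697, 13809, 100),
  Active       = c(-631, -3912, 25),
  stringsAsFactors = FALSE
)

idx     <- check_negatives(toy, cols = c("Confirmed", "Active"), disclose = TRUE)
cleaned <- drop_flagged(toy, idx)
```

Separating detection from removal keeps the default behaviour non-destructive: the data is only changed when the user explicitly asks for it.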

So far these changes are available only in the development version.