nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Data Issue: Blanks for Puerto Rico deaths #507

Closed dgglynn closed 3 years ago

dgglynn commented 3 years ago

Describe the issue:

Fuller details

In the current us-counties.csv (as of 11/29/2020), lines 117488 to 117562, which cover Puerto Rico for 2020-05-05, have trailing commas that are inconsistent with the format of the other lines in the CSV file.

The 11/26/2020 version of us-counties.csv does not have these trailing commas for the entries for Puerto Rico starting at line 17488.

A cursory inspection suggests that Puerto Rico entries in the current us-counties.csv also have trailing commas for dates after 2020-05-05. I did not confirm that this applies to every Puerto Rico entry in the current file.

The 2020-05-05 entries for Puerto Rico are the first ones that have a "county" entry (earlier entries for Puerto Rico appear to be aggregate data, with the county field labeled unknown).

This difference in line formatting broke importing of the csv file into my influxdb instance.
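For anyone who wants to locate the affected rows before changing an importer, here is a minimal sketch in Python. It assumes the column order date,county,state,fips,cases,deaths; the sample rows and the helper name `rows_with_blank_deaths` are made up for illustration and are not taken from the real file.

```python
import csv
import io

# Illustrative rows only; the counts are invented for this sketch.
SAMPLE = """date,county,state,fips,cases,deaths
2020-05-04,Unknown,Puerto Rico,,100,5
2020-05-05,Adjuntas,Puerto Rico,72001,3,
2020-05-05,Allegheny,Pennsylvania,42003,1400,120
"""


def rows_with_blank_deaths(text):
    """Return (line_number, row) pairs whose deaths field is empty."""
    reader = csv.DictReader(io.StringIO(text))
    hits = []
    for row in reader:
        if row["deaths"].strip() == "":
            # line_num counts the lines read so far, header included.
            hits.append((reader.line_num, row))
    return hits


blank = rows_with_blank_deaths(SAMPLE)
for line_no, row in blank:
    print(line_no, row["date"], row["county"], row["state"])
```

To run this against the real file, pass the file's contents instead of `SAMPLE`; the reported numbers then correspond to line numbers in the CSV.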

SomervilleTom commented 3 years ago

The change to simply DROP "deaths" data for PR is a breaking change that breaks the ingestion process of my site as well.

I suggest that it would be better to remove ALL the municipio data for PR than to include data that is inconsistent with the rest of the file. Having a header that includes a field (deaths) and then providing a few hundred rows (out of 776,197) that end with the field separator is not acceptable. Either that, or provide a distinguished value (such as "-1") for the missing field.

albertsun commented 3 years ago

Hi folks, apologies for the change breaking ingestion processes. The trailing commas indicate that the value for deaths for all Puerto Rico municipios is blank, which should be distinct from "0". We generally use blanks for missing values rather than a -1 or anything like that. Missing FIPS codes are handled the same way, as is the live/us-counties.csv file.

This change was requested in this other issue https://github.com/nytimes/covid-19-data/issues/457

We can't remove all the municipio data for Puerto Rico as the cases values are all still valid.
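To illustrate the convention described in this comment (a blank means "missing", which is not the same as 0), here is a rough Python sketch of an ingestion step that keeps the two apart. The sample rows, the record layout, and the helper name `to_optional_int` are assumptions made for this example, not part of the repository.

```python
import csv
import io
from typing import Optional

# Illustrative rows only; the counts are invented for this sketch.
SAMPLE = """date,county,state,fips,cases,deaths
2020-05-05,Adjuntas,Puerto Rico,72001,3,
2020-05-05,Unknown,Rhode Island,,12,0
2020-05-05,Autauga,Alabama,01001,60,3
"""


def to_optional_int(field: str) -> Optional[int]:
    """Map a blank CSV field to None; parse anything else as an integer."""
    field = field.strip()
    return int(field) if field else None


records = [
    {
        "date": row["date"],
        "county": row["county"],
        "state": row["state"],
        # FIPS stays text so leading zeros survive; blank becomes None.
        "fips": row["fips"] or None,
        "cases": int(row["cases"]),
        "deaths": to_optional_int(row["deaths"]),
    }
    for row in csv.DictReader(io.StringIO(SAMPLE))
]

# Totals should skip missing values instead of counting them as zero.
known = [r["deaths"] for r in records if r["deaths"] is not None]
print("known deaths:", sum(known), "rows missing deaths:", len(records) - len(known))
```

Keeping `None` separate from `0` lets a downstream store write a real NULL for the Puerto Rico municipios while still recording a reported zero where one exists.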

SomervilleTom commented 3 years ago

First, let me just express my appreciation to the entire NYT team for providing this invaluable resource.

I guess there's just no good way around the issue -- if the data isn't there, it isn't there. I'll refactor my ingestion process to be more robust. I appreciate the quick response to my earlier comment.

gnewman7 commented 3 years ago

NYTimes, this data is a great resource. Thank you for sharing. I too ran into the same ingestion issue with the Puerto Rico entries. I will correct my MySQL load scripts if the death values are going to stay blank.

MySQL load script workaround: check whether the deaths column is blank and, if so, set it to NULL.

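-- date and deaths are read into user variables so they can be transformed in SET.
-- NULLIF(TRIM(@deaths), '') stores an empty or space-only deaths field as NULL.
LOAD DATA
LOCAL INFILE '../covid-19-data/us-counties.csv' REPLACE
INTO TABLE covid.us_counties
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(@date,county,state,fips,cases,@deaths)
SET date_time = CONCAT(@date, ' 17:00:00'),
deaths = NULLIF(TRIM(@deaths), '');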

See Grafana Covid Dashboard Project https://github.com/gnewman7/us-covid-19-dashboards

mhowe0422 commented 3 years ago

I've already made the load change. Thanks for the data, keeping us busy doing our own tracking.

Mark Pittsburgh PA



albertsun commented 3 years ago

Appreciate you all for bearing with us through the format change.