nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Other
6.99k stars 3.46k forks source link

Data Issue: us-counties.csv file exceeds 1M lines, can't be imported into Numbers on Mac #540

Closed joecascio closed 3 years ago

joecascio commented 3 years ago

The us-counties file today exceeded 1,000,000 lines. It can't be displayed on Github nor can it be imported into the Numbers app on a Mac (OS version: Catalina), which, to my surprise has a hard capacity limit. Don't know if Excel has a similar limit. At any rate, I think a lot of consumers of this data use Numbers or Excel to do analysis on it. I don't see a solution other than perhaps having a second file that doesn't include the first few months of data.

Being a developer myself, I'll just throw together a little Python preprocessor to lop off the first few months of the csv file before importing it, but that only solves the problem for me, unless I contribute it to your effort, which I'd be happy to do. But, I'm hoping you guys can come up with a more elegant, universal solution.

Anyway, thanks for all the work you've done on this project. I take this data, do a little averaging on the new cases and new death, and chart them for my local Facebook community group.

joecascio commented 3 years ago

Well, I just had a "duh" moment. The easiest solution would be to simply bust out the counties into 50 separate state files, so have AL-counties.csv, AK-counties.csv, AZ-counties.csv, etc. Don't know if you're up for that but it would make processing faster. Even on a high powered iMac, the us-counties file takes many seconds to load, and some seconds to search. I'm only interested in a few counties, so I could open and find those few faster if the states were broken out. Thanks.

tiffehr commented 3 years ago

Thanks for this feedback. We knew we'd hit these limits eventually. Welcome to data growing in proportion to a contagious virus.

We'll discuss options and get back to you!

joecascio commented 3 years ago

I replied to the email notification. I wrote a little Python script last night to bust out the us-counties csv file into separate csv files for each county in a directory tree that looks like the following. The Times is welcome to use it if you like.

./states: Alabama Illinois Nebraska Rhode Island Alaska Indiana Nevada South Carolina Arizona Iowa New Hampshire South Dakota Arkansas Kansas New Jersey Tennessee California Kentucky New Mexico Texas Colorado Louisiana New York Utah Connecticut Maine North Carolina Vermont Delaware Maryland North Dakota Virgin Islands District of Columbia Massachusetts Northern Mariana Islands Virginia Florida Michigan Ohio Washington Georgia Minnesota Oklahoma West Virginia Guam Mississippi Oregon Wisconsin Hawaii Missouri Pennsylvania Wyoming Idaho Montana Puerto Rico

./states/Alabama: Autauga.csv Cherokee.csv Coosa.csv Etowah.csv Jefferson.csv Marengo.csv Pike.csv Walker.csv Baldwin.csv Chilton.csv Covington.csv Fayette.csv Lamar.csv Marion.csv Randolph.csv Washington.csv Barbour.csv Choctaw.csv Crenshaw.csv Franklin.csv Lauderdale.csv Marshall.csv Russell.csv Wilcox.csv Bibb.csv Clarke.csv Cullman.csv Geneva.csv Lawrence.csv Mobile.csv Shelby.csv Winston.csv Blount.csv Clay.csv Dale.csv Greene.csv Lee.csv Monroe.csv St. Clair.csv Bullock.csv Cleburne.csv Dallas.csv Hale.csv Limestone.csv Montgomery.csv Sumter.csv Butler.csv Coffee.csv DeKalb.csv Henry.csv Lowndes.csv Morgan.csv Talladega.csv Calhoun.csv Colbert.csv Elmore.csv Houston.csv Macon.csv Perry.csv Tallapoosa.csv Chambers.csv Conecuh.csv Escambia.csv Jackson.csv Madison.csv Pickens.csv Tuscaloosa.csv

./states/Alaska: Aleutians East Borough.csv Kenai Peninsula Borough.csv Prince of Wales-Hyder Census Area.csv Aleutians West Census Area.csv Ketchikan Gateway Borough.csv Sitka City and Borough.csv Anchorage.csv Kodiak Island Borough.csv Skagway Municipality.csv Bethel Census Area.csv Kusilvak Census Area.csv Southeast Fairbanks Census Area.csv Bristol Bay Borough.csv Lake and Peninsula Borough.csv Unknown.csv Denali Borough.csv Matanuska-Susitna Borough.csv Valdez-Cordova Census Area.csv Dillingham Census Area.csv Nome Census Area.csv Wrangell City and Borough.csv Fairbanks North Star Borough.csv North Slope Borough.csv Yukon-Koyukuk Census Area.csv Haines Borough.csv Northwest Arctic Borough.csv Juneau City and Borough.csv Petersburg Borough.csv

./states/Arizona: Apache.csv Coconino.csv Graham.csv La Paz.csv Mohave.csv Pima.csv Santa Cruz.csv Yavapai.csv Cochise.csv Gila.csv Greenlee.csv Maricopa.csv Navajo.csv Pinal.csv Unknown.csv Yuma.csv

joecascio commented 3 years ago

states.zip This dir tree is what results from running my python program. I added the state csv file to each state directory. This runs straight off the covid-19-data-master directory downloaded from github. Takes about 10 seconds on a 3.5 GHz 16Gbyte memory, Quad-Core iMac running Catalina.

mrolfer commented 3 years ago

Joe

First off thanks a lot for this solution!

So how do I refresh this data each day. After unzipping and Opening a State folder, on Mac, launched the CSV download the first time. What do I do tomorrow.

Regards Michael

mrolfer commented 3 years ago

I consolidated all the CA counties and only show data through Feb 7?

joecascio commented 3 years ago

Joe

First off thanks a lot for this solution!

So how do I refresh this data each day. After unzipping and Opening a State folder, on Mac, launched the CSV download the first time. What do I do tomorrow.

Regards Michael

You run my program every day when the nytimes data is updated and it will rebuild all the files with the latest data included. It takes about 10 seconds to run on my machine. I'm thinking maybe I should post it here on github.

mrolfer commented 3 years ago

Joe

did you see my second comment about data only current as of 02/07/21?

Regards, Michael On Feb 21, 2021, 11:19 AM -0800, Joe Cascio notifications@github.com, wrote:

Joe First off thanks a lot for this solution! So how do I refresh this data each day. After unzipping and Opening a State folder, on Mac, launched the CSV download the first time. What do I do tomorrow. Regards Michael You run my program every day when the nytimes data is updated and it will rebuild all the files with the latest data included. It takes about 10 seconds to run on my machine. I'm thinking maybe I should post it here on github. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

joecascio commented 3 years ago

Joe did you see my second comment about data only current as of 02/07/21? Regards, Michael

Yes. The process is as follows to get new data every day. Download the NYTimes .zip file and unpack it in your working directory where you have the tool residing.

Run my program. In the working directory, it creates (or reuses) a directory called "states", and in there creates a directory for each state or territory, for instance "Connecticut".

In that separate state directory, it will create/overwrite a file named, for instance, "Connecticut-state.csv" that contains the state-level data extracted from us-states.csv for that state.

It also creates/overwrites a csv for each county in that state, for example "New London.csv" that contains the county level data for that county extracted from the us-counties.csv file.

Note that the all the files and directories are overwritten from one day to the next. So if you create new files of your own in a state directory, they will not be destroyed. But I would not advise modifying the program-created files as your changes will be lost the next day when you run the new nytimes data through the tool.

I'm creating a repository for the executable program now. Stay tuned here for a release notice later today. I'm glad someone can use my program!

mcsooks commented 3 years ago

FWIW, while not an automated process, one can also use the open source application OpenRefine to open datasets > 1 million rows, filter what one wants, and export it out as .csv or in several other formats. Not the most optimized software but the #rows it can handle is only limited by the amount of RAM one can throw at it. I also like some of the batch edit/transforms it can do compared to excel/etc.

OpenRefine https://openrefine.org/

joecascio commented 3 years ago

FWIW, while not an automated process, one can also use the open source application OpenRefine to open datasets > 1 million rows, filter what one wants, and export it out as .csv or in several other formats. Not the most optimized software but the #rows it can handle is only limited by the amount of RAM one can throw at it. I also like some of the batch edit/transforms it can do compared to excel/etc.

OpenRefine https://openrefine.org/

Thanks for that pointer. I know there must be lots of neat data manipulation tools like that, but I guess I'm too lazy to search them out when I already know Python. I wrote that code while watching (more like listening to) the Super Bowl. I'm retired, and bored out of my skull stuck in the house during COVID, so I decided to take on that little programming problem. I should have code released here pretty soon.

joecascio commented 3 years ago

I've created a git repository here: It contains the python code and (I hope) sufficient instructions on how to use it.

mrolfer commented 3 years ago

Joe

Given your coding skills would it not be a simple process to just split the full CSV into two CSVs Counties2020.csv and Counties2021.csv?

That way the issue would be fixed without the need for additional manipulations for those of us that are subject to the 1 million limit.

Regards, Michael On Feb 21, 2021, 6:43 PM -0800, Joe Cascio notifications@github.com, wrote:

I've created a git repository here: It contains the python code and (I hope) sufficient instructions on how to use it. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

joecascio commented 3 years ago

Possibly, but then that creates problems for people who need all the data for a state or county.

mrolfer commented 3 years ago

That may be a more complicated solution that NYtimes may need to resolve.

If my suggestion is too much for you to take on I understand.

Regards, Michael On Feb 22, 2021, 8:42 AM -0800, Joe Cascio notifications@github.com, wrote:

Possibly, but then that creates problems for people who need all the data for a state or county. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

joecascio commented 3 years ago

That may be a more complicated solution that NYtimes may need to resolve. If my suggestion is too much for you to take on I understand. Regards, Michael

There was some chatter about doing just that on the latest issue: "The Excel Line Limit is 1,048,576 and today the county file exceeded that. Any progress on splitting the file. One suggestion would be a 2020 file 'archive' and a 'live' 2021 addition. #544"

sbulen commented 3 years ago

I'm not 100% convinced this is a problem for NYT to solve...

If you want county-level data in Excel, you're going to need to filter by state. Plenty of csv filters are out there.

If you want total US data in Excel, you're going to need to use the state-level data already provided.

If NYT were to do something, maybe pre-split county-level data by state. Then you can choose which state is of interest.

Or... Find a tool without a 1M limitation.

I've been using Excel, & pivot all the way down to counties... I bounce it off of a MySQL DB first, to pre-calc new cases/deaths & rolling averages for speed. Ill need to cut over to state-level US data myself. And maybe some separate CA charts.... https://shawnbulen.com/forum/index.php?topic=4.0

mrolfer commented 3 years ago

Joe

Yes I saw that, my suggestion was not original.

Not knowing how much of an effort it would be for you I tossed it up to see if it would fly.

Though I am aware of how long it takes to modify the course of a corporate entity.

Thanks for your contribution!

Regards, Michael On Feb 22, 2021, 9:36 AM -0800, sbulen notifications@github.com, wrote:

I'm not 100% convinced this is a problem for NYT to solve... If you want county-level data in Excel, you're going to need to filter by state. Plenty of csv filters are out there. If you want total US data in Excel, you're going to need to use the state-level data already provided. If NYT were to do something, maybe pre-split county-level data by state. Then you can choose which state is of interest. Or... Find a tool without a 1M limitation. I've been using Excel, & pivot all the way down to counties... I bounce it off of a MySQL DB first, to pre-calc new cases/deaths & rolling averages for speed. Ill need to cut over to state-level US data myself. And maybe some separate CA charts.... https://shawnbulen.com/forum/index.php?topic=4.0 — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

tanimislam commented 3 years ago

I can load the CSV data fine into a pandas Dataframe

mrolfer commented 3 years ago

Thank you but I no longer need this data.

Regards, Michael On Jul 31, 2021, 3:46 PM -0700, Tanim Islam @.***>, wrote:

I can load the CSV data fine into a pandas Dataframe — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

davidjayjackson commented 3 years ago

Use R.

On Sat, Jul 31, 2021, 7:31 PM mrolfer @.***> wrote:

Thank you but I no longer need this data.

Regards, Michael On Jul 31, 2021, 3:46 PM -0700, Tanim Islam @.***>, wrote:

I can load the CSV data fine into a pandas Dataframe — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nytimes/covid-19-data/issues/540#issuecomment-890416574, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALDDKGZ3FQHGFYGAHJ4CPPDT2SBVHANCNFSM4XGGG57A .