nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Other
6.99k stars 3.46k forks source link

The Excel Line Limit is 1,048,576 and today the county file exceeded that. Any progress on splitting the file. One suggestion would be a 2020 file 'archive' and a 'live' 2021 addition. #544

Closed MirrorCola closed 2 years ago

MirrorCola commented 3 years ago

Describe the issue:

Fuller details

A clear and concise description of the problem, with examples if possible. If you are reporting incorrect data for a specific locality, please include a link to your source. We will compare with our own list of sources. Please check the README and recent commit messages for updates to see if your issue is addressed.

trajanmcgill commented 3 years ago

Just don't forget those of us whose data processes don't use Excel. An additional form of listing data would be fine, but yanking the existing file and replacing it with two would break things for a lot of people. Maybe even better than adding more data formats might be offering a script (or a couple of them for different scripting environments) that can cut up the data in the main file on the user side.

MirrorCola commented 3 years ago

Trajan, fair enough. Another thought that might solve both problems is to have an all-inclusive file and a separate 2021-only version. Would that work for you? I know it increases the work for NYT staff though.

davidjayjackson commented 3 years ago

I import the Covid-19 data into a Sqlite database.

On Sun, Feb 21, 2021, 1:18 PM MirrorCola notifications@github.com wrote:

Trajan, fair enough. Another thought that might solve both problems is to have an all-inclusive file and a separate 2021-only version. Would that work for you? I know it increases the work for NYT staff though.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nytimes/covid-19-data/issues/544#issuecomment-782901106, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALDDKG6QAJN3P4ELRV6J53DTAFE5TANCNFSM4X643W7A .

tiffehr commented 3 years ago

We're working on what we'll do, here. Back to you soon.

mhowe0422 commented 3 years ago

I bulk the county file into a DB and then use Excel Data options to view selected portions. I'd like to see a continue complete file.


From: Tiff Fehr notifications@github.com Sent: Sunday, February 21, 2021 2:51 PM To: nytimes/covid-19-data covid-19-data@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [nytimes/covid-19-data] The Excel Line Limit is 1,048,576 and today the county file exceeded that. Any progress on splitting the file. One suggestion would be a 2020 file 'archive' and a 'live' 2021 addition. (#544)

We're working on what we'll do, here. Back to you soon.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnytimes%2Fcovid-19-data%2Fissues%2F544%23issuecomment-782914954&data=04%7C01%7C%7Ce2df9b74997b43daa99308d8d6a20f99%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637495338814589297%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=tF7yY0IMHOULkrA29kyRxDLLEv2LvX1Ah7ccxcNi71E%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABGFUYHOGTN775Q34PRBPOLTAFP3PANCNFSM4X643W7A&data=04%7C01%7C%7Ce2df9b74997b43daa99308d8d6a20f99%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637495338814589297%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ftIwEXwjtl%2Fqo2%2BgxncEyiAia8j0cxjd1CJT4iOtdcI%3D&reserved=0.

joecascio commented 3 years ago

I have written a python3 script that busts all the state and county rows into their own csv files. It creates a directory structure as follows:

your-working-directory/
    states/
        Alabama/
            Alabama-state.csv
            Autauga.csv
            Baldwin.csv
            Barbour.csv
            etc...
        Alaska/
            Alaska-state.csv
            Aleutians East Borough.csv
            Aleutians West Census Area.csv
            Anchorage.csv
            etc...for each county in the state
        etc... for each state or territory

See this repository to get the program. I just released it, so if you have questions, drop an issue on my repository. Hope it helps you all out.

AZIQB commented 3 years ago

One easiest solution is to go to https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv >> Right click on "View raw" >> click on "Save Link as" >> In the file explorer save the file as "us-counties.csv" (Add .csv extension to the name)and it will save the csv file in your folder. You can then use any statistical program to import this csv file from your local folders. Thanks

ddenenberg71 commented 3 years ago

Please be a bit careful with the concept of 2020 data as an "archive". Just within the last two weeks there have been, at least, three sets of revisions that touched 2020 data. I think it was the Alaska change that resulted in revisions all the way back to May 2020. At least two other sets of revisions recently touched November and December 2020.

petra41 commented 3 years ago

One easiest solution is to go to https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv >> Right click on "View raw" >> click on "Save Link as" >> In the file explorer save the file as "us-counties.csv" (Add .csv extension to the name)and it will save the csv file in your folder. You can then use any statistical program to import this csv file from your local folders. Thanks

One easiest solution is to go to https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv >> Right click on "View raw" >> click on "Save Link as" >> In the file explorer save the file as "us-counties.csv" (Add .csv extension to the name)and it will save the csv file in your folder. You can then use any statistical program to import this csv file from your local folders. Thanks

This does not seem to work as it still only allows viewing a portion of the data.

gmillikan commented 3 years ago

To get around the 1 million rows limit in Excel just import the CSV into Excel using the Microsoft built in (and free) Power Query module: https://www.masterdataanalysis.com/ms-excel/analyzing-50-million-records-excel/

MirrorCola commented 3 years ago

gmillikan, Thank you, I am still fumbling with the Power Query, but got it to work for my purposes. Much appreciated!

AZIQB commented 3 years ago

One easiest solution is to go to https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv >> Right click on "View raw" >> click on "Save Link as" >> In the file explorer save the file as "us-counties.csv" (Add .csv extension to the name)and it will save the csv file in your folder. You can then use any statistical program to import this csv file from your local folders. Thanks

One easiest solution is to go to https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv >> Right click on "View raw" >> click on "Save Link as" >> In the file explorer save the file as "us-counties.csv" (Add .csv extension to the name)and it will save the csv file in your folder. You can then use any statistical program to import this csv file from your local folders. Thanks

This does not seem to work as it still only allows viewing a portion of the data.

Thank you - You are right Excel cannot even open this csv file but you can use another software like SAS, R or Tableau to import this data file and filter by dates or counties or states you need.

joecascio commented 3 years ago

For the folks having trouble with Excel and Numbers, do you need all the data for all the counties at the same time?

sbulen commented 3 years ago

It was sure nice to be able to pivot & drill at will earlier....

I load a mysql db, to be able to pre-calc new cases/deaths & rolling averages.

I split it into two workbooks now, a US level .xls & a state-level .xls. The state-level is driven by a parameterized query, so I can change the state at will & keep all the charts in sync.

The side benefit is I'm not working with a 100mb spreadsheet anymore! https://shawnbulen.com/forum/index.php?topic=4.0

joecascio commented 3 years ago

I don't know if you saw my earlier comments, but would having each county separated into its own csv file be a help or cause you more work? Because that's what my program does.

It was sure nice to be able to pivot & drill at will earlier....

I load a mysql db, to be able to pre-calc new cases/deaths & rolling averages.

I split it into two workbooks now, a US level .xls & a state-level .xls. The state-level is driven by a parameterized query, so I can change the state at will & keep all the charts in sync.

The side benefit is I'm not working with a 100mb spreadsheet anymore! https://shawnbulen.com/forum/index.php?topic=4.0

albertsun commented 3 years ago

Thanks all for all your feedback and discussion on this, and for sharing with each other different ways of handling the data. We do want most people who are using Excel and Numbers to be able to keep doing what they were doing before, so we've created a new us-counties-recent.csv file https://github.com/nytimes/covid-19-data/blob/master/us-counties-recent.csv that has data for just the last 30 days. Hopefully this is enough for people who are looking at recent trends.

The existing file will continue unchanged, and both will get updated together.

MirrorCola commented 3 years ago

Thank you, that will work great for me.

And thanks to everyone else, I learned some new stuff in the dialogue.