jueyang / call-me-maybe

Use the issue queue. Dark secrets welcome. (CUNY-J teaching 2015)
3 stars 0 forks source link

.CSV Data Set won't completely open in Excel #26

Closed EvWasLike closed 9 years ago

EvWasLike commented 9 years ago

Hey Jue Yang,

I'm working with a data set from NYC's Open Data (this one specifically: https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9#column-menu)

I wanted to refine it to only include data that's filtered for a few specific parameters, which works fine within the NYC Open Data site. But as soon as I try to open it within Excel (whether as a CSV or XLSX file), it tells me that it can't completely open the file.

I want to analyze the data and potentially use pivot tables with it. How can I make it completely accessible so I don't miss anything? Is there any application or tool you'd recommend to do so?

Thank you!

jueyang commented 9 years ago

Good question! Since you are looking at a dataset that records every single 311 request from 2010 to present, you can imagine (and probably have seen) how big the dataset is. I didn't export the whole thing, but the part I downloaded before canceling is 1 GB (by the way, that alone would be the equivalent of streaming an entire episode of Parks and Recs on Netflix, in HD.)

Excel has a limit on file size depending on what system you are using and how much memory you have in your computer. Even in some miracle that you opened the file completely, doing a pivot table on these hundreds of thousands of rows would undoubtedly freeze your computer.

There are two ways around it.

  1. Make your file size (much) smaller. An easier way to do this is to filter the data before you download it. For example, when you limit your dataset with a created date from 09/01/2015 to 09/30/2015, you get a dataset of manageable size. (You will create an account in the portal first, and save the dataset as a new view. That's what I was advised by the website when I attempted.)
  2. you can try pre-processing data with Command Line. If you have a mac, you can follow some of the pointers here to produce relevant CSV (rather than an all-encompassing one that no one can open). You can learn the basics of the Command Line Interface (CLI) with the links here.

Use a combination of the two if you can! CLI is super handy when you are processing large quantities of data (and like everything else, takes practice.) But if you are strapped by time, I'd recommend just using a small dataset.

EvWasLike commented 9 years ago

Hey Jue Yang,

Thanks for the help. I ended up scrapping the idea for that data set because it was so large. I wanted to find a set that could be more easily organized and was able to you. Your advice was definitely helpful though.

Thanks!