invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters
https://invisible.institute/police-data

Condense the May data further #44

evanwsun closed this issue 8 years ago

evanwsun commented 8 years ago

I haven't looked at the April and February data as closely, but the May data can be condensed to roughly 10% of its current volume simply by eliminating the duplicate rows created for each IPRA officer on an investigation. An example of how I formatted my condensed May data can be found here, under the Condensed sheet. Eliminating the IPRA duplicates alone took the data from ~25,000 entries to ~2,600. This would be useful because it lets people without coding experience sift through the data more easily.
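For anyone who wants to reproduce this kind of condensing outside a spreadsheet, here is a minimal sketch using only the Python standard library. It is not the method used for the Condensed sheet. The column names (`CRID`, `Officer_ID`, `Category`, `Investigator_Name`, `Investigator_Rank`) and the sample rows are made up for illustration, so the real May file will need its own column list.

```python
import csv
import io

# Hypothetical investigator column names; adjust to match the real file.
INVESTIGATOR_COLUMNS = ("Investigator_Name", "Investigator_Rank")


def condense(rows, investigator_columns=INVESTIGATOR_COLUMNS):
    """Drop the investigator columns, then keep the first copy of each remaining row."""
    seen = set()
    condensed = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in investigator_columns}
        key = tuple(sorted(kept.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            condensed.append(kept)
    return condensed


# Toy data, not taken from the actual dataset.
SAMPLE = """CRID,Officer_ID,Category,Investigator_Name
1001,7,Use of Force,A. Smith
1001,7,Use of Force,B. Jones
1001,9,Use of Force,A. Smith
1002,3,Search,C. Lee
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
condensed = condense(rows)
print(len(rows), "rows condensed to", len(condensed))
```

Note that this silently discards the investigator information, so it is only safe if the rows really differ in nothing but the investigator columns.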

DGalt commented 8 years ago

Is the data simply duplicated across investigators for the same CRID, or are there cases where the entries themselves differ between investigators within a single case?

evanwsun commented 8 years ago

From briefly scanning through the data, I only saw cases where the data was duplicated for the same CRID but different investigators. I'm not sure what would be the best way to go about performing a more thorough search.

DGalt commented 8 years ago

I can go ahead and do that; I just wasn't sure if you already had. It shouldn't be too much work to write a small script that breaks the dataset down by CRID and checks whether the values in each column are all the same.
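A rough sketch of what such a check might look like, using only the Python standard library. The column names and sample rows are hypothetical, not taken from the actual files. It groups rows by `CRID` and reports which columns take more than one value within each group.

```python
import csv
import io
from collections import defaultdict


def varying_columns_by_crid(rows, id_column="CRID"):
    """Map each CRID to the set of columns whose values differ within that CRID."""
    values = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                values[row[id_column]][column].add(value)
    return {
        crid: {col for col, seen in cols.items() if len(seen) > 1}
        for crid, cols in values.items()
    }


# Toy data with hypothetical column names.
SAMPLE = """CRID,Officer_ID,Category,Investigator_Name
1001,7,Use of Force,A. Smith
1001,7,Use of Force,B. Jones
1001,9,Use of Force,A. Smith
1002,3,Search,C. Lee
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
varying = varying_columns_by_crid(rows)
for crid, columns in sorted(varying.items()):
    print(crid, sorted(columns))
```

Any CRID whose varying columns include something other than the investigator fields (and fields expected to vary, such as the officer on the complaint) would be worth inspecting by hand before trusting a condensed copy.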

evanwsun commented 8 years ago

I have a copy of the condensed data made using Google Apps Script, but I haven't checked it for accuracy. From what I can tell it looks accurate, but it's easy to miss something in such a large dataset.