invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters
https://invisible.institute/police-data

Compare incident counts and overlap between data sources #11

Closed alexsoble closed 6 years ago

alexsoble commented 8 years ago

Via https://github.com/invinst/shootings-data/issues/4#issuecomment-223840565:

One simple way to start wrapping our heads around the data might be to start by gathering up incident counts from each of the four data sets we care about: February, April, May, and June.

We could also find out how much overlap exists between the incident IDs in each pair of data sets.

That could let us answer questions like:

"Can the April data set alone let us link most of the June 3 incidents to CPD officers?"
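A minimal sketch of that comparison, assuming each data set has already been reduced to a set of incident IDs. The IDs below are made up for illustration, and `incident_counts` and `pairwise_overlap` are hypothetical helper names, not anything from the repo:

```python
from itertools import combinations

# Hypothetical incident IDs per data set; real values would be read from the files.
datasets = {
    "February": {"1001", "1002", "1003"},
    "April": {"1002", "1003", "1004", "1005"},
    "May": {"1003", "1005", "1006"},
    "June": {"1005", "1006"},
}

def incident_counts(datasets):
    """Number of unique incident IDs in each data set."""
    return {name: len(ids) for name, ids in datasets.items()}

def pairwise_overlap(datasets):
    """Number of incident IDs shared by each pair of data sets."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(datasets, 2)
    }

counts = incident_counts(datasets)
overlaps = pairwise_overlap(datasets)

for name, n in counts.items():
    print(f"{name}: {n} unique incident IDs")
for (a, b), n in overlaps.items():
    print(f"{a} & {b}: {n} shared incident IDs")
```

Four data sets give six pairs, so the whole overlap table stays small enough to paste into the wiki.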

DGalt commented 8 years ago

Are the incident numbers (I'm assuming this is the "complaint_number" column) unique forever, or should we be concerned about them rolling over or getting reused at some point?

alexsoble commented 8 years ago

This is a great question! @rajivsinclair / @ithinkidunno, do you know?

alexsoble commented 8 years ago

Hi @DGalt, @ithinkidunno says the complaint numbers are unique forever.

DGalt commented 8 years ago

Of the 362 unique Complaint_Numbers (CRIDs) in the May dataset, 201 do not appear among the unique Complaint_Numbers in the April dataset.
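For anyone who wants to reproduce this kind of one-directional check, here is a rough sketch using plain Python sets. The CRID values are invented placeholders and `coverage` is a hypothetical helper, so treat it as an illustration of the approach rather than the script actually used:

```python
def coverage(target_ids, reference_ids):
    """Compare target IDs against a reference set.

    Returns (matched, missing, share_matched), where `missing` holds the
    target IDs that never appear in the reference data set.
    """
    target = set(target_ids)
    reference = set(reference_ids)
    matched = target & reference
    missing = target - reference
    share = len(matched) / len(target) if target else 0.0
    return matched, missing, share

# Placeholder CRIDs, not real complaint numbers.
may_ids = ["1000001", "1000002", "1000003", "1000004", "1000002"]
april_ids = ["1000002", "1000003", "1000009"]

matched, missing, share = coverage(may_ids, april_ids)
print(f"{len(missing)} of {len(set(may_ids))} May CRIDs are not in April "
      f"({share:.0%} matched)")
```

The check is directional: swapping the arguments answers the opposite question, namely how many April CRIDs are absent from May.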

alexsoble commented 8 years ago

@DGalt Great, added to the wiki.

Do folks think it's useful to do these comparisons between each dataset or nah? Chaclyn seemed interested in what @DGalt found at Code for Boston.

jayqi commented 8 years ago

Here are some of my counts of unique CRID numbers for each dataset. I also broke down the April and May ones by the year (based on the year on the filename or sheetname).

June dump 102
May FOIA 360
    2016 5
    2015 29
    2014 47
    2013 43
    2012 47
    2011 50
    2010 38
    2009 44
    2008 57
April FOIA 7175
    2016 330
    2015 1392
    2014 1663
    2013 1913
    2012 1877
February FOIA 405

Seems like I'm missing two compared to @DGalt for the May set. Possibly because I'm querying the "Incid" sheets and not the "Parties" sheets.
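A small sketch of how the per-year counts and the "Incid" versus "Parties" gap could be checked, assuming the sheets have already been flattened into `(year, crid)` pairs. The rows are made up and `unique_crids_by_year` is a hypothetical helper:

```python
from collections import defaultdict

def unique_crids_by_year(rows):
    """rows: iterable of (year, crid) pairs.

    Returns {year: number of unique CRIDs}, newest year first.
    """
    seen = defaultdict(set)
    for year, crid in rows:
        seen[year].add(crid)
    return {year: len(crids) for year, crids in sorted(seen.items(), reverse=True)}

# Placeholder rows standing in for the "Incid" and "Parties" sheets.
incid_rows = [
    (2012, "1000001"),
    (2012, "1000002"),
    (2013, "1000003"),
    (2013, "1000003"),  # repeated CRID, counted once
]
parties_rows = incid_rows + [(2012, "1000004")]

incid_counts = unique_crids_by_year(incid_rows)
parties_counts = unique_crids_by_year(parties_rows)

# CRIDs that show up in "Parties" but never in "Incid".
only_in_parties = {crid for _, crid in parties_rows} - {crid for _, crid in incid_rows}
print(incid_counts, parties_counts, sorted(only_in_parties))
```

Listing the IDs that exist in only one sheet type makes it easier to tell a genuine data gap from a counting difference.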

jilmun commented 8 years ago

@jayqi For the May2016 data, the only CRID that you should be missing if you're using "incid" tabs only is "1052142" from 2012.

Yr      Cnt
2016    5
2015    29
2014    47
2013    43
2012    48
2011    50
2010    38
2009    44
2008    57
Total   361

Here's my list of unique CRIDs. @DGalt, if your total is 362, let me know which one I'm missing. IPRA-May2016-CRID.txt <-- this is really a CSV file...

DGalt commented 8 years ago

@yahwes 361 is correct. I must have forgotten to drop NaNs from the list when I counted the other night, which is what led to the extra unique value.
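To illustrate how a missing value can inflate a unique count by one, here is a minimal sketch in plain Python. The IDs are placeholders, and `is_missing` and `unique_ids` are hypothetical helpers, not the code used for the counts above:

```python
import math

def is_missing(value):
    """True for None, float NaN, and blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def unique_ids(values):
    """Unique IDs as stripped strings, with missing entries dropped first."""
    return {str(v).strip() for v in values if not is_missing(v)}

# Placeholder column with one missing entry, as might come from an empty cell.
raw = ["1000001", "1000002", math.nan, "1000002"]

naive_count = len(set(raw))         # the NaN is counted as its own "value"
clean_count = len(unique_ids(raw))  # missing entries dropped before counting
print(naive_count, clean_count)
```

Filtering before deduplicating keeps the count tied to actual complaint numbers rather than to however the loader represents empty cells.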