invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters
https://invisible.institute/police-data

How to relate Feb, April, and May data sets? #28

Closed DGalt closed 6 years ago

DGalt commented 8 years ago

A question I keep coming up against is how best to combine or relate the different datasets available to us, if we should combine them at all. To briefly summarize what we have (this is in the wiki as well):

The unique-identifier column in the February dataset is Log No, while in the April and May datasets it is Complaint_Number.

Assuming that we can treat the values in Log No in the February dataset as equivalent to the values found in the Complaint_Number column found in the April and May datasets (@ChaclynHunt, @rajivsinclair can you confirm / refute this):

Overall the columns in the April and May sets largely correspond to each other, although there are several extra columns in May that do not exist in April (some work on trying to match the April and May columns can be found here, about halfway down the page)

February, in contrast, is a bit sparser in terms of overall data. While I think most of the columns in February can be matched to columns in April/May, there are a number of columns in April/May that do not exist in February.
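To make the column comparison above concrete, here is a minimal sketch in plain Python. It assumes each dataset has been loaded as a list of dicts (e.g. via `csv.DictReader`) and that Log No really can be treated as Complaint_Number, which is still unconfirmed. The helper names, the `Category` and `Beat` columns, and the sample values are all made up for illustration.

```python
KEY = "Complaint_Number"  # key name used in April/May

def normalize_key(rows, old_key="Log No", new_key=KEY):
    """Return copies of rows with old_key renamed to new_key.

    Assumes the two identifier columns are equivalent (unconfirmed).
    """
    out = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if old_key in row:
            row[new_key] = row.pop(old_key)
        out.append(row)
    return out

def column_diff(rows_a, rows_b):
    """Return (shared, only_a, only_b) sets of column names."""
    cols_a = set().union(*(r.keys() for r in rows_a))
    cols_b = set().union(*(r.keys() for r in rows_b))
    return cols_a & cols_b, cols_a - cols_b, cols_b - cols_a

# Hypothetical rows, not real data.
feb = [{"Log No": "1001", "Category": "Use of Force"}]
april = [{"Complaint_Number": "1001", "Category": "Use of Force", "Beat": "0111"}]

shared, only_feb, only_april = column_diff(normalize_key(feb), april)
print("shared:", sorted(shared))
print("only in February:", sorted(only_feb))
print("only in April:", sorted(only_april))
```

This only matches columns whose names are identical, so columns that mean the same thing under different names would still need a hand-built mapping.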

One thing that might be worth looking into: for the unique identifiers that appear in more than one dataset, how much of the data recorded for those identifiers actually agrees?
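A rough sketch of that check, again in plain Python over lists of dicts. It assumes the identifier columns are equivalent and that each identifier appears at most once per dataset (if a complaint spans several rows, only the last row is compared here). The function names, the `Category` column, and the sample values are hypothetical.

```python
def index_by_key(rows, key):
    """Map each identifier to its row (last row wins if an identifier repeats)."""
    return {row[key]: row for row in rows}

def overlap_report(rows_a, rows_b, key_a, key_b):
    """Return the shared identifiers and, per shared column, (matches, total)."""
    a = index_by_key(rows_a, key_a)
    b = index_by_key(rows_b, key_b)
    shared_ids = sorted(a.keys() & b.keys())
    agreement = {}
    for id_ in shared_ids:
        # Compare only columns present in both rows, skipping the keys themselves.
        for col in (a[id_].keys() & b[id_].keys()) - {key_a, key_b}:
            matches, total = agreement.get(col, (0, 0))
            same = a[id_][col] == b[id_][col]
            agreement[col] = (matches + int(same), total + 1)
    return shared_ids, agreement

# Hypothetical rows, not real data.
feb = [{"Log No": "1001", "Category": "A"}, {"Log No": "1002", "Category": "B"}]
may = [{"Complaint_Number": "1002", "Category": "B"},
       {"Complaint_Number": "1003", "Category": "C"}]

shared_ids, agreement = overlap_report(feb, may, "Log No", "Complaint_Number")
print("identifiers in both:", shared_ids)
for col, (matches, total) in sorted(agreement.items()):
    print(f"{col}: {matches}/{total} rows agree")
```

A low agreement rate on shared identifiers would be a sign that the datasets describe different events and shouldn't simply be collapsed into one.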

Considering, though, that these three datasets are produced by different sources (particularly reports of misconduct vs. the reports generated when an officer uses his/her firearm), I don't know that collapsing them into one large dataset is the best path forward. Or maybe it is, hence the need for discussion :)

alexsoble commented 8 years ago

Great point @DGalt. It seems to me there are about 3 basic ways to extract value from these:

All of these strike me as valid approaches to seeking out information & insights from the data.