invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters
https://invisible.institute/police-data
157 stars 60 forks source link

Try to merge shootings-append.csv into shootings-merge.csv #4

Closed alexsoble closed 6 years ago

alexsoble commented 8 years ago

shootings-append.csv appends data from 2 data sets on police shootings released pursuant to FOIA requests submitted in April and May 2016.

This data is dynamic, not static. For example, the status or details about a police shooting may change as an investigation progresses. There may also be other inconsistencies between the data sets.

We need to merge the shootings-append.csv data into a combined file, shootings-merge.csv. You can use your GitHub username or initials to identify your attempt at merging the data. For example, Matt Li (fictional person) might name his solution shootings-merge-ml.csv.

Feel free to use your own tools to merge the data. One contributor is using Stata, but you can also use Python or R. We hope that through multiple parallel attempts at merging the data, we can catch each other's mistakes.

To quote @rajivsinclair: "We are all about redundancy. We can't have enough of that. Do it your own way, look at our start, look at our finish, and compare the results. Help us identify the flaws in our approach."

We are going to work on documenting shootings-append.csv so the meaning of each column is clear. Please share questions about the data in this thread.

jayqi commented 8 years ago
alexsoble commented 8 years ago

Hi @jayqi! Thanks for the excellent questions.

  1. These are NOT the exact same dataset. Many of the columns should (in theory) be the same from month to month. But because these files describe ongoing investigations, we expect some of the columns to change over time, such as CURRENT_CATEGORY and CURRENT_STATUS. Take a look at COLUMN-DICTIONARY.md, where we are currently working on documenting what each of the columns mean.

    My understanding of project goals (copy-pasting from the README): "Our first goal is to use the FOIA’ed shootings data here to identify officers per incident log number in the June 3 trove."

    But you raise interesting questions: is it crucial to merge the FOIA dumps before we try to link officers with incident log numbers? If so, why?

    Maybe we could skip the merging step if it lets us deliver value quicker. Curious to hear what you think @rajivsinclair @ithinkidunno.

jayqi commented 8 years ago

If the May set is more recent, why not just use the May set? What information is contained in the April set that the May set is missing?

alexsoble commented 8 years ago

@jayqi Yes, great question. Let's dive into the data to find out. My understanding from @rajivsinclair and @ithinkidunno is that there may be some missing information in the May set that could be complemented by merging with the April set. But I would agree that it couldn't hurt to start with the May set and see what we find.

alexsoble commented 8 years ago

Same with json. Your json suggestion sounds very sensible to me. I am curious what @rajivsinclair and @ithinkidunno think.

json would make a lot of sense for columns that we expect to change over time, too:

// Incident

{
  "involved officers": [ ... ],
  "accused officer": { ... },
  "categories": [
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    },
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    }
  ]
}
ithinkidunno commented 8 years ago

The May and April files are different. This is why I appended them rather than merging them

ithinkidunno commented 8 years ago

the April one has a bigger universe of complaints, whereas the May only had the Shootings.

alexsoble commented 8 years ago

OK, this is useful @ithinkidunno.

Strikes me that a couple simple counts would be useful:

... and how much overlap exists between the incident IDs in each pair of data sets.

That could let us answer questions like:

b-meson commented 8 years ago

These don't seem to have definitions which is off since these columns should have the most relevance for the merging. Don't we need more context (from COLUMN-DICTIONARY.md)

 __start_code_2:__
__start_code_1:__
__rec:__

edit: this is issue #9

alexsoble commented 8 years ago

@freddymartinez9 Yup, we definitely need to provide more context here. The column documentation should be complete (at least a first draft) by tonight.

alexsoble commented 8 years ago

@freddymartinez9 We finished documenting the columns in #13.

alexsoble commented 8 years ago

@jayqi Are you coming to ChiHackNight tonight? Loved the ideas you shared here, would be excellent to meet up and work together on this in person. Let me know. If you want to chat, my email is in my profile.

DGalt commented 8 years ago

I have another version of this that I can submit a PR for if you want, although we might want to wait until we have a consensus on how we're treating May. You can take a look at the columns for april and may that I combined / renamed here, though, in the section marked "Cleaning / Combining April and May":

https://github.com/DGalt/shootings-data/blob/dev/april_may_munging.ipynb

alexsoble commented 8 years ago

@DGalt Another version of shootings-append.csv? I think a PR for that would be very welcome. Did you get a chance to compare it with the original shootings-append.csv? Any major differences?

I noticed this line in the iPython notebook:

sht_apd = pd.read_csv('./Raw/shootings-append.csv', low_memory=False)

... which seems to be reading @ithinkidunno's append for a comparison attempt, but I didn't see the sht_apd variable later on. I may also be misreading this!

alexsoble commented 8 years ago

@DGalt The Exploring Data section is extremely interesting.

Couple of questions:

  1. I wonder how the results line up against https://cpdb.co/data/bP8dwb/citizens-police-data-project?
  2. In the last table, what should we make of nan?
  3. The last table seems to be complaints with the most involved officers, is that correct? Creative way to slice the data, I would never have thought of that. I'm curious about those incidents: 1055981, 1070415, 1055068...
DGalt commented 8 years ago

you're right, I load @ithinkidunno's shootings-append but don't use it again. This is a cleaned up version of an earlier notebook where I was comparing the two, but I didn't include the comparison here. One major difference is @ithinkidunno's file only includes a subset of the data available from the April data set that corresponds to shootings and tasing (these are the events where the incident code begins with either "18", "20", or "S"). I do not do that here, as it's not clear yet whether we're focusing on those incidents or all possible incidents. It would be easy enough, though, to filter those particular matches out of my file and compare the two different version.

As for your other questions:

  1. I haven't looked at the cpdp dataset, but it would definitely be interesting to compare the two.
  2. The nan values correspond to a nan in the accused_officer_fname or accused_officer_lname column. I should probably re-do this so that if there is nan in one of those columns but not the other that's actually represented in the data.
  3. Yes, that is correct. I can provide the full dataset (for both this and for sorting by the number of incidents each officer is involved in). If particular incidents look interesting it's easy enough to filter them out and look at them specifically from the combined dataset.
jayqi commented 8 years ago

Merging in another related issue:

Right now, shootings-append.csv includes only data from the April and May files.

The February file includes only three types of incident categories: 18A, 18B, 20A. It uses a very different column schema from the April and May files, and it has less information that the other two files.

We would like to merge it in to shootings-append.csv, but we need to figure out how to accurately map the column schemas onto each other given that February has less information.