sc3 / cookcountyjail

A Django app that tracks the population of Cook County Jail over time and summarizes trends.
http://cookcountyjail.recoveredfactory.net/api/1.0/?format=json

Determine way to parse and categorize charges #113

Open eads opened 11 years ago

eads commented 11 years ago

The citations associated with an inmate are potentially interesting. Perhaps we should create a Google spreadsheet and do some good old-fashioned research! It would be most excellent to link each charge to the appropriate section of the lawbooks online, too.

bepetersn commented 11 years ago

I have wanted to do this since I found out about this project. Seeing analysis of the different charges and their frequencies would be extremely interesting.

wilbertom commented 11 years ago

I'm looking into this. The scraping is going to be pretty severe. Should this be its own app? Then we would have countyapi and ilcsapi.

bepetersn commented 11 years ago

At least before our conversion to the 2.0 API, I don't think it should be separate. Doesn't it make sense to start it off as a parse_charges function analogous to parse_location?

bepetersn commented 10 years ago

I've kind of changed my mind on this. Scraping, parsing, categorizing, and relating among different charges is a huge project, definitely outside the scope of the present one.

We can start to compile our own little Google spreadsheet, though; in fact, we definitely should, since it's a place to start. We should just be conscious that any effort towards this is also effort towards something much bigger. For an example of the kind of project I think this one could develop into, look at http://chicagocode.org

bepetersn commented 10 years ago

Here are my thoughts with regard to the work we have to do and the decisions we have to make. Consider that I think we want to categorize / parse these charges, with the goal being to end up with a list of all possible charges that we know about for our database. (We are currently missing a separate "charges" API for v1.0, like the "housinglocation" or "courtlocation" APIs; see issue #266).

Given that, can we build this charge object from just the "charges_citation", the code of law, or does the "charges" field, the more commentary-like one, contribute to the uniqueness of the charge? (Some are very generic, e.g. "POSS. CANNABIS".) We also have to consider that many charges citations have a long string of numbers after their code (e.g. "[201929]"). We might want to keep it (I don't know what it represents yet), but it would make uniquifying our charges much more difficult.
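A minimal sketch of how the bracketed number could be set aside rather than thrown away, so that uniquifying works on the citation alone. It assumes the number always sits in square brackets at the very end of the string; `split_citation` and `unique_citations` are hypothetical names, not anything in the current codebase.

```python
import re

# Assumption: the extra number is always a trailing "[digits]" group.
TRAILING_BRACKET = re.compile(r"\s*\[(\d+)\]\s*$")


def split_citation(raw):
    """Return (citation, bracket_number); bracket_number is None if absent."""
    match = TRAILING_BRACKET.search(raw)
    if match is None:
        return raw.strip(), None
    return raw[:match.start()].strip(), match.group(1)


def unique_citations(raw_citations):
    """Collapse citations that differ only in the trailing bracketed number."""
    return sorted({split_citation(raw)[0] for raw in raw_citations})


# Illustrative input only: pairing this citation with this number is made up.
citation, number = split_citation("625 ILCS 5/11-501(d)(1)(C) [201929]")
print(citation, number)
```

Keeping the number as a separate value means we can still store it once we learn what it represents.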

Further, even if we uniquify to this point, we have to consider that charges can actually be broken down further than simply their entire code. If we get a charge like "625 ILCS 5/11-501(d)(1)(C)", I think we should attempt to break this down by article, section, sub-section, all the way, or at any rate, to get both the most abstract and the most specific information out of it that we can. This will cause difficulty if some part or another is missing because of irregularity in input format / specificity, which I think exists.
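As a rough sketch of that breakdown, something like the following could split a citation into its parts. The field names (`chapter`, `act`, `section`, `subsections`) and the regex are my assumptions about the layout, including the guess that the leading number has one to three digits; anything that does not fit returns `None` instead of a partial guess.

```python
import re

# Assumed layout: "<chapter> ILCS <act>/<section>(sub)(sub)...",
# with a space sometimes used in place of the slash.
CITATION = re.compile(
    r"^\s*(?P<chapter>\d{1,3})\s+ILCS\s+"
    r"(?P<act>\d+(?:\.\d+)?)\s*[/ ]\s*"
    r"(?P<section>[0-9A-Za-z.\-]+)"
    r"(?P<subsections>(?:\s*\([0-9A-Za-z.]+\))*)\s*$"
)
SUBSECTION = re.compile(r"\(([0-9A-Za-z.]+)\)")


def parse_citation(raw):
    """Split a citation into a dict of parts, or return None if it does not fit."""
    match = CITATION.match(raw)
    if match is None:
        return None
    return {
        "chapter": match.group("chapter"),
        "act": match.group("act"),
        "section": match.group("section"),
        # Ordered from most general to most specific.
        "subsections": SUBSECTION.findall(match.group("subsections")),
    }


print(parse_citation("625 ILCS 5/11-501(d)(1)(C)"))
```

Because the subsections come back as an ordered list, a less specific citation such as one with no parenthesized parts simply yields an empty list, which is one way to cope with missing pieces.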

This is in fact a canonical parsing problem we're coming up against. If we look at enough of the dataset we're dealing with, we can form top-down models for interpretation of that data, based on our experience with it. So then if we get rich enough info, we can actually make intelligent guesses about what some of these partially entered charges might be. I think that's the end goal for parsing / categorizing charges.

bepetersn commented 10 years ago

Hey, I just had yet another thought for this thread. I was reflecting on how to tie our jail data to other datasets (see issue #26), and I bothered to actually look at one--the Crime Incident data released by the city. I hadn't expected to find anything particularly germane, but it just occurred to me that trying to integrate our own representation of charges data with the Incident Data's representation of charges (theirs is more abstract) would provide a natural way for us to go about categorizing, parsing, and representing this data.

Two examples. We will definitely be able to abstract away from our "charges citation" objects based on the "Primary Type", or on the "IUCR" field associated with the crime data. Primary type, I believe, is a well-defined classification system that the police have for crimes. IUCR is another such well-defined classification system, one that says something about the seriousness of a crime. "0110" IUCR is primary type "HOMICIDE", for instance. As I was saying, we would be forced to do these kinds of conversions with our charges before we could even try to compare these two datasets in this way, because primary type is the most specific piece of data they even give about the charge with the Crime Data.
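The conversion could be sketched as a two-step lookup: citation to IUCR code, then IUCR code to primary type. Only the "0110" to "HOMICIDE" pair comes from the crime data mentioned above; the citation-to-IUCR table is an empty placeholder that would have to be filled in by research, and `primary_type_for` is a hypothetical helper.

```python
# Known from the crime data: IUCR "0110" has primary type "HOMICIDE".
IUCR_TO_PRIMARY_TYPE = {"0110": "HOMICIDE"}

# Placeholder: we do not yet know any citation -> IUCR pairs.
CITATION_TO_IUCR = {}


def primary_type_for(citation, citation_to_iucr=None, iucr_to_type=None):
    """Return the primary type for a citation, or None if either step is unknown."""
    citation_to_iucr = CITATION_TO_IUCR if citation_to_iucr is None else citation_to_iucr
    iucr_to_type = IUCR_TO_PRIMARY_TYPE if iucr_to_type is None else iucr_to_type
    iucr = citation_to_iucr.get(citation)
    if iucr is None:
        return None
    return iucr_to_type.get(iucr)
```

Returning `None` for unmapped citations keeps the gaps visible, so we can count how much of the jail data is still uncategorized.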

However, we would definitely find out something interesting, journalistically, if we did this. The obvious one is the relationship between crime incidents, and time spent in County. But from a developer's perspective, this is also the obvious way to go. We need to provide more context to our charges, and this is one way to do it.

bepetersn commented 10 years ago

@eads, how do you think we can convert our charges data into "Primary Type" data, compatible with the Crime Incident Report data?

eads commented 10 years ago

Let's talk about this today.

Categorization is not beyond the scope of our project. In fact it's really important.

bepetersn commented 10 years ago

List of "primary types" of crimes, with the number of each committed: https://data.cityofchicago.org/Public-Safety/Summary-by-Primary-Type/yvjw-hzem

bepetersn commented 10 years ago

Talked to @tjakester about creating a map between charges citations and primary types ... it seems the only way to go about it may be doing some heavy digging through the directives of the CPD. Based on this resource which Tracy pointed me to, I found these:

http://directives.chicagopolice.org/directives/data/a7a57b38-13c72636-be313-c726-bae43e716aeaf35c.html?hl=true

and

http://directives.chicagopolice.org/directives/data/a7a57bf0-12d7196c-11f12-d71a-3c76ad6f2c11950a.html?hl=true

bepetersn commented 10 years ago

Here is a list of what I am calling "malformed charges citations", pulled from our production database: https://docs.google.com/spreadsheets/d/1bHVl9n-AD5ECr3HVrFRhwnzQWj8XKVr0Rb2_VpbQGD4/edit#gid=0

They are actually just the result of running all the citations through a regex that represents the most common form the charges come in. Using this regex, I was able to match 94.2% of the total citations we have seen, but only 80% of the total unique citations.
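For anyone who wants to reproduce numbers like these, here is a sketch of how the two match rates can be computed. The `COMMON_FORMAT` pattern is a stand-in I made up for illustration, not the regex actually used, and the sample list is invented.

```python
import re
from collections import Counter

# Stand-in pattern (an assumption): "<1-3 digits> ILCS <act>/<section>(sub)...",
# with a space allowed in place of the slash.
COMMON_FORMAT = re.compile(
    r"^\d{1,3} ILCS \d+(?:\.\d+)?[/ ][0-9A-Za-z.\-]+(?:\([0-9A-Za-z.]+\))*$"
)


def match_rates(citations, pattern=COMMON_FORMAT):
    """Return (share of all citations matched, share of unique citations matched)."""
    counts = Counter(citations)
    if not counts:
        return 0.0, 0.0
    total = sum(counts.values())
    matched_total = sum(n for text, n in counts.items() if pattern.match(text))
    matched_unique = sum(1 for text in counts if pattern.match(text))
    return matched_total / total, matched_unique / len(counts)


# Invented sample: one well-formed citation seen three times, two oddities seen once.
sample = ["720 ILCS 570 402(c)"] * 3 + ["38-21", "UNKNOWN"]
print(match_rates(sample))
```

The gap between the two rates arises because well-formed citations repeat often, while each malformed variant tends to appear only a handful of times.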

By these definitions, there are 194 "malformed" charges. Of these, there are two really common formats. After googling around, I think we can parse the first one into the most common format. It seems to reference a page, or something like that, where the statute is actually held. You can figure it out if you google one of these and see where it takes you. Examples of this:

38-21
38-21-3
38-12-4.2
56.5-1401(c)(7)(ii)
38-12-2-(a)(13)
95.5-6-210

Another format, I am not sure how to parse into the common format yet. It looks like this:

190400 ILCS 72 50
159375 ILCS 72 50
192800 ILCS 72 50

The rest are either edge cases of my regex, or are basically unhelpful in categorizing a charge, like an empty string, "000", "UNKNOWN", or sometimes what looks like a charge description.
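A sketch of sorting citations into these buckets, so each format can get its own parser later. All four patterns and the bucket names are assumptions written to fit the examples quoted in this comment, not rules taken from the statutes or from the regex I actually ran.

```python
import re

# Every pattern here is a guess fitted to the examples in this comment.
COMMON = re.compile(
    r"^\d{1,3} ILCS \d+(?:\.\d+)?[/ ][0-9A-Za-z.\-]+(?:\([0-9A-Za-z.]+\))*$"
)
DASHED = re.compile(r"^\d+(?:\.\d+)?(?:-[0-9A-Za-z.]+)+-?(?:\([0-9A-Za-z.]+\))*$")
LONG_PREFIX = re.compile(r"^\d{4,} ILCS \d+ \d+$")
PLACEHOLDERS = {"", "000", "UNKNOWN"}


def classify(raw):
    """Name the format bucket a raw citation string falls into."""
    text = raw.strip()
    if text.upper() in PLACEHOLDERS:
        return "placeholder"
    if COMMON.match(text):
        return "common"
    if DASHED.match(text):
        return "dashed"
    if LONG_PREFIX.match(text):
        return "long_prefix"
    return "other"


for example in ["720 ILCS 570 402(c)", "38-12-2-(a)(13)", "190400 ILCS 72 50", "UNKNOWN"]:
    print(example, "->", classify(example))
```

Anything that lands in "other" is what would still need a human look, for example strings that read like a charge description.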

bepetersn commented 10 years ago

Having worked with these charges and tried to parse them for a bit, I have thought of how we might eventually break these citations down into several fields. Here is an example of a "common format" citation:

720 ILCS 570 402(c)

becomes...

But I don't think we need to do this right now.