sc3 / cook-convictions-data

Django project for loading, cleaning and querying data about criminal convictions in Cook County, Illinois

Map charge descriptions to crosswalk categories #6

Open bepetersn opened 10 years ago

bepetersn commented 10 years ago

For statutes with multiple IUCR possibilities, make a CSV mapping between charge descriptions and crosswalk categories. Also map them to IUCR codes where an obvious mapping can be made, and eventually to our own categories of interest.

bepetersn commented 10 years ago

@ghing,

It turns out that charge descriptions map to IUCR categories more reliably than ILCS statutes do, but they're still not perfect: 15.6% of unique charge descriptions map to multiple IUCR categories.
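For reference, a figure like this can be computed by grouping the categories seen for each description and counting the descriptions that end up with more than one. This is only a sketch, not the code used in this repo: the `(description, category)` pairs, the sample values, and the function name are made up for illustration.

```python
from collections import defaultdict


def ambiguous_share(pairs):
    """Return (mapping, share) for (charge description, IUCR category) pairs.

    `pairs` is a hypothetical input; in practice it would be derived
    from the disposition records.
    """
    categories_by_desc = defaultdict(set)
    for desc, category in pairs:
        categories_by_desc[desc].add(category)

    if not categories_by_desc:
        return {}, 0.0

    # A description is ambiguous when more than one distinct category was seen.
    multiples = [d for d, cats in categories_by_desc.items() if len(cats) > 1]
    return dict(categories_by_desc), len(multiples) / len(categories_by_desc)


# Made-up sample data, for illustration only.
sample = [
    ("DESC A", "Theft"),
    ("DESC A", "Theft"),
    ("DESC B", "Battery"),
    ("DESC B", "Burglary"),
]
mapping, share = ambiguous_share(sample)
print(f"{share:.1%} of unique descriptions map to multiple categories")
```

Repeated identical pairs do not count as ambiguity here, because the categories are collected in a set.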

Under Miscellaneous convictions data I've uploaded two JSON files:

More to come...

ghing commented 10 years ago

@bepetersn, I'll take a look at this today. We might just have to make our own call for mapping the ambiguous ones and document it clearly.

ghing commented 10 years ago

@bepetersn I'm a little confused by these. For example, this mapping from chrgdesc_to_category__multiples.json:

    "ATT.FORGERY": [
        "Battery",
        "Burglary"
    ],

I don't see why this description would match either of these categories. Are the JSON files based on ILCS -> IUCR, on charge description, or on a combination? Do you have code that implements your methodology for generating these mappings?

bepetersn commented 10 years ago

See my code at: https://github.com/sc3/cook-convictions-data/pull/10

Several notes:

ghing commented 10 years ago

@bepetersn I'll take a look at #10.

Thanks for clarifying the "ATT. FORGERY" issue. It sounds like there might be some disconnect for some records between the statute and the chargedesc. I'll take a quick peek and let you know what I find.

It shouldn't matter that you looked at dispositions since that's the source for the convictions anyway. It just means that you're looking through more records.

The mappings from charge description to IUCR category won't capture records that didn't have an IUCR code calculated from statute.

I think the first step moving forward would be to start making our own mapping from (final)_chargedesc to IUCR categories based on looking at the values of the charge description. For instance, "ATT. FORGERY" seems like it clearly maps to the "ATTEMPTED FORGERY" IUCR category. I think we could figure out mappings for most descriptions. Does this make sense to you?

ghing commented 10 years ago

@bepetersn, FYI I've uploaded a recent snapshot of the database to drive.

ghing commented 10 years ago

@bepetersn, I took a look at just the "ATT.FORGERY" case. It seems like there might have been some difficulty parsing the statute field to get the IUCR code/category which your management command was using to grab the categories. This makes sense because, at least for these dispositions, it looks like they tried to cram two different statutes into one field. :crying_cat_face:

    # Run from a Django shell; Disposition is the project's disposition model.
    dispositions = Disposition.objects.filter(final_chrgdesc="ATT.FORGERY")
    for d in dispositions:
        print(d.case_number, d.final_statute)

The output is:

    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR1582701 720-5(8-4)/17-3(A)(2)

720-5/17-3 maps to forgery, so I'm not sure where the Battery and Burglary mappings are coming from.
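One way to pull both statutes out of a field like this is a pair of regular expressions, one per shape seen in the output. This is a rough sketch that only covers those two shapes and drops the trailing subsections such as `(A)(1)`; the pattern names and `split_statutes` are hypothetical, not part of the existing ILCS parser.

```python
import re

# chapter-act/section, e.g. "720-5/17-3"
FULL_CITATION = re.compile(r"(\d+-\d+)/(\d+-\d+)")
# chapter-act(section)/section, e.g. "720-5(8-4)/17-3"
NESTED_SECTION = re.compile(r"(\d+-\d+)\((\d+-\d+)\)/(\d+-\d+)")


def split_statutes(raw):
    """Return the chapter-act/section citations found in a statute field."""
    match = NESTED_SECTION.match(raw)
    if match:
        # Both sections share the chapter-act prefix in this shape.
        act, first, second = match.groups()
        return [f"{act}/{first}", f"{act}/{second}"]
    return [f"{act}/{section}" for act, section in FULL_CITATION.findall(raw)]


for raw in ("720-5/8-4(720-5/17-3(A)(1)", "720-5(8-4)/17-3(A)(2)"):
    print(raw, "->", split_statutes(raw))
```

Both sample values come out as `720-5/8-4` plus `720-5/17-3`, so each citation could then be looked up separately instead of parsing the whole field as a single statute.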

This makes me think that some of the weirder mappings might be due to parsing issues. I wonder if we're better off making our own map of chargedesc to category.

Have you done any more digging into this?

bepetersn commented 10 years ago

Hey @ghing, I did some more digging. The majority of the multiples that we saw before were of the type that you said: coming from parsing errors, whether in the ILCS or IUCR modules. After removing a bug in my chrgdesc2category command (which was allowing some instances of parsing errors to go through to my generated list of multiples), I found that the number of multiples went down to just 45, from around 265.

I believe I also got the cases where a charge description has multiple IUCR codes that all share the same IUCR category to feed into the mapping.

Finally, after adding a check to make sure the category is found in the IUCR crosswalk along the lines of what I talked about in #14, the number of multiples went down to 3 (I might be able to get it to none).
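Roughly, the crosswalk check and the collapsing of same-category cases look like the sketch below. It is not the actual chrgdesc2category code: the function name, the input shapes, and the sample values are assumptions made for illustration.

```python
def resolve_categories(categories_by_desc, crosswalk_categories):
    """Split descriptions into single-category and still-ambiguous groups.

    `categories_by_desc` maps a charge description to the categories seen
    for it; `crosswalk_categories` is the set of valid crosswalk categories.
    Both are hypothetical inputs.
    """
    valid = set(crosswalk_categories)
    singles, multiples = {}, {}
    for desc, categories in categories_by_desc.items():
        # Duplicates collapse in the set; unknown categories are dropped.
        known = set(categories) & valid
        if len(known) == 1:
            singles[desc] = next(iter(known))
        elif len(known) > 1:
            multiples[desc] = sorted(known)
    return singles, multiples


# Made-up sample data, for illustration only.
candidates = {
    "DESC A": ["Forgery", "Forgery", "not-a-category"],
    "DESC B": ["Battery", "Burglary"],
}
singles, multiples = resolve_categories(
    candidates, {"Forgery", "Battery", "Burglary"}
)
print(singles)
print(multiples)
```

Descriptions whose categories are all missing from the crosswalk end up in neither group, so they would need to be tracked separately if they matter.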

I need to run a check to see how many of the convictions I'm actually able to reliably account for using this new mapping of charge descriptions to IUCR categories, but I'm somewhat hopeful. For now, the new chrgdesc_to_category__all.json and __multiples.json files are on the Drive folder.

I'm also going to upload my code tonight.

bepetersn commented 10 years ago

So this time I was able to make a one-to-one mapping from charge description to IUCR category for 80.39% of convictions, which leaves 27,743 missed records. A little bit worse, but I think we could make it better.
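To be clear about what this measures: it is counted per conviction record, not per unique description, so a common description weighs more than a rare one. A minimal sketch of that calculation, with a hypothetical function name and made-up sample data:

```python
def mapping_coverage(descriptions, singles):
    """Return (share mapped, number missed) for conviction descriptions.

    `descriptions` holds one charge description per conviction record and
    `singles` maps descriptions to exactly one category.
    """
    total = len(descriptions)
    missed = sum(1 for desc in descriptions if desc not in singles)
    share = (total - missed) / total if total else 0.0
    return share, missed


# Made-up sample: five conviction records, one with an unmapped description.
records = ["DESC A", "DESC A", "DESC B", "DESC C", "DESC A"]
one_to_one = {"DESC A": "Theft", "DESC B": "Forgery"}
coverage, missed_records = mapping_coverage(records, one_to_one)
print(f"{coverage:.2%} mapped, {missed_records} missed")
```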

ghing commented 10 years ago

@bepetersn, let's hold off on working on this further until I finish a pass on my drug queries so we can figure out the best approach for this. I think we'll want to focus on our areas of interest rather than trying to get a clean category for every charge.

bepetersn commented 10 years ago

Ok. You should see the two new files, though. I've mostly got the mapping created. The multiples are 246 items long. We can really easily roll most of them up into the categorizations you are defining (most of them are going to map to a property crime, a few to sexual assault, etc.) The other 1300-some charge descriptions map to a single category, and we should be able to decide how to roll up these single categories into property/sexual/drug/violent really easily too.

The only thing I really want to do still is turn these JSON files into a CSV table.
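Something along these lines should do it, writing one row per description/category pair. This is a sketch that assumes the JSON is an object keyed by charge description with a list of categories as the value, like the `ATT.FORGERY` entry quoted earlier in this thread; the column names and the function name are placeholders.

```python
import csv
import io
import json


def mapping_to_csv(mapping, out):
    """Write one (chrgdesc, category) row per pair to the file-like `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["chrgdesc", "category"])
    for desc in sorted(mapping):
        categories = mapping[desc]
        # A single category may be stored as a bare string rather than a list.
        if isinstance(categories, str):
            categories = [categories]
        for category in categories:
            writer.writerow([desc, category])


# Entry taken from the multiples example quoted earlier in this thread.
raw = '{"ATT.FORGERY": ["Battery", "Burglary"]}'
buffer = io.StringIO()
mapping_to_csv(json.loads(raw), buffer)
print(buffer.getvalue())
```

Keeping one pair per row means the multiples stay visible in the table and can be filtered or hand-corrected in a spreadsheet.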

ghing commented 10 years ago

@bepetersn, ok. I'll take a look at the new multiples file.

ghing commented 10 years ago

I've been doing some fixes to ILCS statute parsing and also looked through the duplicates and made mappings in this spreadsheet.

In many cases the mapping is genuinely ambiguous but we should be able to map them to our broader categories: violent, property, drug, index/nonindex, etc.
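The idea is that an ambiguous description stops being a problem whenever all of its candidate categories fall into the same broader group. A small sketch of that check follows; the `BROAD_GROUP` assignments are placeholders for illustration, not the mappings from the spreadsheet, and `roll_up` is a hypothetical helper.

```python
# Illustrative assignments only; the real roll-up is a project decision.
BROAD_GROUP = {
    "Burglary": "property",
    "Theft": "property",
    "Battery": "violent",
}


def roll_up(categories, broad_group=BROAD_GROUP):
    """Return the single broad group shared by all categories, else None."""
    groups = {broad_group.get(category) for category in categories}
    # Resolved only if every category is known and they all agree.
    if len(groups) == 1 and None not in groups:
        return groups.pop()
    return None


print(roll_up(["Burglary", "Theft"]))    # same broad group
print(roll_up(["Battery", "Burglary"]))  # still ambiguous
```

Descriptions that still return `None` are the ones that would need a documented judgment call.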