Map charge descriptions to crosswalk categories

bepetersn commented 10 years ago

For statutes with multiple IUCR possibilities, make a CSV mapping between charge descriptions and crosswalk categories. Also map to IUCR codes, if an obvious mapping can be made, and eventually to our own categories of interest.

bepetersn commented 10 years ago

@ghing,

It turns out that charge descriptions have more reliable mappings to IUCR categories than ILCS statutes do, but they're still not perfect. 15.6% of unique charge descriptions map to multiple IUCR categories.

Under Miscellaneous convictions data I've uploaded two JSON files:

one containing the mapping of all charge descriptions to the IUCR categories they appear with in our data
one containing the same mapping, filtered to ones where there are multiple IUCR categories for a given charge description.
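
For reference, here is a rough sketch of how the two files relate, assuming the input is a list of (charge description, IUCR category) pairs pulled from the dispositions. The function names and the "RETAIL THEFT" rows are made up for illustration; this is not the actual management command.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Group the IUCR categories seen with each charge description."""
    mapping = defaultdict(set)
    for chrgdesc, category in pairs:
        mapping[chrgdesc].add(category)
    # Sorted lists serialize cleanly to JSON.
    return {desc: sorted(cats) for desc, cats in mapping.items()}


def only_multiples(mapping):
    """Keep charge descriptions that map to more than one category."""
    return {desc: cats for desc, cats in mapping.items() if len(cats) > 1}


# Illustrative sample rows, not a dump of the real data.
pairs = [
    ("ATT.FORGERY", "Battery"),
    ("ATT.FORGERY", "Burglary"),
    ("RETAIL THEFT", "Theft"),
    ("RETAIL THEFT", "Theft"),
]
all_mapping = build_mapping(pairs)
multiples = only_multiples(all_mapping)
# Share of unique charge descriptions with more than one category.
share = len(multiples) / len(all_mapping)
```

The first dictionary corresponds to the "all" file and the second to the "multiples" file; the share at the end is the kind of figure quoted above.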

More to come...

ghing commented 10 years ago

@bepetersn. I'll take a look at this today. We might just have to make our own call for mapping the ambiguous ones and document it clearly.

ghing commented 10 years ago

@bepetersn I'm a little confused by these. For example, this mapping from chrgdesc_to_category__multiples.json:

    "ATT.FORGERY": [
        "Battery",
        "Burglary"
    ],

I don't see why this description would match either of these categories. Are the JSON files based on ILCS -> IUCR, on charge description, or on a combination? Do you have code that implements your methodology for generating these mappings?

bepetersn commented 10 years ago

See my code at: https://github.com/sc3/cook-convictions-data/pull/10

Several notes:

I looked at dispositions here, instead of convictions, if that matters. My database is out of date in terms of some of the "final" fields you added, and as a result I couldn't figure out how to access these fields from the Conviction model.
With regard to your question about that example of "ATT.FORGERY" in the charge description field and "Battery" and "Burglary" in the IUCR categories, it is weird. Running the SQL on my database, however, I can confirm that this is what I see.
Finally, if my belief is correct, this mapping from charge descriptions to IUCR categories doesn't cover any dispositions that couldn't be given an IUCR code, which I think is where we got the category from. Thus it doesn't yet fulfill our goal of getting more categories from the data. Realizing this now, I need to iterate on this and build a fuller list by starting with statutes again.

ghing commented 10 years ago

@bepetersn I'll take a look at #10.

Thanks for clarifying the "ATT.FORGERY" issue. It sounds like there might be some disconnect for some records between the statute and the chargedesc. I'll take a quick peek and let you know what I find.

It shouldn't matter that you looked at dispositions since that's the source for the convictions anyway. It just means that you're looking through more records.

The mappings from charge description to IUCR category won't capture records that didn't have an IUCR code calculated from statute.

I think the first step moving forward would be to start making our own mapping from (final)_chargedesc to IUCR categories based on looking at the values of the charge description. For instance, "ATT.FORGERY" seems like it clearly maps to the "ATTEMPTED FORGERY" IUCR category. I think we could figure out mappings for most descriptions. Does this make sense to you?

ghing commented 10 years ago

@bepetersn, FYI I've uploaded a recent snapshot of the database to drive.

ghing commented 10 years ago

@bepetersn, I took a look at just the "ATT.FORGERY" case. It seems like there might have been some difficulty parsing the statute field to get the IUCR code/category which your management command was using to grab the categories. This makes sense because, at least for these dispositions, it looks like they tried to cram two different statutes into one field. :crying_cat_face:

    dispositions = Disposition.objects.filter(final_chrgdesc="ATT.FORGERY")
    for d in dispositions:
        print(d.case_number, d.final_statute)

The output is:

    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR1582701 720-5(8-4)/17-3(A)(2)

720-5/17-3 maps to forgery, so I'm not sure where the Battery and Burglary mappings are coming from.

This makes me think that some of the more weird mappings might be due to parse issues. I wonder if we're better off making our own map of chargedesc to category.
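
One way to spot these fields before they reach the IUCR lookup might look like the sketch below. It assumes a single citation has the shape chapter-act/section (for example 720-5/17-3). The regex and the helper names are hypothetical and are not how the existing ILCS parser works.

```python
import re

# Assumed shape of one statute citation: chapter-act/section, e.g. 720-5/17-3.
STATUTE_RE = re.compile(r"(\d+)-(\d+(?:\.\d+)?)/(\d+[A-Z]?-\d+(?:\.\d+)?)")


def find_statutes(raw):
    """Return every chapter-act/section citation found in a statute field."""
    return ["{}-{}/{}".format(*groups) for groups in STATUTE_RE.findall(raw)]


def needs_review(raw):
    """Flag fields that do not contain exactly one recognizable citation."""
    return len(find_statutes(raw)) != 1


# Two of the statute values from the output above.
crammed = find_statutes("720-5/8-4(720-5/17-3(A)(1)")
unparsed = find_statutes("720-5(8-4)/17-3(A)(2)")
```

With this pattern the first value yields two citations and the second yields none, so both would be flagged for manual review rather than silently mapped.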

Have you done any more digging into this?

bepetersn commented 10 years ago

Hey @ghing, I did some more digging. The majority of the multiples that we saw before were of the type that you said: coming from parsing errors, whether in the ILCS or IUCR modules. After removing a bug in my chrgdesc2category command (which was allowing some instances of parsing errors to go through to my generated list of multiples), I found that the number of multiples went down to just 45, from around 265.

I believe the mapping now also includes cases where a charge description was associated with multiple IUCR codes that all share the same IUCR category.

Finally, after adding a check to make sure the category is found in the IUCR crosswalk along the lines of what I talked about in #14, the number of multiples went down to 3 (I might be able to get it to none).
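
Roughly, the two cleanup steps described above (collapsing several IUCR codes that share a category, and keeping only categories found in the crosswalk) could be sketched like this. The `clean_mapping` name, the `crosswalk_categories` set, and all sample values are assumptions for illustration, not the code in the command.

```python
def clean_mapping(mapping, crosswalk_categories):
    """Drop categories missing from the crosswalk and collapse repeats."""
    cleaned = {}
    for chrgdesc, categories in mapping.items():
        # A set collapses several IUCR codes that share one category.
        valid = {c for c in categories if c in crosswalk_categories}
        if valid:
            cleaned[chrgdesc] = sorted(valid)
    return cleaned


# Hypothetical values for illustration only.
crosswalk_categories = {"Forgery", "Theft"}
raw_mapping = {
    "ATT.FORGERY": ["Forgery", "Forgery", "FORGERY (PARSE ERROR)"],
    "RETAIL THEFT": ["Theft"],
    "UNKNOWN CHARGE": ["Not A Category"],
}
cleaned = clean_mapping(raw_mapping, crosswalk_categories)
# Anything still listing more than one category is a genuine multiple.
multiples = {d: c for d, c in cleaned.items() if len(c) > 1}
```

Charge descriptions left with no valid category are dropped entirely here; whether to keep them for manual mapping instead is a separate decision.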

I need to run a check to see how many of the convictions I'm actually able to reliably account for using this new mapping of charge descriptions to IUCR categories, but I'm somewhat hopeful. For now, the new chrgdesc_to_category__all.json and __multiples.json files are on the Drive folder.

I'm also going to upload my code tonight.

bepetersn commented 10 years ago

So the share of convictions for which I was able to make a one-to-one mapping from charge description to IUCR category was 80.39% this time, which leaves 27,743 missed records. That's a little bit worse, but I think we could make it better.
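
For clarity on how a figure like this can be computed, here is a minimal sketch. It assumes one charge description per conviction and counts a record as covered only when its description maps to exactly one category. The `coverage` function and the sample records are hypothetical.

```python
def coverage(chrgdescs, mapping):
    """Return (share mapped one-to-one, number of missed records)."""
    mapped = sum(1 for desc in chrgdescs if len(mapping.get(desc, [])) == 1)
    missed = len(chrgdescs) - mapped
    share = mapped / len(chrgdescs) if chrgdescs else 0.0
    return share, missed


# Hypothetical records: one charge description per conviction.
chrgdescs = ["RETAIL THEFT", "RETAIL THEFT", "ATT.FORGERY", "UNKNOWN CHARGE"]
mapping = {"RETAIL THEFT": ["Theft"], "ATT.FORGERY": ["Battery", "Burglary"]}
share, missed = coverage(chrgdescs, mapping)
```

Records whose description is absent from the mapping and records whose description is ambiguous both count as missed in this sketch.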

ghing commented 10 years ago

@bepetersn, let's hold off on working on this further until I finish a pass on my drug queries so we can figure out the best approach for this. I think we'll want to focus on our areas of interest rather than trying to get a clean category for every charge.

bepetersn commented 10 years ago

Ok. You should see the two new files, though. I've mostly got the mapping created. The multiples are 246 items long. We can easily roll most of them up into the categorizations you are defining (most of them are going to map to a property crime, a few to sexual assault, etc.). The other 1300-some charge descriptions map to a single category, and we should be able to decide how to roll up these single categories into property/sexual/drug/violent easily too.

The only thing I really want to do still is turn these JSON files into a CSV table.
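
A minimal sketch of that conversion using only the standard library, assuming the JSON maps each charge description to a list of categories as in the example earlier in this thread. The column names and the `mapping_to_csv` helper are my own choices, not an agreed format.

```python
import csv
import io
import json


def mapping_to_csv(mapping, out):
    """Write one row per (charge description, category) pair."""
    writer = csv.writer(out)
    writer.writerow(["chrgdesc", "category"])
    for chrgdesc in sorted(mapping):
        for category in mapping[chrgdesc]:
            writer.writerow([chrgdesc, category])


# Inline JSON stands in for the contents of one of the JSON files.
mapping = json.loads('{"ATT.FORGERY": ["Battery", "Burglary"]}')
buf = io.StringIO()
mapping_to_csv(mapping, buf)
csv_text = buf.getvalue()
```

One row per pair keeps the table in long form, which makes it easy to add columns later for IUCR codes or our own categories. To write a real file, pass a file opened with `newline=""` instead of the `StringIO` buffer.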

ghing commented 10 years ago

@bepetersn, ok. I'll take a look at the new multiples file.

ghing commented 10 years ago

I've been doing some fixes to ILCS statute parsing and also looked through the duplicates and made mappings in this spreadsheet.

In many cases the mapping is genuinely ambiguous but we should be able to map them to our broader categories: violent, property, drug, index/nonindex, etc.
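
As a sketch of how the roll-up could resolve some of the ambiguous descriptions: if every IUCR category for a description falls into the same broad category, the ambiguity goes away. The `ROLLUP` table below is hypothetical and only shows the shape; the real assignments are the ones in the spreadsheet.

```python
# Hypothetical roll-up table; the real assignments live in the spreadsheet.
ROLLUP = {
    "Burglary": "property",
    "Theft": "property",
    "Battery": "violent",
}


def broad_categories(categories, rollup=ROLLUP):
    """Map IUCR categories to the set of broad categories they fall under."""
    return {rollup[c] for c in categories if c in rollup}


def resolve(categories, rollup=ROLLUP):
    """Return the single broad category, or None if ambiguous or unmapped."""
    broad = broad_categories(categories, rollup)
    return next(iter(broad)) if len(broad) == 1 else None


same_bucket = resolve(["Burglary", "Theft"])
still_ambiguous = resolve(["Battery", "Burglary"])
```

Descriptions that come back as `None` would still need a manual call, documented alongside the mapping.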

sc3 / cook-convictions-data

Map charge descriptions to crosswalk categories #6