invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters
https://invisible.institute/police-data
157 stars 60 forks source link

Display CSV table(s) as search-able HTML on GitHub Pages #8

Closed rajivsinclair closed 6 years ago

rajivsinclair commented 8 years ago
alexsoble commented 8 years ago

This is a fantastic idea, will make it easier for more people to dive into the data.

jayqi commented 8 years ago

I tried implementing this for shootings-append.csv using the template by @derekeder. Unfortunately, it seems like the dataset is too big for at least my browser to display all at once in this way.

Here's a version with only the first 9 entries from the data file. It works fine.

Here's a version that tries to load the full thing. My browser isn't able to load it (Chrome 50.0.2661.1 on OS X 10.11.3 on a mid 2015 rMBP).

rajivsinclair commented 8 years ago

Any simple alternatives? or should we not bother?

Anyone who wants to can import the CSV into Excel or Google Sheets or your preferred table-searching tool.

I’m just going to close this issue now, but feel free to open it up if you want to try another approach.

derekeder commented 8 years ago

Hey all, yah this file is 52k rows long (and 13mb), which is way too much to render on one page with my javascript template.

A google sheet would probably be the easiest approach.

alexsoble commented 8 years ago

I think a Google Sheet is a fantastic idea. Plenty of people don't know what a .csv is, get intimidated when it loads up thousands and thousands of rows, etc. Making an interface into the data that user friendly for non-technical folks seems like a good idea to me.

alexsoble commented 8 years ago

Re-opening!

Instead of a searchable table for all of shootings-append.csv, we can do a table for https://github.com/DGalt/shootings-data/blob/dev/summary_ipra.csv (only the June Dump).

This is much smaller and more manageable and would be useful. This would be a mirror of the IPRA portal's table, but with many more columns created by linking with the May + April data. Which I believe is what @rajivsinclair may have meant in the first place?

Reference: https://github.com/invinst/shootings-data/issues/6#issuecomment-225315398

alexsoble commented 8 years ago

^^ @DGalt

DGalt commented 8 years ago

@alexsoble will likely start working on this at hacknight tonight. Just a heads up, though, there's probably still a little cleaning that needs to go on with that csv. I have not looked at the cleaned February data yet, so I don't know what, if anything, should be pulled out of that and added to this.

alexsoble commented 8 years ago

Sounds good! On Tue, Jun 14, 2016 at 2:36 PM DGalt notifications@github.com wrote:

@alexsoble https://github.com/alexsoble will likely start working on this at hacknight tonight. Just a heads up, though, there's probably still a little cleaning that needs to go on with that csv. I have not looked at the cleaned February data yet, so I don't know what, if anything, should be pulled out of that and added to this.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/invinst/shootings-data/issues/8#issuecomment-225992122, or mute the thread https://github.com/notifications/unsubscribe/ADD5HecIMqlcLq-LVeiygAx2-yXjx2_Oks5qLwK8gaJpZM4IuYVV .

alexsoble commented 8 years ago

@DGalt I plugged your work-in-progress CSV into the template: https://github.com/invinst/ipra-table. It works fine. Takes a couple of seconds to load, and there's a lot of [nan] data, but it's a very serviceable first draft. Made it clear in the commit + repo that this a work in progress and not a final product. There's no public internet URL yet. We can look at this together at HackNight tonight if you want.

DGalt commented 8 years ago

After our discussion last night I think we need to decide whether we want to continue merging the april and may dumps, or whether we want to keep them separate. Looking at the list of 102 CRIDs in the June ipra dataset, 39 of them do not exist in the May data set, 8 of them do not exist in the April data set, and 4 of them do not exist in the data set composed of the combination of May and April. If we're just trying to provide as much information as possible then we should continue merging the two together, but as Chacyln explained last night these are data sets covering two different things.

As for your other question @alexsoble about percentage of nans in the columns, see below. This is using the combined April and May data set, merged with the files generated by @banoonoo2. The numbers displayed are percentage of rows in the corresponding column that are composed only of nan:

ipra_closed_date                100.000000
ipra_investigate_begin_date     100.000000
recommended_number_of_days       99.019608
discipline_code                  99.019608
penalty_status                   99.019608
finding_id                       96.078431
accused_detail                   94.117647
finding_code                     92.156863
accused_star                     74.509804
accused_appointment_date         72.549020
accused_officer_fname            72.549020
accused_officer_lname            72.549020
accused_position                 72.549020
accused_unit                     72.549020
involved_officer_detail          68.627451
involved_officer_description     60.784314
ipra_investigate_end_date        53.921569
involved_officer_type            38.235294
fileyr                           38.235294
iad_ops                          38.235294
cr_required                      38.235294
converted_rec_i                  38.235294
complaint_date                   38.235294
ipra_investigator_type           38.235294
ipra_assign_date_time            38.235294
police_shooting                  38.235294
police_shooting_no               38.235294
report_status                    38.235294
source                           38.235294
involved_officer_unit            14.705882
FILE_PROBLEMS                    11.764706
IPRA_URL                         11.764706
IPRA_PORTAL_LABEL                11.764706
MEDIA_FILE_TYPE                  11.764706
PUBLIC_URL                       11.764706
PUBLIC_FILE_NAME                 11.764706
ipra_investigator_lastname        8.823529
ipra_investigator_firstname       8.823529
involved_officer_age              8.823529
notification_date                 7.843137
initial_category_code             4.901961
initial_category                  4.901961
current_status                    3.921569
location_code                     3.921569
street                            3.921569
address                           3.921569
beat                              3.921569
current_category                  3.921569
current_category_code             3.921569
incident_time_start               3.921569
involved_officer_sex              3.921569
involved_officer_race             3.921569
involved_officer_position         3.921569
incident_time_end                 3.921569
involved_officer_lname            3.921569
involved_officer_fname            3.921569
SUBJECT_2                         0.000000
NOTIFICATION_DATE                 0.000000
DOCUMENT_COUNT                    0.000000
AUDIO_COUNT                       0.000000
VIDEO_COUNT                       0.000000
SUBJECT_3                         0.000000
JUVENILE_WITHHELD                 0.000000
HIPAA_WITHHELD                    0.000000
INCIDENT_TIME_START               0.000000
SUBJECT_1                         0.000000
DOCS_WITHHELD_NOTES               0.000000
INCIDENT_TYPE                     0.000000
SUBJECT_4                         0.000000

Edit: sorted above by percentage

alexsoble commented 8 years ago

Thanks @dgalt!

alexsoble commented 8 years ago

@dgalt A bunch of the rows have nan in ways I wouldn't necessarily expect, for example [nan 'RONALD'] in row 101 accused_officer_fname. Would you mind bringing me up to date when you have a minute on the best way to interpret these kinds of values? Is this a Python byproduct?

DGalt commented 8 years ago

Right, so, in the full data set there are likely several rows for any one CRID. The way I've organized this is that each CRID has a row, and each column contains a list of the unique values that were in all of the rows for that CRID (so basically the multiple rows are collapsed into a list for any one column). So in the example you give, that row corresponds to complaint_number 1080018.0. You're looking at accused_officer_fname, so in the uncollapsed data set there must have been two rows corresponding to complaint_number 1080018.0 that had in the accused_officer_fname column values of nan and RONALD - meaning that, for whatever reason, one of the entries in the original data set did not have an accused_officer_fname filled in. When I collapse everything for that CRID together you then get that list of [nan, 'RONALD'].

You'll see nan values in a lot of the lists within the various rows/columns - this is again simply because there was a missing value for that particular CRID in that particular column.

We should discuss, though, what to do with those nan values. We can simply drop them to make things look cleaner - I left them in 1. because it was just simpler and 2. because it gives a more accurate representation of how many unique rows are associated with any particular CRID. That may not be something we care about, though - it's relatively trivial to remove them, we just need to decide if that's what we want to do.

alexsoble commented 8 years ago

OK great, thanks for the explanation @DGalt!

alexsoble commented 8 years ago

Still WIP, working on a trimmed slice of the data with 5 columns, 28 rows (rows where we have accused officer names):

https://github.com/invinst/ipra-table/commit/892cdc99fc74b078eab8eaf07b60519146de8813

DGalt commented 8 years ago

how many columns can reasonably be presented before it becomes illegible?

i also wonder if these some way to truncate certain columns that are excessively long (e.g. the ones with urls)

alexsoble commented 8 years ago

Great question, not sure. More columns, more work for reader.