Open rajivsinclair opened 8 years ago
I'm going to start work on the first item, the CSV. I'll grab all the IPRA fields plus note the number of related audio and video files.
Update 6/5 12:35 a.m.: I see there are also PDFs. I'll note the number of them and also download them, since you have to click through on the IPRA website to get to them. The officer data we're after is in these PDFs. They're image PDFs, so everything will have to be retyped. Since there's more than one officer per incident, I think the officer data should go in a separate table, with a line for each officer and the Incident Number to tie each row back to the other incident data.
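The two-table layout described above can be sketched like this (the column names and sample rows are my own illustrative assumptions, not the actual IPRA fields):

```python
import sqlite3

# In-memory sketch of the incident/officer split described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incidents (
    incident_number TEXT PRIMARY KEY,   -- IPRA log number
    incident_date   TEXT,
    n_audio         INTEGER,
    n_video         INTEGER,
    n_pdf           INTEGER
);
CREATE TABLE officers (
    incident_number TEXT REFERENCES incidents(incident_number),
    officer_name    TEXT                -- retyped from the image PDFs
);
""")

# One incident can have several officers; the shared incident_number
# ties each officer row back to the incident data.
conn.execute("INSERT INTO incidents VALUES ('1042532', '2011-01-01', 2, 3, 1)")
conn.executemany("INSERT INTO officers VALUES (?, ?)",
                 [("1042532", "Officer A"), ("1042532", "Officer B")])

rows = conn.execute("""
    SELECT o.officer_name, i.n_pdf
    FROM officers o JOIN incidents i USING (incident_number)
""").fetchall()
print(rows)
```

The join shows both officer rows resolving to the same incident record.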
Update 6/5 2:02 a.m.: The PDFs are helpfully labeled on the IPRA site, but those labels are lost when I download the files without noting them. I'm going to stop downloading until the next step.
Update 6/5 3:19 a.m.: "Scrape published info" is complete. Here's a file with the first step done. It's an XLSX because GitHub wouldn't let me upload a CSV here. invinst-shootings-data-info-per-log-number.xlsx
Update 6/5 5:00 a.m.: "Create an index table" and "Capture media files" are in progress. I'm working on steps two and three: downloading the files and creating a table that captures the labels/captions as well as everything else. Many of the earlier video files do not have the option to download from Vimeo. I know there are some plugins that allow you to grab/record videos. If someone wants to do that to download these videos, let me know, and I will get you a list of the ones needed.
Update 6/5 7:00 a.m. I've done steps two and three for the first 20 incidents (over 400 files!). I want to clean up the file names before uploading them to archive.org. I need some rest before I do that or continue with incidents 21+. Here's a file with the start of the table: invinst-shootings-data-archived-media-list-1042532-to-1065582.xlsx
I forgot to mention that we should upload the PDF documents to DocumentCloud, which will automatically OCR them and attempt (imperfectly) to extract recognizable entities.
Ooh, DocumentCloud is awesome-sauce! I've uploaded 168 PDFs to DocumentCloud and around 160 audio files to Archive.org at https://archive.org/details/iprashootings06032016. Those cover the oldest ~20 incidents. (I think 2 or 3 audio files may not have uploaded, but I'm going to double back for QA once everything's done.)
I will keep working on "Create an index table". I think it's good to have one person focused on that for consistency. It's pretty quick work to grab the links to items. Update 6/5 7:36 p.m.: Halfway done! Here's an updated file: invinst-shootings-data-archived-media-list-1042532-to-1071635.xlsx. Look for new files on DocumentCloud and Archive.org soon. Update 6/6 1:47 a.m.: Calling it a night. Here's the latest file: invinst-shootings-data-archived-media-list-1042532-to-1076367.xlsx.
If you want to help, please work on "Capture media files." It's proving very time- and bandwidth-intensive to try to do it at the same time as creating the index table. For Vimeo and Soundcloud, just follow the links in Rajiv's OP and hit the download button over and over. For the PDFs, you have to click into each incident. Please post here so I know you're doing it. Update: Also see my note above about needing to use a browser plugin to grab/record many of the older Vimeo files.
"Create an index table" and "Capture media files" are nearly done. This version includes all the files for the 101 incidents in the June 3 dump.
It lists:
It still needs:
invinst-shootings-data-archived-media-list-all-1042532-to-1079743.xlsx
This evening I'll work on:
@rajivsinclair There should be some discussion around these items:
Some data is already here: https://github.com/redshiftzero/ipradata/tree/master/pdfs via @redshiftzero's IPRA scraper
I've uploaded all 617 PDFs to DocumentCloud, so feel free to start searching these files (try searching "TEXT: Sierra"). https://www.documentcloud.org/search/Project:%20%22IPRA%20Shootings%20June%203rd%22
I'm waiting on audio and video files to finish uploading to Archive.org. Then I'll be able to add their URLs to complete the index table.
Once those individual files are uploaded, I'm also planning to zip up the files and make them available for bulk download so no one else has to do all this clickety clicking to get them all!
@freddymartinez9 @redshiftzero That's pretty cool! I've managed to manually download all the PDFs and audio. Is there any way the scraper can grab videos without download links off of Vimeo? There are 326 videos total, and I was only able to download 7 of them.
@banoonoo2 why is scraping video not an option? It actually seems to me like the best approach: I can pull all the metadata (the log number, for example). Looking through the source, it looks like this is a simple iframe that loads the video from player.vimeo.com
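A minimal sketch of that iframe extraction, assuming the incident page embeds the player roughly like the made-up HTML below (the real page markup may differ):

```python
import re

# Stand-in for the HTML an IPRA incident page might serve; the div,
# attributes, and video ID here are invented for illustration.
sample_html = """
<div class="media">
  <iframe src="https://player.vimeo.com/video/123456789" width="640"></iframe>
</div>
"""

def vimeo_urls(html):
    """Return every player.vimeo.com iframe src found in the page."""
    return re.findall(
        r'<iframe[^>]+src="(https://player\.vimeo\.com/[^"]+)"', html
    )

found = vimeo_urls(sample_html)
print(found)
```

Once the player URLs are collected, a downloader (or one of the browser plugins mentioned above) can work from that list instead of clicking through each incident.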
@freddymartinez9 Basically I've been doing things manually because (1) I'm not savvy to scraping software beyond commercially available blunt instruments like SiteSucker and (2) I was concerned about crashing IPRA's little WordPress site and rendering everything unavailable. But now (1) you have the know-how and (2) I've captured everything easily available from IPRA's site and (3) the stuff that needs to be scraped is from Vimeo (need titles, metadata, and files) and SoundCloud (need titles and metadata only, have files), which should be able to stand up to lots of requests, so go for it!
@banoonoo2 @rajivsinclair I am working on an IPRA scraper in my local repo here; it needs to be able to access the URI and, from there, scrape all videos from the IPRA Vimeo account: https://github.com/freddymartinez9/shootings-data/blob/master/IPRAVideoScaper.rb
@freddymartinez9 you may also want to include the summary abstracts that IPRA publishes each month
here are some summary abstracts that were extracted using the far-from-perfect scraper built into CPDB
✓ Scrape published info - Completed 6/5
✓ Capture media files - Completed 6/9
✓ Create an index table - Completed 6/9
I've posted CSVs and documentation on Slack, and I uploaded all the downloadable files to DocumentCloud and Archive.org.
I'm ready to pass the torch to somebody who has the patience for JavaScript!
Wow — thank you @banoonoo2! Looking into this now. Hugely appreciate the work.
@banoonoo2 Quick question about DocumentCloud. When I click this link, it redirects me to the DocumentCloud homepage: https://www.documentcloud.org/search/Project:%20%22IPRA%20Shootings%20June%203rd%22
Am I looking at the wrong link, or is there not a public URL yet? Did you have to create an account with DocumentCloud to upload documents?
Yes, the link is broken @alexsoble
@alexsoble @freddymartinez9 Sorry! Try this link: https://www.documentcloud.org/public/search/%22Project%20ID%22:%20%2227319%22 Or go to the DocumentCloud homepage, click the "Search" option at the top, select "PROJECT ID:" from the search box dropdown, and enter "27319" to search for the project by its unique ID number. It should pull up 626 documents: the count is shown at the bottom of the page.
Yes, RS invited me to the DocumentCloud project, and I had to create an account to upload the documents. That original link works when I'm logged in. I made all the documents public, so the search above and the URLs in the indices work without being logged in. I'm a DocumentCloud n00b, so I don't know about creating a proper public URL for the project. Maybe @rajivsinclair can help?
I've gone ahead and created what is essentially a summary file for the data we have from April and May, combined with these files, organized (grouped) by CRID. If this were searchable, you'd be able to look up a CRID and get fields listing the information available for that CRID, sourced from the April and May dumps as well as the two files @banoonoo2 generated.
You can see the file here; a visual might be more helpful than my description: https://github.com/DGalt/shootings-data/blob/dev/summary_ipra.csv
I still need to incorporate the February data, but April and May cover 98 of the 102 CRIDs available in the 2016-06-03-complaint-index file (the document-index file only has 87 CRIDs, all of which are covered by the April and May data dumps).
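The grouping step behind that summary file can be sketched like this; the input rows and field names below are made up for illustration, not the real dump columns:

```python
from collections import defaultdict

# Toy per-file rows, one per item found in a dump or index.
rows = [
    {"crid": "1042532", "source": "april_dump", "item": "complaint"},
    {"crid": "1042532", "source": "media_index", "item": "audio"},
    {"crid": "1065582", "source": "may_dump", "item": "complaint"},
]

# Collapse the rows into one record per CRID, listing everything
# available for it across all sources.
summary = defaultdict(list)
for row in rows:
    summary[row["crid"]].append((row["source"], row["item"]))
```

Each CRID now maps to the full list of what we hold for it, which is the shape summary_ipra.csv describes: one row per CRID, with fields enumerating the available material.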
@banoonoo2 Thanks for the updated link! I dropped it into the wiki, but obviously it could use more love and explanation. Do you have the ability to edit the wiki @banoonoo2? I think that could be a good place for us to put public-facing work and resources, especially for folks who have a lot to contribute but would rather skip the PR workflow.
Wow, thanks @DGalt! You mention making it searchable... I wonder if this table is a good candidate for using CSV-to-HTML Table? http://derekeder.github.io/csv-to-html-table/
@alexsoble I edited the wiki page to add some detail about the archive links.
@alexsoble that's what I was thinking. I was going to take a crack at it this week, although JavaScript is definitely not one of my strengths ;)
Thanks @banoonoo2! Excellent documentation! And apologies if I was a bit confusing with the Slack vs. PR vs. Wiki stuff. We're figuring it out! 👍
@DGalt OK cool. I'll break it out into a separate issue. And no worries! I don't have much background in Python or R but I've built a few projects with Javascript, so maybe we can learn from each other. 😊
Just to make sure this thread is up-to-date, I've whipped up a script to scrape the information and links. Still need to make this more robust and add functionality to track changes over time.
Scrape published info
Capture media files
Create an index table
Make it easily searchable (see #8)