Open rajivsinclair opened 8 years ago
I'm going to start work on the first item, the CSV. I'll grab all the IPRA fields plus note the number of related audio and video files.
Update 6/5 12:35 a.m.: I see there are also PDFs. I'll note the number of them and also download them, since you have to click through on the IPRA website to get to them. The officer data we're after is in these PDFs. They're image PDFs, so everything will have to be retyped. Since there's more than one officer per incident, I think the officer data should go in a separate table, with a line for each officer and the Incident Number to tie each row back to the other incident data.
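The two-table layout described above can be sketched like this (the column names and sample rows are my own illustrative assumptions, not the actual IPRA fields):

```python
import sqlite3

# In-memory sketch of the incident/officer split described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incidents (
    incident_number TEXT PRIMARY KEY,   -- IPRA log number
    incident_date   TEXT,
    n_audio         INTEGER,
    n_video         INTEGER,
    n_pdf           INTEGER
);
CREATE TABLE officers (
    incident_number TEXT REFERENCES incidents(incident_number),
    officer_name    TEXT                -- retyped from the image PDFs
);
""")

# One incident can have several officers; the shared incident_number
# ties each officer row back to the incident data.
conn.execute("INSERT INTO incidents VALUES ('1042532', '2011-01-01', 2, 3, 1)")
conn.executemany("INSERT INTO officers VALUES (?, ?)",
                 [("1042532", "Officer A"), ("1042532", "Officer B")])

rows = conn.execute("""
    SELECT o.officer_name, i.n_pdf
    FROM officers o JOIN incidents i USING (incident_number)
""").fetchall()
print(rows)
```

The join shows both officer rows resolving to the same incident record.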
Update 6/5 2:02 a.m.: The PDFs are helpfully labeled on the IPRA site, but those labels are lost when I download the files without noting them. I'm going to stop downloading until the next step.
Update 6/5 3:19 a.m.: "Scrape published info" is complete. Here's a file with the first step done. It's an XLSX because GitHub wouldn't let me upload a CSV here. invinst-shootings-data-info-per-log-number.xlsx
Update 6/5 5:00 a.m.: "Create an index table" and "Capture media files" are in progress. I'm working on steps two and three: downloading the files and creating a table that captures the labels/captions as well as everything else. Many of the earlier video files do not have the option to download from Vimeo. I know there are some plugins that allow you to grab/record videos. If someone wants to do that to download these videos, let me know, and I will get you a list of the ones needed.
Update 6/5 7:00 a.m. I've done steps two and three for the first 20 incidents (over 400 files!). I want to clean up the file names before uploading them to archive.org. I need some rest before I do that or continue with incidents 21+. Here's a file with the start of the table: invinst-shootings-data-archived-media-list-1042532-to-1065582.xlsx
I forgot to mention that we should upload the PDF documents to DocumentCloud, which will automatically OCR them and attempt (imperfectly) to extract recognizable entities.
Ooh, DocumentCloud is awesome-sauce! I've uploaded 168 PDFs to DocumentCloud and around 160 audio files to Archive.org at https://archive.org/details/iprashootings06032016. Those cover the oldest ~20 incidents. (I think 2 or 3 audio files may not have uploaded, but I'm going to double back for QA once everything's done.)
I will keep working on "Create an index table". I think it's good to have one person focused on that for consistency. It's pretty quick work to grab the links to items. Update 6/5 7:36 p.m.: Halfway done! Here's an updated file: invinst-shootings-data-archived-media-list-1042532-to-1071635.xlsx. Look for new files on DocumentCloud and Archive.org soon. Update 6/6 1:47 a.m.: Calling it a night. Here's the latest file: invinst-shootings-data-archived-media-list-1042532-to-1076367.xlsx.
If you want to help, please work on "Capture media files." It's proving very time- and bandwidth-intensive to try to do it at the same time as creating the index table. For Vimeo and Soundcloud, just follow the links in Rajiv's OP and hit the download button over and over. For the PDFs, you have to click into each incident. Please post here so I know you're doing it. Update: Also see my note above about needing to use a browser plugin to grab/record many of the older Vimeo files.
"Create an index table" and "Capture media files" are nearly done. This version includes all the files for the 101 incidents in the June 3 dump.
It lists:
It still needs:
invinst-shootings-data-archived-media-list-all-1042532-to-1079743.xlsx
This evening I'll work on:
@rajivsinclair There should be some discussion around these items:
Some data is already here: https://github.com/redshiftzero/ipradata/tree/master/pdfs via @redshiftzero's IPRA scraper
I've uploaded all 617 PDFs to DocumentCloud, so feel free to start searching these files (try searching "TEXT: Sierra"). https://www.documentcloud.org/search/Project:%20%22IPRA%20Shootings%20June%203rd%22
I'm waiting on audio and video files to finish uploading to Archive.org. Then I'll be able to add their URLs to complete the index table.
Once those individual files are uploaded, I'm also planning to zip up the files and make them available for bulk download so no one else has to do all this clickety clicking to get them all!
@freddymartinez9 @redshiftzero That's pretty cool! I've managed to manually download all the PDFs and audio. Is there any way the scraper can grab videos without download links off of Vimeo? There are 326 videos total, and I was only able to download 7 of them.
@banoonoo2 why is scraping video not an option? It actually seems to me like the best approach: I can pull all the metadata (the log number, for example). Looking through the source, it looks like this is a simple iframe that loads the video from player.vimeo.com
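A minimal sketch of that iframe extraction, assuming the incident page embeds the player roughly like the made-up HTML below (the real page markup may differ):

```python
import re

# Stand-in for the HTML an IPRA incident page might serve; the div,
# attributes, and video ID here are invented for illustration.
sample_html = """
<div class="media">
  <iframe src="https://player.vimeo.com/video/123456789" width="640"></iframe>
</div>
"""

def vimeo_urls(html):
    """Return every player.vimeo.com iframe src found in the page."""
    return re.findall(
        r'<iframe[^>]+src="(https://player\.vimeo\.com/[^"]+)"', html
    )

found = vimeo_urls(sample_html)
print(found)
```

Once the player URLs are collected, a downloader (or one of the browser plugins mentioned above) can work from that list instead of clicking through each incident.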
@freddymartinez9 Basically I've been doing things manually because (1) I'm not savvy to scraping software beyond commercially available blunt instruments like SiteSucker and (2) I was concerned about crashing IPRA's little WordPress site and rendering everything unavailable. But now (1) you have the know-how and (2) I've captured everything easily available from IPRA's site and (3) the stuff that needs to be scraped is from Vimeo (need titles, metadata, and files) and SoundCloud (need titles and metadata only, have files), which should be able to stand up to lots of requests, so go for it!
@banoonoo2 @rajivsinclair I am working on an IPRA scraper in my local repo here; it needs to be able to access the URI and, from there, scrape all videos from the IPRA Vimeo account: https://github.com/freddymartinez9/shootings-data/blob/master/IPRAVideoScaper.rb
@freddymartinez9 you may also want to include the summary abstracts that IPRA publishes each month
here are some summary abstracts that were extracted using the far-from-perfect scraper built into CPDB
✓ Scrape published info - Completed 6/5
✓ Capture media files - Completed 6/9
✓ Create an index table - Completed 6/9
I've posted CSVs and documentation on Slack, and I uploaded all the downloadable files to DocumentCloud and Archive.org.
I'm ready to pass the torch to somebody who has the patience for JavaScript!
Wow — thank you @banoonoo2! Looking into this now. Hugely appreciate the work.
@banoonoo2 Quick question about DocumentCloud. When I click this link, it redirects me to the DocumentCloud homepage: https://www.documentcloud.org/search/Project:%20%22IPRA%20Shootings%20June%203rd%22
Am I looking at the wrong link, or is there not a public URL yet? Did you have to create an account with DocumentCloud to upload documents?
Yes, the link is broken @alexsoble
@alexsoble @freddymartinez9 Sorry! Try this link: https://www.documentcloud.org/public/search/%22Project%20ID%22:%20%2227319%22 Or go to the DocumentCloud homepage, click the "Search" option at the top, select "PROJECT ID:" from the search box dropdown, and enter "27319" to search for the project by its unique ID number. It should pull up 626 documents: the count is shown at the bottom of the page.
Yes, RS invited me to the DocumentCloud project, and I had to create an account to upload the documents. That original link works when I'm logged in. I made all the documents public, so the search above and the URLs in the indices work without being logged in. I'm a DocumentCloud n00b, so I don't know about creating a proper public URL for the project. Maybe @rajivsinclair can help?
I've gone ahead and created what is essentially a summary file for the data we have from April and May, combined with these files, organized (grouped) by CRID. If this were searchable, you'd be able to look up a CRID and get fields listing the information available for that CRID, sourced from the April and May dumps as well as the two files @banoonoo2 generated.
You can see the file here; a visual might be more helpful than my description: https://github.com/DGalt/shootings-data/blob/dev/summary_ipra.csv
I still need to incorporate the February data, but April and May cover 98 of the 102 CRIDs available in the 2016-06-03-complaint-index file (the document-index file only has 87 CRIDs, all of which are covered by the April and May data dumps).
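The grouping step behind that summary file can be sketched like this; the input rows and field names below are made up for illustration, not the real dump columns:

```python
from collections import defaultdict

# Toy per-file rows, one per item found in a dump or index.
rows = [
    {"crid": "1042532", "source": "april_dump", "item": "complaint"},
    {"crid": "1042532", "source": "media_index", "item": "audio"},
    {"crid": "1065582", "source": "may_dump", "item": "complaint"},
]

# Collapse the rows into one record per CRID, listing everything
# available for it across all sources.
summary = defaultdict(list)
for row in rows:
    summary[row["crid"]].append((row["source"], row["item"]))
```

Each CRID now maps to the full list of what we hold for it, which is the shape summary_ipra.csv describes: one row per CRID, with fields enumerating the available material.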
@banoonoo2 Thanks for the updated link! I dropped it into the wiki, but obviously it could use more love and explanation. Do you have the ability to edit the wiki @banoonoo2? I think that could be a good place for us to put public-facing work and resources, especially for folks who have a lot to contribute but would rather skip the PR workflow.
Wow, thanks @DGalt! You mention making it searchable... I wonder if this table is a good candidate for using CSV-to-HTML Table? http://derekeder.github.io/csv-to-html-table/
@alexsoble I edited the wiki page to add some detail about the archive links.
@alexsoble that's what I was thinking. I was going to take a crack at it this week, although JavaScript is definitely not one of my strengths ;)
Thanks @banoonoo2! Excellent documentation! And apologies if I was a bit confusing with the Slack vs. PR vs. Wiki stuff. We're figuring it out! 👍
@DGalt OK cool. I'll break it out into a separate issue. And no worries! I don't have much background in Python or R but I've built a few projects with Javascript, so maybe we can learn from each other. 😊
Just to make sure this thread is up-to-date, I've whipped up a script to scrape the information and links. Still need to make this more robust and add functionality to track changes over time.
Scrape published info
Capture media files
Create an index table
Make it easily searchable (see #8)