jpmckinney / information_request_summaries_and_responses

Collects information request summaries and responses
MIT License

Information Request Summaries and Responses

Every government body in Canada is subject to a freedom of information statute. Some bodies publish summaries of completed information requests; fewer publish the responses to those requests. This repository contains scripts for aggregating what is available.

Dependencies

brew install media-info libtiff poppler
sudo PIP_REQUIRE_VIRTUALENV=false pip install csvkit

See also the dependencies of docsplit and pdfshaver. You may need to use this Homebrew formula for PDFium (see PR).

Scripts

To collect the dataset, run the following scripts in the order shown.

Download the single-file sources to the wip/ directory:

PYTHONWARNINGS=ignore bundle exec rake datasets:download

Or, download one jurisdiction:

bundle exec rake datasets:download jurisdiction=ca

Run the British Columbia, Newfoundland and Labrador, and municipal scripts to download the multiple-file sources.

Normalize the summaries to the summaries directory:

bundle exec rake datasets:normalize

Or, normalize one jurisdiction:

bundle exec rake datasets:normalize jurisdiction=ca

Reconcile NL's scraped data with its open data, rewriting its files in the summaries directory:

ruby ca_nl_scraper.rb -v -a reconcile

Validate values according to jurisdiction-specific rules:

bundle exec rake datasets:validate:values

Validate that the decision and the number of pages agree:

bundle exec rake datasets:validate:datasets
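
The ordered steps above can be chained in a small wrapper that stops at the first failure. This is only a sketch and not part of the repository: the names `run_step` and `collect` and the `DRY_RUN` variable are hypothetical, and it assumes the multiple-file sources have already been downloaded with the British Columbia, Newfoundland and Labrador, and municipal scripts.

```shell
# Hypothetical helper: print a command, then run it unless DRY_RUN=1.
run_step() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Run the collection steps in the documented order. `|| return` stops at the
# first step that fails and passes its exit status to the caller.
# Assumption: the multiple-file scrapers have already been run.
collect() {
  run_step env PYTHONWARNINGS=ignore bundle exec rake datasets:download || return
  run_step bundle exec rake datasets:normalize || return
  run_step ruby ca_nl_scraper.rb -v -a reconcile || return
  run_step bundle exec rake datasets:validate:values || return
  run_step bundle exec rake datasets:validate:datasets
}
```

After sourcing these functions in bash, `DRY_RUN=1 collect` prints the five commands without running them, which is a cheap way to check the order before a long run; `collect` runs them.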

To find additional sources, search for datasets across multiple catalogs with Namara.io:

bundle exec rake datasets:search query="freedom of information"

British Columbia

Note: British Columbia sometimes publishes an incorrect file size. We therefore calculate the correct value.
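
One way to check a published size against a downloaded file is to count the file's bytes directly. This is a sketch only: the repository's scripts may calculate the value differently, the function names are hypothetical, and it assumes the published size is expressed in bytes.

```shell
# Print the size of a file in bytes. `wc -c` is POSIX; `tr` strips the
# padding that some implementations add around the number.
actual_size() {
  wc -c < "$1" | tr -d '[:space:]'
}

# Succeed if the published size (assumed to be in bytes) matches the file.
size_matches() {
  [ "$(actual_size "$1")" = "$2" ]
}
```

For example, `size_matches response.pdf 12345 || echo "size differs"` flags a file whose published size is wrong; `response.pdf` and `12345` are placeholders.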

Download the metadata for responses:

ruby ca_bc_scraper.rb

openinfo.bc.ca sometimes redirects to another page and then back to the original page, which then returns HTTP 200. However, the cache has already stored an HTTP 302 response for the original page, so the script reaches its redirect limit. If a FaradayMiddleware::RedirectLimitReached error occurs, the simplest fix is to temporarily move the _cache directory. To avoid losing time to a late error, it is best to scrape and import one month at a time:

for month in {7..12}; do echo 2011-$month; ruby ca_bc_scraper.rb -q -- date 2011-$month; done
for year in {2012..2015}; do for month in {1..12}; do echo $year-$month; ruby ca_bc_scraper.rb -q -- date $year-$month; done; done
ruby ca_bc_scraper.rb -q -- date `date +%Y-%m`
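
The loops above hard-code the years. A more general sketch generates the months between two endpoints and stops at the first failing month, so a late error does not hide behind later output. The function names `month_range` and `scrape_months` are hypothetical, and the sketch assumes the scraper accepts zero-padded months (as in the `date +%Y-%m` call above) as well as unpadded ones.

```shell
# Print every month from a start month through an end month, inclusive,
# as zero-padded YYYY-MM values. A leading zero on the month arguments
# (e.g. 08) is stripped so that they are read as decimal numbers.
month_range() {
  local y=$1 m=${2#0} end_y=$3 end_m=${4#0}
  while [ "$y" -lt "$end_y" ] || { [ "$y" -eq "$end_y" ] && [ "$m" -le "$end_m" ]; }; do
    printf '%d-%02d\n' "$y" "$m"
    m=$((m + 1))
    if [ "$m" -gt 12 ]; then
      m=1
      y=$((y + 1))
    fi
  done
}

# Scrape each month in turn, stopping at the first failure so that it can
# be retried without repeating the earlier months.
scrape_months() {
  local ym
  for ym in $(month_range "$@"); do
    echo "$ym"
    ruby ca_bc_scraper.rb -q -- date "$ym" || return
  done
}
```

For example, `scrape_months 2011 7 "$(date +%Y)" "$(date +%m)"` would cover July 2011 through the current month.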

Download the attachments for responses (over 40 GB as of late 2015):

ruby ca_bc_scraper.rb -a download --no-cache

Determine which attachments definitely require OCR:

ruby ca_bc_scraper.rb -a compress

Upload the attachments as archives to S3:

AWS_BUCKET=… AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… ruby ca_bc_scraper.rb -a upload
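
Because the upload depends on three environment variables, a preflight check can fail fast before any archive is built. This is a hypothetical helper, not part of the repository; `require_env` is an illustrative name, and it relies on bash indirect expansion.

```shell
# Succeed only if every named environment variable is set and non-empty.
# Uses bash indirect expansion (${!name}), so run it with bash.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

With the variables exported, `require_env AWS_BUCKET AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && ruby ca_bc_scraper.rb -a upload` only starts the upload when all three are present.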

Newfoundland and Labrador

Note: Newfoundland and Labrador publishes an incorrect number of pages for about one in ten files. We therefore calculate the correct value.
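
As an illustration of recounting pages, the `pdfinfo` tool from poppler (installed above) reports a `Pages:` line that can be parsed directly. This is a sketch, not necessarily how the repository's scripts count pages; the function names are hypothetical, and it assumes the attachment is a PDF.

```shell
# Extract the page count from `pdfinfo` output read on standard input.
pages_from_pdfinfo() {
  awk '/^Pages:/ { print $2 }'
}

# Print the number of pages in a PDF, using pdfinfo from poppler.
pdf_pages() {
  pdfinfo "$1" | pages_from_pdfinfo
}
```

The parsing is kept in its own function so that it can be checked against sample `pdfinfo` output without a PDF at hand.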

Download the metadata for responses:

ruby ca_nl_scraper.rb

Download the attachments for responses:

ruby ca_nl_scraper.rb -a download --no-cache

Determine which attachments definitely require OCR:

ruby ca_nl_scraper.rb -a compress

Upload the attachments as archives to S3:

AWS_BUCKET=… AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… ruby ca_nl_scraper.rb -a upload

Municipalities

Canada

The following scripts are only relevant to automating informal requests for disclosed records from Canada.

Get the alternate names of organizations to make corrections:

bundle exec rake ca:federal_identity_program > support/federal_identity_program.yml

Get the abbreviations of organizations to match across datasets:

bundle exec rake ca:abbreviations > support/abbreviations.yml

Get organizations' emails from the coordinators page:

bundle exec rake ca:emails:coordinators_page > support/emails_coordinators_page.yml

Get organizations' emails from the search page:

bundle exec rake ca:emails:search_page > support/emails_search_page.yml

Compare organizations' emails from different sources:

bundle exec rake ca:emails:compare > support/mismatches.csv

Build a histogram of the number of summaries per organization:

bundle exec rake ca:histogram

Construct the URL of the web form of each summary:

bundle exec rake ca:urls:get > support/urls.yml

Compare the constructed URLs to the search page's URLs:

bundle exec rake ca:urls:validate

Adding a new jurisdiction

Notes

This project does not publish all data elements published by jurisdictions, primarily because they are of low value, hard to normalize, or unique to a jurisdiction.

Reference

statutes.csv
The names and URLs of all current freedom of information statutes in Canada.
keywords.csv
The keywords used to refer to freedom of information in Canada.

Resources

Ratings:

Policies:

Nomenclature

In terms of the prevalence of FOI versus ATI:

In other words, use whatever term you prefer.

Copyright (c) 2015 James McKinney, released under the MIT license