All government bodies in Canada are subject to some freedom of information statutes. Some bodies publish summaries of completed information requests. Fewer publish responses to completed information requests. This repository contains scripts for aggregating what is available.
brew install media-info libtiff poppler
sudo PIP_REQUIRE_VIRTUALENV=false pip install csvkit
See also the dependencies of docsplit and pdfshaver. You may need to use this Homebrew formula for PDFium (see PR).
The following scripts should be run in order to collect the dataset.
Download the single-file sources to the wip/ directory:
PYTHONWARNINGS=ignore bundle exec rake datasets:download
Or, download one jurisdiction:
bundle exec rake datasets:download jurisdiction=ca
Run the British Columbia, Newfoundland and Labrador, and municipal scripts to download the multiple-file sources.
Normalize the summaries to the summaries directory:
bundle exec rake datasets:normalize
Or, normalize one jurisdiction:
bundle exec rake datasets:normalize jurisdiction=ca
Reconcile Newfoundland and Labrador's scraped data with its open data, rewriting its files in the summaries directory:
ruby ca_nl_scraper.rb -v -a reconcile
Validate values according to jurisdiction-specific rules:
bundle exec rake datasets:validate:values
Validate that the decision and the number of pages agree:
bundle exec rake datasets:validate:datasets
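As an illustration of what "the decision and the number of pages agree" can mean, here is a minimal sketch of such a check. It is not the rake task's implementation: the column names (decision, number_of_pages) and the decision labels are assumptions made for the example, and the project's normalized values may differ.

```ruby
# Hypothetical decision labels; the project's normalized values may differ.
DECISIONS_WITHOUT_PAGES = ['nothing disclosed', 'no records exist'].freeze
DECISIONS_WITH_PAGES = ['all disclosed', 'disclosed in part'].freeze

# Returns an error message if the row is inconsistent, or nil if it is fine.
# `row` is a Hash with assumed keys 'decision' and 'number_of_pages'.
def decision_pages_error(row)
  decision = row['decision']
  pages = row['number_of_pages']
  # Nothing to compare if either value is missing.
  return nil if decision.nil? || pages.nil?

  pages = Integer(pages)
  if DECISIONS_WITHOUT_PAGES.include?(decision) && pages > 0
    "#{decision.inspect} but #{pages} pages"
  elsif DECISIONS_WITH_PAGES.include?(decision) && pages.zero?
    "#{decision.inspect} but 0 pages"
  end
end
```

A row for which the method returns a message is a candidate for manual review rather than automatic correction, since either field could be the wrong one.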
To find additional sources, search for datasets across multiple catalogs with Namara.io:
rake datasets:search query="freedom of information"
Note: British Columbia sometimes publishes an incorrect file size. We therefore calculate the correct value.
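A minimal sketch of the idea, assuming the attachment has already been downloaded to disk: compare the published size with the size of the file itself and keep the latter when they differ. The method name and the published_size parameter are hypothetical and are not taken from ca_bc_scraper.rb.

```ruby
# Returns the size on disk in bytes when it differs from the published size,
# or nil when the two agree. `published_size` is a hypothetical input that
# may be an Integer or a numeric String.
def corrected_size(path, published_size)
  actual = File.size(path)
  actual == published_size.to_i ? nil : actual
end
```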
Download the metadata for responses:
ruby ca_bc_scraper.rb
openinfo.bc.ca sometimes redirects to another page and then back to the original page, which then returns HTTP 200. However, the cache has already stored an HTTP 302 response for the original page, so the script reaches its redirect limit. If a FaradayMiddleware::RedirectLimitReached error occurs, it is simplest to temporarily move the _cache directory. To avoid losing time due to a late error, it is best to scrape and import one month at a time.
for month in {7..12}; do echo 2011-$month; ruby ca_bc_scraper.rb -q -- date 2011-$month; done
for year in {2012..2015}; do for month in {1..12}; do echo $year-$month; ruby ca_bc_scraper.rb -q -- date $year-$month; done; done
ruby ca_bc_scraper.rb -q -- date `date +%Y-%m`
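The shell loops above keep going after a failed month. As a sketch of an alternative, the same month-by-month run can be driven from Ruby so that it stops at the first failure and reports where to resume. The method names and the injectable runner are assumptions made for this example; only the command line itself comes from the loops above.

```ruby
require 'date'

# Builds the list of unpadded YYYY-M strings between two months, inclusive,
# matching the format produced by the shell loops (e.g. "2011-7").
def month_range(first, last)
  months = []
  current = Date.new(first.year, first.month, 1)
  stop = Date.new(last.year, last.month, 1)
  while current <= stop
    months << "#{current.year}-#{current.month}"
    current = current >> 1 # advance by one month
  end
  months
end

# Default runner: invokes the scraper for one month; truthy on success.
SCRAPE_MONTH = lambda do |month|
  system('ruby', 'ca_bc_scraper.rb', '-q', '--', 'date', month)
end

# Runs the scraper once per month, stopping at the first failure.
# Returns the month that failed, or nil if every month succeeded.
# `runner` is injectable so the loop can be exercised without scraping.
def scrape_months(months, runner: SCRAPE_MONTH)
  months.each do |month|
    puts month
    return month unless runner.call(month)
  end
  nil
end

# Example (not run here):
#   failed = scrape_months(month_range(Date.new(2011, 7), Date.new(2015, 12)))
#   warn "Resume from #{failed}" if failed
```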
Download the attachments for responses (over 40 GB as of late 2015):
ruby ca_bc_scraper.rb -a download --no-cache
Determine which attachments definitely require OCR:
ruby ca_bc_scraper.rb -a compress
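One way to decide that an attachment "definitely" requires OCR is to check whether it has any extractable text at all. The sketch below does this with Poppler's pdftotext, which is among the dependencies listed earlier. It is an assumption about how such a check could work, not a description of what the compress action does, and the method names are hypothetical.

```ruby
require 'open3'

# True if the extracted text contains nothing but whitespace and page breaks
# (pdftotext emits a form feed per page even when a page has no text).
def blank_text?(text)
  text.scrub.strip.empty?
end

# Runs `pdftotext FILE -` (text to standard output) and reports whether the
# PDF has no text layer. A PDF with some text may still need OCR for scanned
# pages, so a false result here is not conclusive.
def definitely_needs_ocr?(path)
  text, status = Open3.capture2('pdftotext', path, '-')
  raise "pdftotext failed for #{path}" unless status.success?

  blank_text?(text)
end
```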
Upload the attachments as archives to S3:
AWS_BUCKET=… AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… ruby ca_bc_scraper.rb -a upload
Note: Newfoundland and Labrador publishes an incorrect number of pages for about one in ten files. We therefore calculate the correct value.
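As a rough sketch of how a page count can be recalculated from a downloaded PDF, the example below parses the output of Poppler's pdfinfo, which prints a "Pages:" line. The scraper may well use a different tool (docsplit and pdfshaver are mentioned among the dependencies), so treat the method names and the approach as assumptions.

```ruby
require 'open3'

# Extracts the page count from pdfinfo's output, or nil if it is absent.
def pages_from_pdfinfo(output)
  match = output.match(/^Pages:\s+(\d+)/)
  match && match[1].to_i
end

# Runs pdfinfo on a file and returns its page count.
def count_pages(path)
  output, status = Open3.capture2('pdfinfo', path)
  raise "pdfinfo failed for #{path}" unless status.success?

  pages_from_pdfinfo(output)
end
```

The calculated value can then be compared with the published one, and only the disagreeing records need to be rewritten.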
Download the metadata for responses:
ruby ca_nl_scraper.rb
Download the attachments for responses:
ruby ca_nl_scraper.rb -a download --no-cache
Determine which attachments definitely require OCR:
ruby ca_nl_scraper.rb -a compress
Upload the attachments as archives to S3:
AWS_BUCKET=… AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… ruby ca_nl_scraper.rb -a upload
Halifax: Download summaries:
ruby ca_ns_halifax_scraper.rb
Markham: Download summaries and documents:
ruby ca_on_markham_scraper.rb
ruby ca_on_markham_scraper.rb -a download --no-cache
Ottawa: Download summaries:
ruby ca_on_ottawa_scraper.rb
The following scripts are only relevant to automating informal requests for disclosed records from Canada.
Get the alternate names of organizations to make corrections:
bundle exec rake ca:federal_identity_program > support/federal_identity_program.yml
Get the abbreviations of organizations to match across datasets:
bundle exec rake ca:abbreviations > support/abbreviations.yml
Get organizations' emails from the coordinators page:
bundle exec rake ca:emails:coordinators_page > support/emails_coordinators_page.yml
Get organizations' emails from the search page:
bundle exec rake ca:emails:search_page > support/emails_search_page.yml
Compare organizations' emails from different sources:
bundle exec rake ca:emails:compare > support/mismatches.csv
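To make the comparison concrete, here is a minimal sketch that assumes each source has been loaded into a Hash mapping an organization identifier to an email address. That structure is an assumption; the actual layout of the YAML files in support/ is not described here, and the rake task may compare them differently.

```ruby
# Returns [organization, email_a, email_b] rows for organizations that both
# sources list but disagree on, ignoring case and surrounding whitespace.
def email_mismatches(source_a, source_b)
  shared = (source_a.keys & source_b.keys).sort
  shared.each_with_object([]) do |org, rows|
    a = source_a[org]
    b = source_b[org]
    rows << [org, a, b] unless a.strip.casecmp?(b.strip)
  end
end
```

Organizations that appear in only one source are left out here; listing them separately would be a natural extension.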
Build a histogram of number of summaries per organization:
bundle exec rake ca:histogram
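For readers unfamiliar with the idea, a text histogram of this kind can be sketched in a few lines. The example assumes rows are Hashes with an 'organization' key, which is a hypothetical column name, and it is not the rake task's own code.

```ruby
# Counts rows per organization and returns [organization, count] pairs,
# largest first, with ties broken alphabetically.
def histogram(rows, key: 'organization')
  counts = Hash.new(0)
  rows.each { |row| counts[row[key].to_s] += 1 }
  counts.sort_by { |org, count| [-count, org] }
end

# Renders each pair as a line with a bar scaled to `width` characters.
def render(pairs, width: 40)
  max = pairs.map(&:last).max || 1
  pairs.map do |org, count|
    bar = '#' * [(count * width.to_f / max).round, 1].max
    format('%-10s %5d %s', org, count, bar)
  end
end

# Example (not run here):
#   puts render(histogram(rows))
```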
Construct the URL of the web form of each summary:
bundle exec rake ca:urls:get > support/urls.yml
Compare the constructed URLs to the search page's URLs:
bundle exec rake ca:urls:validate
… requests_source.rb
Run rake datasets:download
… wip, and add an entry to TEMPLATES
… NON_CSV_SOURCES
… *_normalize method
Run rake datasets:normalize and make corrections if necessary
… RE_INVALID and RE_DECISIONS
… integer_formatter on values if possible
… summaries, and add an entry to datasets:validate:values
Run rake datasets:validate:values and make corrections if necessary
… identifiers.md
… summaries, and add position to the entry in TEMPLATES if possible
Run rake datasets:normalize if position was added
This project does not publish all data elements published by jurisdictions, primarily because they are of low value, hard to normalize, or unique to a jurisdiction.
Ratings:
Policies:
In terms of the prevalence of FOI versus ATI:
In other words, use whatever term you prefer.
Copyright (c) 2015 James McKinney, released under the MIT license