This repo crawls The National Archives (hereafter TNA) for these closed sites:
This solution is part crawler, part CSV-based. We get occasional spreadsheets
from CMA as .xslx
files. These are exported as CSV and placed in
./sheets.
The crawlers should be run first, followed by augment_from_sheet
, then any
body generators.
There are executable scripts in bin
that will run the crawlers. By default,
cases will be saved to a dir _output
relative to whatever dir you run them
from. For example, running
bin/crawl_cc
...from here will produce an _output
directory with one JSON file per case and
one directory named for the case containing any PDFs associated with that case.
The crawlers are whitelist crawlers - they only follow links we say are of
interest via Anemone's focus_crawl
. In the case of both CC and OFT sites we
can tell for any URL in the page what type of thing it will link to, and
that makes the crawl time significantly quicker - we can tell whether to follow
an href
without having to dereference it. The downside is some fairly funky
regular expressions in each Crawler
class. Sorry about that, clarity fans.
crawl_cc
Crawls the CC site. It collates body copy from different pages into a
markup_sections
hash. When finished, you should run generate_cc_bodies
.
crawl_oft_mergers
Crawls the mergers for years from the start URLs in CMA::OFT::Mergers::Crawler
.
it creates JSON files with summaries and links to the PDF assets describing the decisions. No body is created.
it collates markup sections describing the decisions in a similar manner to the CC crawler. We will need a body generator for these mergers, so we'll need to come up with some rules.
crawl_oft_current
Crawls competition/cartels, markets and consumer enforcement cases. Note that we will need to remove markets from this and put them in their own crawler, as only 14 closed markets cases will go over.
crawl_oft_completed
Adds to output from crawl_current
by looking at the completed pages. Some
cases exist on these pages that aren't listed in the year case lists at
crawl_oft_current
.
generate_cc_bodies
Generates summary
and body
for each piece of CC json from collated
markup sections according to some formatting rules.
generate_mergers_bodies
TBD
For current status of work, please check the Issues.