rgarner closed 9 years ago
Paging @adammaddison
I'm so on it...
Emailed Ruth.
From Ruth: I’m not sure what to say about the URLs below as I’m not sure why any of them need scraping (I don’t think they were listed on the sheets I sent over).
Close?
The crawlers start at various entry points, and we had to start writing them long before we got our first spreadsheet. If these URLs are not important, we can close this. Today is the first time we've had all the draft spreadsheets wired in, so only now can we identify the gaps between our crawl strategy and what's in them. At worst, the crawlers should crawl more than what's in the sheets, never less.
(Put another way, the spreadsheets are a post-processing augmentation step in which we overwrite some of what we crawled with the curated values.)
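For anyone following along, the augment step and the gap check described above could look roughly like this. It is only a sketch: the function names `augment` and `find_gaps`, the dict-keyed-by-URL shape, the field names, and the example.org URLs are all assumptions for illustration, not the project's actual code.

```python
def augment(crawled, curated):
    """Overlay curated spreadsheet values on top of crawled records.

    Both arguments are dicts keyed by URL, each value a dict of fields.
    Curated fields win; crawled-only records pass through untouched.
    """
    merged = {url: dict(fields) for url, fields in crawled.items()}
    for url, fields in curated.items():
        if url in merged:
            merged[url].update(fields)
    return merged


def find_gaps(crawled, curated):
    """Return spreadsheet URLs the crawl never reached (should be empty)."""
    return sorted(set(curated) - set(crawled))


# Hypothetical data, not taken from the real sheets.
crawled = {
    "http://example.org/cases/a": {"title": "Case A (scraped)", "state": "open"},
    "http://example.org/cases/b": {"title": "Case B"},
}
curated = {
    "http://example.org/cases/a": {"title": "Case A"},
    "http://example.org/cases/c": {"title": "Case C"},
}

merged = augment(crawled, curated)
missing = find_gaps(crawled, curated)
```

Here `merged` keeps the scraped `state` for case A but takes the curated `title`, and `missing` flags case C as something the crawl strategy needs to pick up.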
Check the source for a list of URLs along with reasons why we're currently ignoring them (some reasons may be invalid):
Does not exist in TNA at this or other timestamps
http://www.oft.gov.uk/OFTwork/oft-current-cases/competition-case-list-2014/?Order=Date&currentLetter=A
Does not exist in TNA at this or other timestamps
http://www.oft.gov.uk/OFTwork/oft-current-cases/consumer-case-list-2013/air-travel
Not in current or completed cases pages, or linked to from anywhere except other case pages (Pegasus)
http://www.oft.gov.uk/OFTwork/consumer-enforcement/consumer-enforcement-completed/retirement-homes/
FIXME: No way to handle this - nested, not single CASE_DETAIL
http://www.oft.gov.uk/OFTwork/markets-work/hombuilding-updates
FIXME: No way to handle this - nested, not single CASE_DETAIL
http://www.oft.gov.uk/OFTwork/markets-work/secondhandcarsqanda
FIXME: Generalised help document
http://www.oft.gov.uk/OFTwork/markets-work/QandAs
FIXME: Generalised help document
http://www.oft.gov.uk/OFTwork/markets-work/homebuying-and-selling-QandAs
FIXME: Generalised help document
http://www.oft.gov.uk/OFTwork/markets-work/consumer-contracts-QandAs
FIXME: Generalised help document
http://www.oft.gov.uk/OFTwork/markets-work/market-studies-further-info
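One way to keep a list like this reviewable is to hold each ignored URL next to its reason and filter candidate URLs through it. The sketch below assumes that approach; `IGNORED`, `normalise` and `partition` are hypothetical names, only a few of the entries above are included, and the real source may well structure this differently.

```python
def normalise(url):
    """Strip a trailing slash so both spellings of a URL compare equal."""
    return url.rstrip("/")


# A subset of the ignore list above, each URL paired with its stated reason.
_RAW_IGNORED = {
    "http://www.oft.gov.uk/OFTwork/consumer-enforcement/consumer-enforcement-completed/retirement-homes/":
        "Not in current or completed cases pages",
    "http://www.oft.gov.uk/OFTwork/markets-work/QandAs":
        "Generalised help document",
    "http://www.oft.gov.uk/OFTwork/markets-work/secondhandcarsqanda":
        "Generalised help document",
}
IGNORED = {normalise(url): reason for url, reason in _RAW_IGNORED.items()}


def partition(urls, ignored=IGNORED):
    """Split urls into (to_crawl, skipped); skipped holds (url, reason) pairs."""
    to_crawl, skipped = [], []
    for url in urls:
        reason = ignored.get(normalise(url))
        if reason is None:
            to_crawl.append(url)
        else:
            skipped.append((url, reason))
    return to_crawl, skipped


to_crawl, skipped = partition([
    "http://www.oft.gov.uk/OFTwork/markets-work/QandAs/",
    "http://www.oft.gov.uk/OFTwork/markets-work/",
])
```

Keeping the reason alongside each skipped URL means any reason that turns out to be invalid can be found and dropped without hunting through the crawler code.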