Find out if OFT URLs we're explicitly ignoring are important

rgarner commented 9 years ago

Check the source for a list of URLs along with reasons why we're currently ignoring them (some reasons may be invalid):

Does not exist in TNA at this or other timestamps

http://www.oft.gov.uk/OFTwork/oft-current-cases/competition-case-list-2014/?Order=Date&currentLetter=A

Does not exist in TNA at this or other timestamps

http://www.oft.gov.uk/OFTwork/oft-current-cases/consumer-case-list-2013/air-travel

Not in current or completed cases pages, or linked to from anywhere except other case pages (Pegasus)

http://www.oft.gov.uk/OFTwork/consumer-enforcement/consumer-enforcement-completed/retirement-homes/

rgarner commented 9 years ago

Paging @adammaddison

adammaddison commented 9 years ago

I'm so on it...

adammaddison commented 9 years ago

Emailed Ruth.

adammaddison commented 9 years ago

From Ruth: I’m not sure what to say about the URLs below as I’m not sure why any of them need scraping (I don’t think they were listed on the sheets I sent over).

Close?

rgarner commented 9 years ago

The crawlers start at various entry points, and we had to start writing them long before we got our first spreadsheet. If these URLs are not important, we can close this. Today is the first time we've had all the draft spreadsheets wired in, so we can now identify the gaps between our crawl strategy and what's in them. At worst, the crawlers should crawl more than what's in the sheets, but never less.
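
As a rough illustration of the "more than the sheets, but never less" check, here is a minimal Python sketch. It is not code from this repo: the function name `coverage_gaps`, the trailing-slash normalisation, and the example.org URLs are all assumptions made for the example.

```python
def normalise(url):
    """Assumed normalisation: trim whitespace and any trailing slash."""
    return url.strip().rstrip("/")


def coverage_gaps(crawled_urls, sheet_urls):
    """Return the sheet URLs that the crawl did not reach (hypothetical helper).

    An empty result means the crawl covers at least everything in the sheets.
    """
    crawled = {normalise(u) for u in crawled_urls}
    missing = {normalise(u) for u in sheet_urls} - crawled
    return sorted(missing)


# Made-up data, only to show the shape of the check.
crawled = ["http://example.org/cases/a/", "http://example.org/cases/b"]
sheets = ["http://example.org/cases/a", "http://example.org/cases/c"]
print(coverage_gaps(crawled, sheets))
```

Extra URLs on the crawl side are deliberately ignored here; only sheet URLs missing from the crawl are reported as gaps.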

rgarner commented 9 years ago

(Put another way, the spreadsheets are a post-processing augmentation step where we overwrite some of what we crawled with the curated values.)
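
A minimal sketch of that overwrite step in Python, assuming crawled and curated records are both keyed by URL. The name `augment_records`, the field names, and the rule that blank spreadsheet cells do not overwrite crawled values are assumptions for illustration, not the actual behaviour of the crawlers.

```python
def augment_records(crawled, curated):
    """Overlay curated spreadsheet values on crawled records (hypothetical helper).

    Both arguments map URL -> dict of fields. Every crawled record is kept;
    curated fields replace crawled ones unless the curated value is blank.
    """
    result = {}
    for url, fields in crawled.items():
        overrides = {
            key: value
            for key, value in curated.get(url, {}).items()
            if value not in (None, "")  # assumption: blank cells mean "no override"
        }
        result[url] = {**fields, **overrides}
    return result


# Made-up records, only to show the merge.
crawled = {"http://example.org/case": {"title": "Scraped title", "body": "text"}}
curated = {"http://example.org/case": {"title": "Curated title", "body": ""}}
merged = augment_records(crawled, curated)
print(merged["http://example.org/case"]["title"])
```

Iterating over the crawled records rather than the curated ones keeps pages that the sheets never mention, which matches the idea that the crawl is the superset.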

rgarner / cma-tna-crawlers