rgarner / cma-tna-crawlers

Scraping old cases from TNA for CMA, no TLAs.

Write a mergers crawler #4

Closed rgarner closed 9 years ago

rgarner commented 9 years ago

The bulk of the work, but hopefully the simplest in terms of parsing.

Entry points:

2014: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2014/
2013: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2013/;jsessionid=5418169512CFCA88A3916FF3C3858883
2012: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2012/;jsessionid=C755805D7F9903F573C8F6EA1B26ADEF
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2011/;jsessionid=3906243EDA2970E01139982FA49A7834
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2010/;jsessionid=DE397F83D0D269F74E71C65997D1A87C

These dates are only available on several pages, alphabetically:

http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2009/
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2008/;jsessionid=871674049203A53FB656EDB17E9A1D86
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2007/;jsessionid=00FD32C27E1DCAD5E0B238BE4CBA8152
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2006/;jsessionid=4EBD5FEF7E216B26A8AB0C633A2CE69A
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2005/;jsessionid=E9A8B5875BFD20EE59C79A6AB59C1113
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2004/;jsessionid=F0B8ACA1274985F31581979A866AD88D
http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2003/;jsessionid=00CB1407A796EDDC256354FC245A3949

adammaddison commented 9 years ago

These links to the mergers cases might be easier to use than the ones I sent before.

2013: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2013/;jsessionid=3906243EDA2970E01139982FA49A7834
2012: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2012/;jsessionid=C755805D7F9903F573C8F6EA1B26ADEF
2011: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2011/;jsessionid=3906243EDA2970E01139982FA49A7834
2010: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/decisions/2010/;jsessionid=DE397F83D0D269F74E71C65997D1A87C
2009: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2009/?Order=Date&currentLetter=A
2008: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2008/?Order=Date&currentLetter=A
2007: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2007/?Order=Date&currentLetter=A
2006: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2006/?Order=Date&currentLetter=A
2005: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2005/?Order=Date&currentLetter=A
2004: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2004/?Order=Date&currentLetter=A
2003: http://webarchive.nationalarchives.gov.uk/20140402142426/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2003/?Order=Date&currentLetter=A
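
Since the 2003-2009 listings are split alphabetically, a crawler would presumably need one index URL per letter rather than just the `currentLetter=A` page. A rough sketch of generating them follows. It assumes that `currentLetter` accepts each letter A to Z and that the archive captured every such page, neither of which is confirmed in this thread; `BASE` and `letter_index_urls` are names made up for illustration.

```ruby
# Prefix shared by the 2003-2009 links in this comment.
BASE = 'http://webarchive.nationalarchives.gov.uk/20140402142426/' \
       'http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases'

# Hypothetical helper: one index URL per letter for an old-style year.
# Assumption: currentLetter takes a single upper-case letter, A to Z.
def letter_index_urls(year, letters: ('A'..'Z'))
  letters.map { |letter| "#{BASE}/#{year}/?Order=Date&currentLetter=#{letter}" }
end
```

Calling `letter_index_urls(2009)` would give 26 candidate pages to seed the crawl for that year; letters with no cases would simply yield nothing to follow.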

rgarner commented 9 years ago

Shall be using these (jsessionid removed and timestamp harmonised with spreadsheets):

http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/decisions/2014/
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/decisions/2013/
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/decisions/2012/
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/decisions/2011/
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/decisions/2010/
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2009/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2008/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2007/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2006/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2005/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2004/?Order=Date&currentLetter=A
http://webarchive.nationalarchives.gov.uk/20140402141250/http://www.oft.gov.uk/OFTwork/mergers/Mergers_Cases/2003/?Order=Date&currentLetter=A
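
For anyone repeating the clean-up, the two changes described above (dropping `jsessionid` and harmonising the timestamp) could be applied mechanically. This is only a sketch, not necessarily what the crawler does: `CANONICAL_TIMESTAMP` and `normalise_tna_url` are illustrative names, and it assumes the session id is always a hex string and the snapshot timestamp is always 14 digits, as in the links in this thread.

```ruby
# Timestamp used in the list above.
CANONICAL_TIMESTAMP = '20140402141250'

# Hypothetical helper: removes a ";jsessionid=<hex>" path parameter and
# rewrites the 14-digit TNA snapshot timestamp to the canonical one.
def normalise_tna_url(url)
  url
    .sub(/;jsessionid=[0-9A-F]+/i, '')
    .sub(%r{(?<=webarchive\.nationalarchives\.gov\.uk/)\d{14}(?=/)}, CANONICAL_TIMESTAMP)
end
```

URLs that are already clean pass through unchanged, so the helper can be run over a mixed list safely.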

rgarner commented 9 years ago

Crikey. Really not simple at all. 9 overlapping cases from 2009-2010 really made this hard work:

  ##
  # Nine subpages for cases that started 2009, but which appear
  # at new-style URLs (so would be incorrectly matched as 2010 cases
  # if left alone)
  SUBPAGE_NOT_CASE = %r{
    (london-stock-exchange|go-north-east|Aggregate|Koppers|arriva|
    co-op-psw|ambassador|co-operative1|phs-teacrate)
  }x
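
To illustrate how a pattern like this might be used, here is a small sketch. The constant is copied from the snippet above; the predicate `new_style_case?` and the example paths are hypothetical and are not taken from the crawler itself. Note that the pattern is unanchored and case-sensitive, so it matches any URL containing one of the nine fragments anywhere.

```ruby
# Pattern copied from the snippet above (extended mode ignores the line break).
SUBPAGE_NOT_CASE = %r{
  (london-stock-exchange|go-north-east|Aggregate|Koppers|arriva|
  co-op-psw|ambassador|co-operative1|phs-teacrate)
}x

# Hypothetical predicate: true when a new-style URL should be treated as a
# 2010 case, i.e. it is not one of the nine carried-over 2009 subpages.
def new_style_case?(url)
  !SUBPAGE_NOT_CASE.match?(url)
end

# Example paths below are made up for illustration.
new_style_case?('/OFTwork/mergers/decisions/2010/arriva')         # carried-over subpage
new_style_case?('/OFTwork/mergers/decisions/2010/example-merger') # ordinary 2010 case
```
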

rgarner commented 9 years ago

Probably (mostly, for the purposes of being more specific about things that need to happen to it now) fixed in 1383521533168c539ee394ffa2ae7290cabe87ac

adammaddison commented 9 years ago

Hello.

Sorry haven't been speaking. Busy, interesting times.

Speak Monday.

On 9 Jan 2015 17:57, "Russell Garner" notifications@github.com wrote:

Probably (mostly, for the purposes of being more specific about things that need to happen to it now) fixed in 1383521 https://github.com/rgarner/cma-tna-crawlers/commit/1383521533168c539ee394ffa2ae7290cabe87ac
