rgarner / cma-tna-crawlers

Scraping old cases from TNA for CMA, no TLAs.

There are some duplicate cases in the output #30

Closed rgarner closed 9 years ago

rgarner commented 9 years ago

Because some cases were available at more than one URL, there are duplicates in the output. We were going to use OFT's Ref field to de-dupe, but this field was removed from the spreadsheets.

Report on them to start with, and see if there's a strategy for removing them. Duplicates that occur in markets or mergers will have a body generated for them; duplicates that occur elsewhere won't, so when a case appears in both places, prefer the markets or mergers copy.
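A rough sketch of what that report and the "prefer markets or mergers" rule could look like. This is not code from the repo: the `title`, `case_type` and `original_url` field names, and matching on a normalised title now that the Ref field is gone, are all assumptions made for illustration.

```python
from collections import defaultdict

# Per the issue: only these case types get a body generated for them.
PREFERRED_TYPES = {"markets", "mergers"}


def group_duplicates(cases, key="title"):
    """Group cases by a normalised key; return only groups with more than one case.

    `key` is a hypothetical stand-in for whatever replaces the removed Ref field.
    """
    groups = defaultdict(list)
    for case in cases:
        groups[case[key].strip().lower()].append(case)
    return {k: v for k, v in groups.items() if len(v) > 1}


def pick_preferred(group):
    """Return the markets/mergers copy if there is one, else the first case seen."""
    for case in group:
        if case.get("case_type") in PREFERRED_TYPES:
            return case
    return group[0]


# Made-up sample data; the URLs and titles are placeholders.
cases = [
    {"title": "Example case", "case_type": "consumer-enforcement",
     "original_url": "http://example.invalid/a"},
    {"title": "Example case ", "case_type": "markets",
     "original_url": "http://example.invalid/b"},
    {"title": "Another case", "case_type": "mergers",
     "original_url": "http://example.invalid/c"},
]

duplicates = group_duplicates(cases)
for name, group in duplicates.items():
    keep = pick_preferred(group)
    print(name, [c["original_url"] for c in group], "keep:", keep["original_url"])
```

Matching on title is the weakest part of this; any other field that survives in the spreadsheets and identifies a case would be a better key.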

rgarner commented 9 years ago

Closing: from a preliminary report, I can't see that this is a problem. We're only taking JSON with "modified_by_sheet": true, and these duplicates occur in JSON not modified by the sheets.
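For reference, the check amounts to something like the sketch below. Only the `modified_by_sheet` flag comes from the actual JSON; the `title` key used to spot duplicates and the sample records are assumptions for illustration, not the report that was actually run.

```python
from collections import Counter


def kept_cases(cases):
    """Keep only the cases whose JSON has "modified_by_sheet": true."""
    return [c for c in cases if c.get("modified_by_sheet") is True]


def duplicate_keys(cases, key="title"):
    """Return the sorted values of `key` that appear on more than one case."""
    counts = Counter(c[key] for c in cases)
    return sorted(k for k, n in counts.items() if n > 1)


# Made-up sample data: the duplicate pair is not modified by the sheets.
sample = [
    {"title": "Example case", "modified_by_sheet": False},
    {"title": "Example case", "modified_by_sheet": False},
    {"title": "Another case", "modified_by_sheet": True},
]

all_dupes = duplicate_keys(sample)
kept_dupes = duplicate_keys(kept_cases(sample))
print("duplicates overall:", all_dupes)
print("duplicates among kept cases:", kept_dupes)
```

If the second list is empty, the duplicates never reach the output and nothing needs removing.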