unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Fix non-unique report_ids #165

Closed: divergentdave closed this 9 years ago

divergentdave commented 9 years ago

This PR is for tracking implementation of the first phase of #160. The following scrapers need their derivation of report_id fixed.

For testing, you can run a scraper with --archive --dry_run --quick, and then run unique_report_ids.py.

Note that the above script will catch any duplicate report_ids on disk, as long as the duplicates land in different years. However, if a scraper uses the same report_id twice in the same year, the second report will clobber the first. We'll need something like 0a2d33af192a41cd229c5caa2cf59aad67b34634 from #151 to catch those cases at runtime.
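For illustration, here is a minimal sketch of that kind of whole-corpus check. It is not the actual unique_report_ids.py; the directory layout it assumes (data/&lt;ig&gt;/&lt;year&gt;/&lt;report_id&gt;/report.json) is inferred from the duplicate paths listed later in this thread.

```python
# Hypothetical sketch: walk the data directory and flag report_ids that
# appear under more than one year for the same inspector general.
import os
from collections import defaultdict

DATA_DIR = "data"  # assumed layout: data/<ig>/<year>/<report_id>/report.json

def find_duplicates(data_dir=DATA_DIR):
    seen = defaultdict(list)  # (ig, report_id) -> list of report.json paths
    for ig in sorted(os.listdir(data_dir)):
        ig_dir = os.path.join(data_dir, ig)
        if not os.path.isdir(ig_dir):
            continue
        for year in sorted(os.listdir(ig_dir)):
            year_dir = os.path.join(ig_dir, year)
            if not os.path.isdir(year_dir):
                continue
            for report_id in os.listdir(year_dir):
                report_path = os.path.join(year_dir, report_id, "report.json")
                if os.path.isfile(report_path):
                    seen[(ig, report_id)].append(report_path)
    for (ig, report_id), paths in sorted(seen.items()):
        if len(paths) > 1:
            print("Duplicate report_id %r in %s" % (report_id, ", ".join(paths)))

if __name__ == "__main__":
    find_duplicates()
```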

divergentdave commented 9 years ago

No changes are needed for the State Department OIG scraper; I assume I had stale files from both before and after 18c00d6a128be9ab5c1e2774bd3da0a3e11dfb55. Running with a clean data directory doesn't produce any duplicate IDs.

divergentdave commented 9 years ago

So it turns out that the Office of Evaluation and Inspections pages on the HHS site list the same reports on multiple pages, and not always with the same date information. For example, "Medicare's Reimbursement for Interpretations of Hospital Emergency Room X-Rays" appears on the X page, under X-Rays, with a date, and on the E page, under Emergency Rooms, without a date.

My plan is to add a special case for crawling oei/subject_index.asp, so that we collect all the links on all the pages, group them by URL, look for a date across all of the link text, and then download each report once.
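To make the grouping idea concrete, here is a rough sketch of how that step could look. The date regex and function name are illustrative assumptions, not code from the scraper.

```python
# Hypothetical sketch: group OEI subject-index links by report URL, keep any
# date found in the link text, and handle each report exactly once.
import re
from collections import defaultdict

DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")  # assumed date format

def group_report_links(links):
    """links: iterable of (url, text) pairs scraped from every index page."""
    grouped = defaultdict(lambda: {"texts": [], "date": None})
    for url, text in links:
        entry = grouped[url]
        entry["texts"].append(text)
        if entry["date"] is None:
            match = DATE_RE.search(text)
            if match:
                entry["date"] = match.group(0)
    return grouped
```

Each URL then maps to all of its link texts plus the first date found among them, so the report gets downloaded once with the best metadata available.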

I may be busy over the next several days, so feel free to take this and run with it.

divergentdave commented 9 years ago

Okay, I finished the original list of issues, ran the scrapers again, and found a few more stragglers.

Duplicate report_id 'ctf' in data\lsc\1999\ctf\report.json, data\lsc\2000\ctf\report.json
Duplicate report_id 'ctf' in data\lsc\1999\ctf\report.json, data\lsc\2000\ctf\report.json, data\lsc\2001\ctf\report.json
Duplicate report_id 'ctf' in data\lsc\1999\ctf\report.json, data\lsc\2000\ctf\report.json, data\lsc\2001\ctf\report.json, data\lsc\2002\ctf\report.json
Duplicate report_id 'ctf' in data\lsc\1999\ctf\report.json, data\lsc\2000\ctf\report.json, data\lsc\2001\ctf\report.json, data\lsc\2002\ctf\report.json, data\lsc\2003\ctf\report.json
Duplicate report_id 'NA' in data\nasa\2011\NA\report.json, data\nasa\2013\NA\report.json
Duplicate report_id 'NA' in data\nasa\2011\NA\report.json, data\nasa\2013\NA\report.json, data\nasa\2014\NA\report.json
Duplicate report_id 'NUREG-BR-0304-v1n1' in data\nrc\2000\NUREG-BR-0304-v1n1\report.json, data\nrc\2003\NUREG-BR-0304-v1n1\report.json
Duplicate report_id 'NUREG-BR-0304-v1n2' in data\nrc\2000\NUREG-BR-0304-v1n2\report.json, data\nrc\2003\NUREG-BR-0304-v1n2\report.json
Duplicate report_id 'NUREG-BR-0304-v2n1' in data\nrc\2001\NUREG-BR-0304-v2n1\report.json, data\nrc\2004\NUREG-BR-0304-v2n1\report.json
Duplicate report_id 'NUREG-BR-0304-v3n1' in data\nrc\2002\NUREG-BR-0304-v3n1\report.json, data\nrc\2005\NUREG-BR-0304-v3n1\report.json
Duplicate report_id 'NUREG-BR-0304-v3n2' in data\nrc\2002\NUREG-BR-0304-v3n2\report.json, data\nrc\2005\NUREG-BR-0304-v3n2\report.json
Duplicate report_id 'NUREG-BR-0304-v4n1' in data\nrc\2003\NUREG-BR-0304-v4n1\report.json, data\nrc\2007\NUREG-BR-0304-v4n1\report.json
Duplicate report_id '139264' in data\state\2010\139264\report.json, data\state\2011\139264\report.json
Duplicate report_id '145259' in data\state\2010\145259\report.json, data\state\2012\145259\report.json
Duplicate report_id '145823' in data\state\2010\145823\report.json, data\state\2012\145823\report.json
Duplicate report_id '00-007' in data\fdic\1999\00-007\report.json, data\fdic\2000\00-007\report.json
Duplicate report_id 'oigsemi-03-09' in data\fdic\2003\oigsemi-03-09\report.json, data\fdic\2009\oigsemi-03-09\report.json

divergentdave commented 9 years ago

Okay, that's the last of the checkboxes here. Given how large this PR has gotten, I suggest we cut it off here, review and merge, and take up the second part (checking at runtime for duplicate IDs in the same year) in another branch.
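For that follow-up, the runtime check could be as simple as remembering which (inspector, year, report_id) combinations a run has already written and failing loudly on a repeat. A minimal sketch, under that assumption:

```python
# Hypothetical sketch of a runtime guard against same-year report_id reuse,
# so a second report can't silently clobber the first within one run.
_seen_report_ids = set()

def check_report_id(ig, year, report_id):
    key = (ig, year, report_id)
    if key in _seen_report_ids:
        raise ValueError("Duplicate report_id %r for %s in %s within a single run"
                         % (report_id, ig, year))
    _seen_report_ids.add(key)
```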

parkr commented 9 years ago

Awesome work, @divergentdave! Lots of futzing around with individual issues (like one report being mis-named or improperly linked). Do you think there's a way to extract the "oddities" out of the main code and isolate them, so we have a go-to place to look for strangeness? Just thinking out loud about cleanliness.

konklone commented 9 years ago

@parkr It's a tough problem for unique IDs, because duplicates can't efficiently be detected on the fly, so you can't check per-report at validation time. You have to look at the whole corpus at once, after the reports have been downloaded.

@divergentdave has written an excellent script to do just that, along with several other corpus-analyzing scripts, and I'd like to get them all into this repo and instrumented so that archive managers can run them automatically as ritual QA.

divergentdave commented 9 years ago

As far as the small special cases go, I've done what I can to take out two or three birds with one stone, but I don't foresee any neat overarching abstractions. The code that works around typos on websites could go away if the underlying issues get fixed. Some of the additional logic, like distinguishing between a report and a follow-up report with the same number, boils down to human intuition. Plus, it's often important that these tweaks go in at one particular point in the code, i.e. after the report_id has first been assigned but before it gets used.
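One way to picture that placement is a small per-scraper override table consulted right after the report_id is derived. This is only an illustration of the idea (the table contents and function name are made up), not code from this PR.

```python
# Hypothetical illustration: corrections applied immediately after the
# report_id is derived, before it is used to build paths or URLs.
REPORT_ID_OVERRIDES = {
    # landing-page URL -> corrected report_id (entries are illustrative)
    "https://example.gov/oig/report-123.htm": "report-123-followup",
}

def fix_report_id(report_id, landing_url):
    return REPORT_ID_OVERRIDES.get(landing_url, report_id)
```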

konklone commented 9 years ago

Thanks again for doing the legwork here, @divergentdave -- really happy to see this level of data QA on the project.