unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Enforce uniqueness of report_id #151

divergentdave closed 10 years ago

parkr commented 10 years ago

Looks great! Might be worth fetching them from the filesystem if possible, too. :+1:

konklone commented 10 years ago

This really only works if the entire IG is being archived at once. For example, if ./inspectors/usps.py were run with the default arguments, it would only cover the current year, 2014, and so the built-up dict would only verify that report_ids within the current year were unique.
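To illustrate the limitation, here is a minimal sketch of an in-run check. The function name and the report dicts are made up for illustration and are not the project's actual code; the point is only that a dict built during one run cannot see reports that fall outside that run's year range.

```python
def assert_unique_report_ids(reports):
    """Raise ValueError if two reports within this single run share a report_id."""
    seen = {}
    for report in reports:
        report_id = report["report_id"]
        if report_id in seen:
            raise ValueError("Duplicate report_id %r" % report_id)
        seen[report_id] = report


# Illustrative data: the same ID filed under two different years.
run_2013 = [{"report_id": "005-2013", "year": 2013}]
run_2014 = [{"report_id": "005-2013", "year": 2014}]

assert_unique_report_ids(run_2013)  # passes
assert_unique_report_ids(run_2014)  # passes: this run never sees the 2013 report
# Only a run covering both years at once would raise:
# assert_unique_report_ids(run_2013 + run_2014)
```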

I think the right approach is as @parkr suggests: that it analyze the state of the filesystem. I had envisioned this as a separate task that runs holistically against the entire current contents on disk, and not something run during the scrape. In some ways, it's easier architecturally to just build it into the scrape, and make it known that by running an archival scrape you can get this behavior -- but it also breaks the implicit guarantee of idempotency that the scrapers otherwise have. In other words, you could run a scraper once with some group of args (X), and it's fine, then run it again with args Y, and it's fine, but then run it with args X again, and suddenly there's an error.

I think there's room in the architecture for a separate validation/analytics script, that looks at the contents of disk and gives you some advice. Analyzing uniqueness of report_ids is a great place to start with that.
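A rough sketch of what such a standalone check might look like follows. It assumes the on-disk layout is `<data_dir>/<inspector>/<year>/<report_id>/report.json` and that each JSON file carries a `report_id` field; the function name is hypothetical, and treating IDs as unique per inspector (rather than globally) is also an assumption.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_report_ids(data_dir):
    """Return {(inspector, report_id): [paths]} for IDs found in more than one file."""
    seen = defaultdict(list)
    # Assumed layout: <data_dir>/<inspector>/<year>/<report_id>/report.json
    for path in sorted(Path(data_dir).glob("*/*/*/report.json")):
        inspector = path.relative_to(data_dir).parts[0]
        with path.open(encoding="utf-8") as f:
            report = json.load(f)
        # Fall back to the directory name if the JSON has no report_id field.
        report_id = report.get("report_id", path.parent.name)
        seen[(inspector, report_id)].append(path)
    # Keep only the IDs that appear in more than one file.
    return {key: paths for key, paths in seen.items() if len(paths) > 1}
```

Calling `find_duplicate_report_ids("data")` and printing each entry would give a collision report without touching the scrapers, so repeated scrapes stay idempotent.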

divergentdave commented 10 years ago

I wrote a quick and dirty script to check for reports with colliding report_ids from on-disk JSON files. Here's the current list of collisions:

Duplicate report_id 'IGtestimony110302' in data\agriculture\2003\IGtestimony110302\report.json, data\agriculture\2011\IGtestimony110302\report.json
Duplicate report_id '005-2013' in data\amtrak\2013\005-2013\report.json, data\amtrak\2014\005-2013\report.json
Duplicate report_id 'OIG-11-018-T' in data\commerce\2010\OIG-11-018-T\report.json, data\commerce\2011\OIG-11-018-T\report.json
Duplicate report_id 'DA-07-04' in data\dhs\2003\DA-07-04\report.json, data\dhs\2006\DA-07-04\report.json
Duplicate report_id 'DD-07-04' in data\dhs\2003\DD-07-04\report.json, data\dhs\2007\DD-07-04\report.json
Duplicate report_id 'DD-07-05' in data\dhs\2005\DD-07-05\report.json, data\dhs\2007\DD-07-05\report.json
Duplicate report_id 'DD-07-06' in data\dhs\2006\DD-07-06\report.json, data\dhs\2007\DD-07-06\report.json
Duplicate report_id 'DA-08-04' in data\dhs\2004\DA-08-04\report.json, data\dhs\2008\DA-08-04\report.json
Duplicate report_id 'OIG-08-18' in data\dhs\2008\OIG-08-18\report.json, data\dhs\2010\OIG-08-18\report.json
Duplicate report_id 'DA-12-04' in data\dhs\2004\DA-12-04\report.json, data\dhs\2012\DA-12-04\report.json
Duplicate report_id 'DA-13-04' in data\dhs\2004\DA-13-04\report.json, data\dhs\2012\DA-13-04\report.json
Duplicate report_id 'DS-13-05' in data\dhs\2005\DS-13-05\report.json, data\dhs\2013\DS-13-05\report.json
Duplicate report_id 'OIG-13-81' in data\dhs\2012\OIG-13-81\report.json, data\dhs\2013\OIG-13-81\report.json
Duplicate report_id 'A17D0002' in data\education\2002\A17D0002\report.json, data\education\2003\A17D0002\report.json
Duplicate report_id '2007-03-amp' in data\eeoc\2006\2007-03-amp\report.json, data\eeoc\2007\2007-03-amp\report.json
Duplicate report_id 'Semiannual Report to Congress - October 1, 2010 to March 31, 2011' in data\exim\2010\Semiannual Report to Congress - October 1, 2010 to March 31, 2011\report.json, data\exim\2011\Semiannual Report to Congress - October 1, 2010 to March 31, 2011\report.json
Duplicate report_id 'Semiannual Report to Congress - October 1, 2011 to March 31, 2012' in data\exim\2011\Semiannual Report to Congress - October 1, 2011 to March 31, 2012\report.json, data\exim\2012\Semiannual Report to Congress - October 1, 2011 to March 31, 2012\report.json
Duplicate report_id 'OIG-AR-12-02' in data\exim\2011\OIG-AR-12-02\report.json, data\exim\2013\OIG-AR-12-02\report.json
Duplicate report_id 'OIG_Report_Spring14_final_508' in data\exim\2013\OIG_Report_Spring14_final_508\report.json, data\exim\2014\OIG_Report_Spring14_final_508\report.json
Duplicate report_id 'oai-07-86-00079' in data\hhs\1986\oai-07-86-00079\report.json, data\hhs\1987\oai-07-86-00079\report.json
Duplicate report_id 'oei-05-90-00720' in data\hhs\1987\oei-05-90-00720\report.json, data\hhs\1990\oei-05-90-00720\report.json
Duplicate report_id 'oei-02-89-01490' in data\hhs\1989\oei-02-89-01490\report.json, data\hhs\1993\oei-02-89-01490\report.json
Duplicate report_id 'oei-07-91-01470' in data\hhs\1992\oei-07-91-01470\report.json, data\hhs\1994\oei-07-91-01470\report.json
Duplicate report_id 'oei-01-97-00050' in data\hhs\1997\oei-01-97-00050\report.json, data\hhs\1999\oei-01-97-00050\report.json
Duplicate report_id 'oei-01-97-00051' in data\hhs\1997\oei-01-97-00051\report.json, data\hhs\1999\oei-01-97-00051\report.json
Duplicate report_id 'oei-01-97-00052' in data\hhs\1997\oei-01-97-00052\report.json, data\hhs\1999\oei-01-97-00052\report.json
Duplicate report_id 'oei-01-97-00053' in data\hhs\1997\oei-01-97-00053\report.json, data\hhs\1999\oei-01-97-00053\report.json
Duplicate report_id 'oei-02-97-00522' in data\hhs\1997\oei-02-97-00522\report.json, data\hhs\1999\oei-02-97-00522\report.json
Duplicate report_id 'oei-09-97-00121' in data\hhs\1997\oei-09-97-00121\report.json, data\hhs\1999\oei-09-97-00121\report.json
Duplicate report_id 'oei-09-97-00122' in data\hhs\1997\oei-09-97-00122\report.json, data\hhs\1999\oei-09-97-00122\report.json
Duplicate report_id 'oei-01-97-00054' in data\hhs\1997\oei-01-97-00054\report.json, data\hhs\2000\oei-01-97-00054\report.json
Duplicate report_id 'oei-01-99-00160' in data\hhs\1999\oei-01-99-00160\report.json, data\hhs\2000\oei-01-99-00160\report.json
Duplicate report_id 'oei-02-97-00527' in data\hhs\1997\oei-02-97-00527\report.json, data\hhs\2000\oei-02-97-00527\report.json
Duplicate report_id 'oei-02-99-00340' in data\hhs\1999\oei-02-99-00340\report.json, data\hhs\2000\oei-02-99-00340\report.json
Duplicate report_id 'oei-05-99-00290' in data\hhs\1999\oei-05-99-00290\report.json, data\hhs\2000\oei-05-99-00290\report.json
Duplicate report_id 'oei-09-99-00550' in data\hhs\1999\oei-09-99-00550\report.json, data\hhs\2000\oei-09-99-00550\report.json
Duplicate report_id 'oei-09-89-00330' in data\hhs\1991\oei-09-89-00330\report.json, data\hhs\2003\oei-09-89-00330\report.json
Duplicate report_id 'oei-03-00-00031' in data\hhs\2002\oei-03-00-00031\report.json, data\hhs\2004\oei-03-00-00031\report.json
Duplicate report_id 'oei-02-91-00210' in data\hhs\1992\oei-02-91-00210\report.json, data\hhs\2005\oei-02-91-00210\report.json
Duplicate report_id 'oei-02-91-01510' in data\hhs\1992\oei-02-91-01510\report.json, data\hhs\2005\oei-02-91-01510\report.json
Duplicate report_id 'oei-03-91-00711' in data\hhs\1991\oei-03-91-00711\report.json, data\hhs\2005\oei-03-91-00711\report.json
Duplicate report_id '2009-AT-1001' in data\hud\2008\2009-AT-1001\report.json, data\hud\2009\2009-AT-1001\report.json
Duplicate report_id "Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress" in data\itc\2012\Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress\report.json, data\itc\2013\Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress\report.json
Duplicate report_id "Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress" in data\itc\2012\Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress\report.json, data\itc\2013\Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress\report.json, data\itc\2014\Semiannual-Report-and-Chairman's-Transmittal-Letter-to-Congress\report.json
Duplicate report_id 'OIG-09-01' in data\neh\2008\OIG-09-01\report.json, data\neh\2009\OIG-09-01\report.json
Duplicate report_id 'OIG-11-03' in data\neh\2010\OIG-11-03\report.json, data\neh\2011\OIG-11-03\report.json
Duplicate report_id 'OIG-13-01' in data\neh\2012\OIG-13-01\report.json, data\neh\2013\OIG-13-01\report.json
Duplicate report_id 'OIG-14-01' in data\neh\2013\OIG-14-01\report.json, data\neh\2014\OIG-14-01\report.json
Duplicate report_id 'oig0901' in data\nsf\1992\oig0901\report.json, data\nsf\2008\oig0901\report.json
Duplicate report_id 'N-A' in data\pbgc\2003\N-A\report.json, data\pbgc\2004\N-A\report.json
Duplicate report_id 'N-A' in data\pbgc\2003\N-A\report.json, data\pbgc\2004\N-A\report.json, data\pbgc\2005\N-A\report.json
Duplicate report_id 'N-A' in data\pbgc\2003\N-A\report.json, data\pbgc\2004\N-A\report.json, data\pbgc\2005\N-A\report.json, data\pbgc\2006\N-A\report.json
Duplicate report_id 'N-A' in data\pbgc\2003\N-A\report.json, data\pbgc\2004\N-A\report.json, data\pbgc\2005\N-A\report.json, data\pbgc\2006\N-A\report.json, data\pbgc\2007\N-A\report.json
Duplicate report_id 'EVAL-2011-1-PA-09-65' in data\pbgc\2010\EVAL-2011-1-PA-09-65\report.json, data\pbgc\2011\EVAL-2011-1-PA-09-65\report.json
Duplicate report_id '-Semiannual-Report-to-Congress-' in data\sba\2008\-Semiannual-Report-to-Congress-\report.json, data\sba\2010\-Semiannual-Report-to-Congress-\report.json
Duplicate report_id '283fin' in data\sec\1998\283fin\report.json, data\sec\1999\283fin\report.json
Duplicate report_id '139264' in data\state\2010\139264\report.json, data\state\2011\139264\report.json
Duplicate report_id '145259' in data\state\2010\145259\report.json, data\state\2012\145259\report.json
Duplicate report_id '145823' in data\state\2010\145823\report.json, data\state\2012\145823\report.json
Duplicate report_id 'IGATI' in data\treasury\2006\IGATI\report.json, data\treasury\2007\IGATI\report.json

konklone commented 10 years ago

Nice!! So wow, 17 of the scrapers produce non-unique IDs. And at least some are being scraped twice under different years? This points to some sort of bug:

Duplicate report_id 'Semiannual Report to Congress - October 1, 2010 to March 31, 2011' in data\exim\2010\Semiannual Report to Congress - October 1, 2010 to March 31, 2011\report.json, data\exim\2011\Semiannual Report to Congress - October 1, 2010 to March 31, 2011\report.json
Duplicate report_id 'Semiannual Report to Congress - October 1, 2011 to March 31, 2012' in data\exim\2011\Semiannual Report to Congress - October 1, 2011 to March 31, 2012\report.json, data\exim\2012\Semiannual Report to Congress - October 1, 2011 to March 31, 2012\report.json

Seems like the actionables here are:

divergentdave commented 10 years ago

FYI I had trouble getting things to work with my scripts in a subdirectory. Python doesn't like directory traversal except for certain cases inside child modules.

konklone commented 10 years ago

:( I've had the exact same issue. I assumed it was me being unfamiliar with Python import mechanics (I came to Python late in life). There's got to be a way to work around it...
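One common workaround, sketched here under the assumption that the script sits in a subdirectory one level below the repository root, is to put the parent directory on `sys.path` before importing. The helper name and the file name in the comments are hypothetical.

```python
import os
import sys


def add_parent_to_path(script_file):
    """Put the directory above the script's own directory at the front of sys.path."""
    parent = os.path.dirname(os.path.dirname(os.path.abspath(script_file)))
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# In a hypothetical scripts/check_ids.py, one would call this first:
#   add_parent_to_path(__file__)
# and only then import modules that live at the repository root.
```

Another option that may avoid path manipulation is to run the script as a module from the repository root with `python -m`, which requires the subdirectory to be importable as a package.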

konklone commented 10 years ago

Moving to #160.