unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Add EEOC #107

spulec closed this 10 years ago

spulec commented 10 years ago

Add the OIG for the U.S. Equal Employment Opportunity Commission.

audiodude commented 10 years ago

I meant to run this with --dry_run but I forgot, and I'm glad I did. When run without it, I get the following:

$ python inspectors/eeoc.py
[report][2014-03-31][oig-3-2014]
    report: eeoc/2014/oig-3-2014/report.cfm
Unknown file type, don't know how to extract metadata!
Unknown file type, don't know how to extract text!
    text: None
    data: eeoc/2014/oig-3-2014/report.json

The problem seems pretty straightforward: the inspector utils don't know how to deal with .cfm files. We need to somehow let the utils know that it's an HTML file; the URL just has the wrong extension.

divergentdave commented 10 years ago

@audiodude, fixed the HTML heuristic in 1fa8f5d

konklone commented 10 years ago

Thanks @divergentdave, that's the right call. The scrapers should "manually" set the file_type field when it can't be auto-detected correctly. Very nice catch, @audiodude.
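To make the idea concrete, here is a minimal sketch of an explicit `file_type` taking precedence over a guess from the URL extension. It is not the project's actual utils code: `resolve_file_type`, `KNOWN_TYPES`, the `url` key, and the example URL are all hypothetical names chosen for illustration.

```python
import os
from urllib.parse import urlparse

# Hypothetical set of extensions a text extractor might understand.
KNOWN_TYPES = {"pdf", "htm", "html", "doc", "docx", "txt"}


def resolve_file_type(report):
    """Return the scraper-supplied file_type if present, else guess from the URL.

    Returns None when the extension is not one we recognize (e.g. ".cfm").
    """
    explicit = report.get("file_type")
    if explicit:
        return explicit.lower()

    # Use only the path, so query strings don't confuse the extension check.
    path = urlparse(report["url"]).path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext if ext in KNOWN_TYPES else None


# A .cfm URL cannot be auto-detected, so the scraper states the type itself.
undetected = resolve_file_type({"url": "https://example.gov/oig/report.cfm"})
overridden = resolve_file_type(
    {"url": "https://example.gov/oig/report.cfm", "file_type": "html"}
)
```

With this shape, a scraper only needs to set `file_type` for the odd cases, and everything else keeps relying on the extension.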

In fact, it seems to me the validator should choke when it auto-detects an unknown file type, like .cfm, and the scraper should be forced to supply a file_type. I opened up #109 for that.
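Roughly what I have in mind for the validator, as a sketch only: `validate_file_type`, `UnknownFileTypeError`, `RECOGNIZED`, and the `url` key are hypothetical names, not what #109 will necessarily end up using.

```python
import os
from urllib.parse import urlparse

# Hypothetical list of file types the extraction step can handle.
RECOGNIZED = {"pdf", "htm", "html", "doc", "docx", "txt"}


class UnknownFileTypeError(ValueError):
    """Raised when a report's file type is neither supplied nor detectable."""


def validate_file_type(report):
    """Return a usable file type for the report, or raise if there is none."""
    file_type = report.get("file_type")
    if file_type is None:
        # Fall back to the extension of the URL path.
        path = urlparse(report["url"]).path
        file_type = os.path.splitext(path)[1].lstrip(".")

    file_type = file_type.lower()
    if file_type not in RECOGNIZED:
        raise UnknownFileTypeError(
            "Unrecognized file type %r for %s; the scraper must set file_type"
            % (file_type, report["url"])
        )
    return file_type
```

Failing loudly here means a scraper author sees the problem on the first run instead of ending up with a report that has no extracted text.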

konklone commented 10 years ago

Annoyingly, a whole bunch of the old semiannual reports are 404s. I'll report that to the EEOC OIG.

konklone commented 10 years ago

Anyway, this works great. I'm not letting #109 hold this up, as having report.cfm vs report.html on disk won't actually affect downstream clients, who only use the extracted report.txt version, and text extraction works now after @divergentdave's fix.

Thanks @spulec for kicking this off, and to @audiodude and @divergentdave for helping out!

konklone commented 10 years ago

I forgot that the EEOC is the agency with the outstanding badge I found in May for my Sunlight blog post. It deserves a posting of its own:

[image: sealbig]